Thread: Support for N synchronous standby servers - take 2

Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:

There was a discussion on support for N synchronous standby servers, started by Michael; refer to http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com. The use of hooks and a dedicated language was suggested; however, it seemed to be overkill for the scenario and there was no consensus on it. Exploring GUC-land was preferred.

Please find attached a patch, built on Michael's patch from the above-mentioned thread, which supports choosing a different number of nodes from each set, i.e. k nodes from set 1, l nodes from set 2, and so on.
The format of synchronous_standby_names has been updated to the standby name followed by the required count, separated by a hyphen. Ex: 'aa-1, bb-3'. The transaction waits for the specified number of standbys in each group. Any extra nodes with the same name are considered potential. The special entry * for the standby name is also supported.
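
For illustration, here is a minimal standalone sketch (not the patch itself) of
how such a list could be split into (name, count) pairs. The group names are
hypothetical, and the default count of 1 when no hyphen is present is an
assumption of this sketch, not necessarily what the patch does:

/*
 * Hypothetical sketch: split 'aa-1, bb-3' into (name, count) pairs.
 * The last '-' is taken as the name/count separator; a missing count
 * defaults to 1 (an assumption for this sketch).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    char    names[] = "aa-1, bb-3";

    for (char *tok = strtok(names, ","); tok != NULL; tok = strtok(NULL, ","))
    {
        while (*tok == ' ')
            tok++;                          /* trim leading spaces */

        char   *hyphen = strrchr(tok, '-'); /* last '-' splits name/count */
        int     count = 1;

        if (hyphen != NULL)
        {
            count = atoi(hyphen + 1);
            *hyphen = '\0';
        }
        printf("group \"%s\": wait for %d standby(s)\n", tok, count);
    }
    return 0;
}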

Thanks,

Beena Emerson

Attachment

Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Fri, May 15, 2015 at 8:55 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> There was a discussion on support for N synchronous standby servers started
> by Michael. Refer
> http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com
> . The use of hooks and dedicated language was suggested, however, it seemed
> to be an overkill for the scenario and there was no consensus on this.
> Exploring GUC-land was preferred.

Cool.

> Please find attached a patch,  built on Michael's patch from above mentioned
> thread, which supports choosing different number of nodes from each set i.e.
> k nodes from set 1, l nodes from set 2, so on.
> The format of synchronous_standby_names has been updated to standby name
> followed by the required count separated by hyphen. Ex: 'aa-1, bb-3'.  The
> transaction waits for all the specified number of standby in each group. Any
> extra nodes with the same name will be considered potential. The special
> entry * for the standby name is also supported.

I don't think that this is going in a good direction; what was
suggested, mainly by Robert, was to use a micro-language that would
allow far more extensibility than what you are proposing. See for
example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w@mail.gmail.com
for some ideas. IMO, before writing any patch in this area we should
find a clear consensus on what we want to do. Also, unrelated to this
patch, we should really first get the patch implementing the... Hum...
infrastructure for regression tests covering replication and
archiving, to be able to have actual tests for this feature (working on
it for the next CF).

+        if (!SplitIdentifierString(standby_detail, '-', &elemlist2))
+        {
+            /* syntax error in list */
+            pfree(rawstring);
+            list_free(elemlist1);
+            return 0;
+        }
At a quick glance, this looks problematic to me if application_name contains a hyphen.

Regards,
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Sawada Masahiko
Date:
On Fri, May 15, 2015 at 9:18 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, May 15, 2015 at 8:55 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>> There was a discussion on support for N synchronous standby servers started
>> by Michael. Refer
>> http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com
>> . The use of hooks and dedicated language was suggested, however, it seemed
>> to be an overkill for the scenario and there was no consensus on this.
>> Exploring GUC-land was preferred.
>
> Cool.
>
>> Please find attached a patch,  built on Michael's patch from above mentioned
>> thread, which supports choosing different number of nodes from each set i.e.
>> k nodes from set 1, l nodes from set 2, so on.
>> The format of synchronous_standby_names has been updated to standby name
>> followed by the required count separated by hyphen. Ex: 'aa-1, bb-3'.  The
>> transaction waits for all the specified number of standby in each group. Any
>> extra nodes with the same name will be considered potential. The special
>> entry * for the standby name is also supported.
>
> I don't think that this is going in the good direction, what was
> suggested mainly by Robert was to use a micro-language that would
> allow far more extensibility that what you are proposing. See for
> example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w@mail.gmail.com
> for some ideas. IMO, before writing any patch in this area we should
> find a clear consensus on what we want to do. Also, unrelated to this
> patch, we should really get first the patch implementing the... Hum...
> infrastructure for regression tests regarding replication and
> archiving to be able to have actual tests for this feature (working on
> it for next CF).

The dedicated language for multiple sync replication would be more
extensible, as you said, but I think there are not a lot of users who
want to, or should, use this.
IMHO such a dedicated, extensible feature could be an extension
module, e.g. in contrib, and we could implement a simpler feature in
PostgreSQL core with some restrictions.

Regards,

-------
Sawada Masahiko



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Sat, May 16, 2015 at 5:58 PM, Sawada Masahiko wrote:
> The dedicated language for multiple sync replication would be more
> extensibility as you said, but I think there are not a lot of user who
> want to or should use this.
> IMHO such a dedicated extensible feature could be extension module,
> i.g. contrib. And we could implement more simpler feature into
> PostgreSQL core with some restriction.

As proposed, this feature does not really bring us closer to quorum
commit, and AFAIK that is what we are more or less aiming at, recalling
previous discussions. Particularly, with the syntax proposed above it
is not possible to express OR conditions on subgroups of nodes; the
list of nodes forcibly uses AND because it is necessary to wait for
all the subgroups. Also, users may want to track nodes from the same
group with different values of application_name.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Hello,

> I don't think that this is going in the good direction, what was 
> suggested mainly by Robert was to use a micro-language that would 
> allow far more extensibility that what you are proposing. 

I agree, the micro-language would give far more extensibility. However, as
stated before, the previous discussions concluded that a GUC was the
preferred way because it is more user-friendly.

> See for 
> example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w@mail.gmail.com
> for some ideas. IMO, before writing any patch in this area we should 
> find a clear consensus on what we want to do. Also, unrelated to this 
> patch, we should really get first the patch implementing the... Hum... 
> infrastructure for regression tests regarding replication and 
> archiving to be able to have actual tests for this feature (working on 
> it for next CF). 

We could decide on and work on the patch for N-sync along with setting up
the regression test infrastructure.

> At quick glance, this looks problematic to me if application_name has an
> hyphen. 

Yes, I overlooked the fact that the application name could contain a
hyphen. This can be modified.

Regards,

Beena Emerson






Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
> As proposed, this feature does not bring us really closer to quorum 
> commit, and AFAIK this is what we are more or less aiming at recalling 
> previous discussions. Particularly with the syntax proposed above, it 
> is not possible to do some OR conditions on subgroups of nodes, the 
> list of nodes is forcibly using AND because it is necessary to wait 
> for all the subgroups. Also, users may want to track nodes from the
> same group with different application_name.

The patch assumes that all standbys in a group share a name, so the "OR"
condition would be taken care of that way.
Also, since the uniqueness of standby names cannot be enforced, the same
name could be repeated across groups!

Regards, 

Beena








Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Mon, May 18, 2015 at 8:42 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello,
>
>> I don't think that this is going in the good direction, what was
>> suggested mainly by Robert was to use a micro-language that would
>> allow far more extensibility that what you are proposing.
>
> I agree, the micro-language would give far more extensibility. However, as
> stated before, the previous discussions concluded that GUC was a preferred
> way because it is more user-friendly.

Er, I am not sure I follow here. The idea proposed was to define a
string formatted with some infra-language within the existing GUC
s_s_names.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Hello,

> Er, I am not sure I follow here. The idea proposed was to define a 
> string formatted with some infra-language within the existing GUC 
> s_s_names.

I am sorry, I misunderstood. I thought the "language" approach meant the
use of hooks and a module.
As you mentioned, the first step would be to reach a consensus on the
method.

If I understand correctly, s_s_names should be able to define:
- a count of sync standbys from a given group of names, e.g. 2 from A,B,C.
- an AND condition: multiple groups and counts can be defined, e.g. 1 from
X,Y AND 2 from A,B,C.

In this case, we can give the same priority to all the names specified in a
group. The standby names cannot be repeated across groups.

Robert had also talked about slightly more complex scenarios, such as
choosing either A or both B and C.
Additionally, a preference for a standby could also be specified, e.g.
among A, B and C, A can have a higher priority and would be selected if a
standby with name A is connected.
This can make the language very complicated.

Should all these scenarios be covered in the N-sync selection, or can we
start with the basic two and then update later?


Thanks & Regards,

Beena Emerson 







Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Mon, May 18, 2015 at 9:40 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> Er, I am not sure I follow here. The idea proposed was to define a
>> string formatted with some infra-language within the existing GUC
>> s_s_names.
>
> I am sorry, I misunderstood. I thought the  "language" approach meant use of
> hooks and module.
> As you mentioned the first step would be to reach the consensus on the
> method.
>
> If I understand correctly, s_s_names should be able to define:
> - a count of sync rep from a given group of names ex : 2 from A,B,C.
> - AND condition: Multiple groups and count can be defined. Ex: 1 from X,Y
> AND 2 from A,B,C.
>
> In this case, we can give the same priority to all the names specified in a
> group. The standby_names cannot be repeated across groups.
>
> Robert had also talked about a little more complex scenarios of choosing
> either A or both B and C.
> Additionally, preference for a standby could also be specified. Ex: among
> A,B and C, A can have higher priority and would be selected if an standby
> with name A is connected.
> This can make the language very complicated.
>
> Should all these scenarios be covered in the n-sync selection or can we
> start with the basic 2 and then update later?

If it were me, I'd just go implement a scanner using flex and a parser
using bison and use that to parse the format I suggested before, or
some similar one.  This may sound hard, but it's really not: I put
together the patch that became commit
878fdcb843e087cc1cdeadc987d6ef55202ddd04 in just a few hours.  I don't
see why this would be particularly harder.  Then instead of arguing
about whether some stop-gap implementation is good enough until we do
the real thing, we can just have the real thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Fri, May 15, 2015 at 9:18 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, May 15, 2015 at 8:55 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>> There was a discussion on support for N synchronous standby servers started
>> by Michael. Refer
>> http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com
>> . The use of hooks and dedicated language was suggested, however, it seemed
>> to be an overkill for the scenario and there was no consensus on this.
>> Exploring GUC-land was preferred.
>
> Cool.
>
>> Please find attached a patch,  built on Michael's patch from above mentioned
>> thread, which supports choosing different number of nodes from each set i.e.
>> k nodes from set 1, l nodes from set 2, so on.
>> The format of synchronous_standby_names has been updated to standby name
>> followed by the required count separated by hyphen. Ex: 'aa-1, bb-3'.  The
>> transaction waits for all the specified number of standby in each group. Any
>> extra nodes with the same name will be considered potential. The special
>> entry * for the standby name is also supported.
>
> I don't think that this is going in the good direction, what was
> suggested mainly by Robert was to use a micro-language that would
> allow far more extensibility that what you are proposing. See for
> example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w@mail.gmail.com
> for some ideas.

Doesn't this approach prevent us from specifying a "potential" synchronous
standby server? For example, imagine the case where you want to treat
the server AAA as synchronous standby. You also want to use the server BBB
as synchronous standby only if the server AAA goes down. IOW, you want to
prefer the server AAA as synchronous standby over BBB.
Currently we can easily set up that case by just setting
synchronous_standby_names as follows.
   synchronous_standby_names = 'AAA, BBB'

However, after we adopt the quorum commit feature with the proposed
micro-language, how can we set up that case? It seems impossible...
I'm afraid that this might be a backward-compatibility issue.

Or should we extend the proposed micro-language so that it can also handle
the priority of each standby server? Not sure that's possible, though.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Jun 24, 2015 at 11:30 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, May 15, 2015 at 9:18 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Fri, May 15, 2015 at 8:55 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>>> There was a discussion on support for N synchronous standby servers started
>>> by Michael. Refer
>>> http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com
>>> . The use of hooks and dedicated language was suggested, however, it seemed
>>> to be an overkill for the scenario and there was no consensus on this.
>>> Exploring GUC-land was preferred.
>>
>> Cool.
>>
>>> Please find attached a patch,  built on Michael's patch from above mentioned
>>> thread, which supports choosing different number of nodes from each set i.e.
>>> k nodes from set 1, l nodes from set 2, so on.
>>> The format of synchronous_standby_names has been updated to standby name
>>> followed by the required count separated by hyphen. Ex: 'aa-1, bb-3'.  The
>>> transaction waits for all the specified number of standby in each group. Any
>>> extra nodes with the same name will be considered potential. The special
>>> entry * for the standby name is also supported.
>>
>> I don't think that this is going in the good direction, what was
>> suggested mainly by Robert was to use a micro-language that would
>> allow far more extensibility that what you are proposing. See for
>> example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w@mail.gmail.com
>> for some ideas.
>
> Doesn't this approach prevent us from specifying the "potential" synchronous
> standby server? For example, imagine the case where you want to treat
> the server AAA as synchronous standby. You also want to use the server BBB
> as synchronous standby only if the server AAA goes down. IOW, you want to
> prefer to the server AAA as synchronous standby rather than BBB.
> Currently we can easily set up that case by just setting
> synchronous_standby_names as follows.
>
>     synchronous_standby_names = 'AAA, BBB'
>
> However, after we adopt the quorum commit feature with the proposed
> macro-language, how can we set up that case? It seems impossible...
> I'm afraid that this might be a backward compatibility issue.

Like that:
synchronous_standby_names = 'AAA, BBB'
The thing is that we need to support the old grammar as well to be
fully backward compatible, and that's actually equivalent to that in
the grammar: 1(AAA,BBB,CCC). This is something I understood was
included in Robert's draft proposal.

> Or we should extend the proposed micro-language so that it also can handle
> the priority of each standby servers? Not sure that's possible, though.

I am not sure that's really necessary; we only need to be able to
manage priorities within each subgroup. Putting it in a shape that
users can understand easily in pg_stat_replication looks more
challenging though. We are going to need a new view, say
pg_stat_replication_group, that shows the priority status of each
group, with one record for each group, taking into account that a
group can be included in another one.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Thu, Jun 25, 2015 at 12:57 PM, Fujii Masao wrote:
> On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier wrote:
>> and that's actually equivalent to that in
>> the grammar: 1(AAA,BBB,CCC).
>
> I don't think that they are the same. In the case of 1(AAA,BBB,CCC), while
> two servers AAA and BBB are running, the master server may return a success
> of the transaction to the client just after it receives the ACK from BBB.
> OTOH, in the case of AAA,BBB, that never happens. The master must wait for
> the ACK from AAA to arrive before completing the transaction. And then,
> if AAA goes down, BBB should become synchronous standby.

Ah. Right. I missed your point, that's a bad day... We could have
multiple separators to define group types then:
- "()" where the order of acknowledgement does not matter
- "[]" where it does not.
You would find the old grammar with:
1[AAA,BBB,CCC]
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Jun 24, 2015 at 11:30 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Fri, May 15, 2015 at 9:18 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Fri, May 15, 2015 at 8:55 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>>>> There was a discussion on support for N synchronous standby servers started
>>>> by Michael. Refer
>>>> http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com
>>>> . The use of hooks and dedicated language was suggested, however, it seemed
>>>> to be an overkill for the scenario and there was no consensus on this.
>>>> Exploring GUC-land was preferred.
>>>
>>> Cool.
>>>
>>>> Please find attached a patch,  built on Michael's patch from above mentioned
>>>> thread, which supports choosing different number of nodes from each set i.e.
>>>> k nodes from set 1, l nodes from set 2, so on.
>>>> The format of synchronous_standby_names has been updated to standby name
>>>> followed by the required count separated by hyphen. Ex: 'aa-1, bb-3'.  The
>>>> transaction waits for all the specified number of standby in each group. Any
>>>> extra nodes with the same name will be considered potential. The special
>>>> entry * for the standby name is also supported.
>>>
>>> I don't think that this is going in the good direction, what was
>>> suggested mainly by Robert was to use a micro-language that would
>>> allow far more extensibility that what you are proposing. See for
>>> example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w@mail.gmail.com
>>> for some ideas.
>>
>> Doesn't this approach prevent us from specifying the "potential" synchronous
>> standby server? For example, imagine the case where you want to treat
>> the server AAA as synchronous standby. You also want to use the server BBB
>> as synchronous standby only if the server AAA goes down. IOW, you want to
>> prefer to the server AAA as synchronous standby rather than BBB.
>> Currently we can easily set up that case by just setting
>> synchronous_standby_names as follows.
>>
>>     synchronous_standby_names = 'AAA, BBB'
>>
>> However, after we adopt the quorum commit feature with the proposed
>> macro-language, how can we set up that case? It seems impossible...
>> I'm afraid that this might be a backward compatibility issue.
>
> Like that:
> synchronous_standby_names = 'AAA, BBB'
> The thing is that we need to support the old grammar as well to be
> fully backward compatible,

Yep, that's an idea. Supporting two different grammars is a bit messy, though...
If we merge the "priority" concept into the quorum commit,
that would be better. But for now I have no idea how we can do that.

> and that's actually equivalent to that in
> the grammar: 1(AAA,BBB,CCC).

I don't think that they are the same. In the case of 1(AAA,BBB,CCC), while
two servers AAA and BBB are running, the master server may return a success
of the transaction to the client just after it receives the ACK from BBB.
OTOH, in the case of AAA,BBB, that never happens. The master must wait for
the ACK from AAA to arrive before completing the transaction. And then,
if AAA goes down, BBB should become synchronous standby.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 25 June 2015 at 05:01, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Jun 25, 2015 at 12:57 PM, Fujii Masao wrote:
>> On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier wrote:
>>> and that's actually equivalent to that in
>>> the grammar: 1(AAA,BBB,CCC).
>>
>> I don't think that they are the same. In the case of 1(AAA,BBB,CCC), while
>> two servers AAA and BBB are running, the master server may return a success
>> of the transaction to the client just after it receives the ACK from BBB.
>> OTOH, in the case of AAA,BBB, that never happens. The master must wait for
>> the ACK from AAA to arrive before completing the transaction. And then,
>> if AAA goes down, BBB should become synchronous standby.
>
> Ah. Right. I missed your point, that's a bad day... We could have
> multiple separators to define group types then:
> - "()" where the order of acknowledgement does not matter
> - "[]" where it does not.
> You would find the old grammar with:
> 1[AAA,BBB,CCC]

Let's start with a complex, fully described use case then work out how to specify what we want.

I'm nervous of "it would be good ifs" because we do a ton of work only to find a design flaw.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Sawada Masahiko
Date:
On Thu, Jun 25, 2015 at 7:32 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 25 June 2015 at 05:01, Michael Paquier <michael.paquier@gmail.com> wrote:
>>
>> On Thu, Jun 25, 2015 at 12:57 PM, Fujii Masao wrote:
>> > On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier wrote:
>> >> and that's actually equivalent to that in
>> >> the grammar: 1(AAA,BBB,CCC).
>> >
>> > I don't think that they are the same. In the case of 1(AAA,BBB,CCC),
>> > while
>> > two servers AAA and BBB are running, the master server may return a
>> > success
>> > of the transaction to the client just after it receives the ACK from
>> > BBB.
>> > OTOH, in the case of AAA,BBB, that never happens. The master must wait
>> > for
>> > the ACK from AAA to arrive before completing the transaction. And then,
>> > if AAA goes down, BBB should become synchronous standby.
>>
>> Ah. Right. I missed your point, that's a bad day... We could have
>> multiple separators to define group types then:
>> - "()" where the order of acknowledgement does not matter
>> - "[]" where it does not.
>> You would find the old grammar with:
>> 1[AAA,BBB,CCC]
>
> Let's start with a complex, fully described use case then work out how to
> specify what we want.
>
> I'm nervous of "it would be good ifs" because we do a ton of work only to
> find a design flaw.
>

I'm not sure about the specific implementation yet, but I came up with a
solution for this case.

For example,
- s_s_name = '1(a, b), c, d'
The priority of both 'a' and 'b' is 1, 'c' is 2, and 'd' is 3.
i.e., 'b' and 'c' are potential sync nodes, and quorum commit is
enabled only between 'a' and 'b'.

- s_s_name = 'a, 1(b,c), d'
The priority of 'a' is 1, 'b' and 'c' are 2, and 'd' is 3.
So quorum commit with 'b' and 'c' will be enabled after 'a' goes down.

With this idea, I think that we could keep the conventional syntax as in
the past. Thoughts?
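
To make the mapping concrete, here is a rough standalone sketch of that
priority assignment. It is deliberately naive (no quoting, no nested groups),
and the setting it parses is just the hypothetical example above:

#include <stdio.h>

int
main(void)
{
    const char *s = "a, 1(b,c), d"; /* hypothetical setting */
    int         slot = 0;           /* top-level position = priority */
    int         depth = 0;
    char        name[64];
    int         len = 0;

    for (const char *p = s;; p++)
    {
        char    c = *p;

        if (c == ' ')
            continue;
        if (c == '(')
        {
            depth = 1;
            len = 0;                /* discard the quorum digits */
            slot++;                 /* the group occupies one slot */
        }
        else if (c == ')' || c == ',' || c == '\0')
        {
            if (len > 0)
            {
                name[len] = '\0';
                if (depth == 0)
                    slot++;         /* standalone element */
                printf("'%s' has priority %d\n", name, slot);
                len = 0;
            }
            if (c == ')')
                depth = 0;
            if (c == '\0')
                break;
        }
        else
            name[len++] = c;
    }
    return 0;
}

Running it prints priority 1 for 'a', 2 for 'b' and 'c', and 3 for 'd',
matching the description above.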

Regards,

--
Sawada Masahiko



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs  wrote:
> Let's start with a complex, fully described use case then work out how to
> specify what we want.

Well, one of the simplest cases where quorum commit and this
feature would be useful is, with 2 data centers:
- on center 1, master A and standby B
- on center 2, standby C and standby D
With the current synchronous_standby_names, what we can do now is
ensure that one node has acknowledged the commit of the master, for
example with synchronous_standby_names = 'B,C,D'. But you know that :)
What this feature would allow us to do is, for example, ensure
that a node on data center 2 has acknowledged the commit of the
master, meaning that even if data center 1 is completely lost for one
reason or another, we have at least one node on center 2 that has lost
no data at transaction commit.

Now, regarding the way to express that, we need to use a concept of
node group for each element of synchronous_standby_names. A group
contains a set of elements, each element being a group or a single
node. And for each group we need to know three things when a commit
needs to be acknowledged:
- Does my group need to acknowledge the commit?
- If yes, how many elements in my group need to acknowledge it?
- Does the order of my elements matter?

That's where the micro-language idea makes sense. For example,
we can define a group using separators like (elt1,...,eltN) or
[elt1,elt2,...,eltN]. Prepending a number to a group is essential
as well for quorum commits. Hence for example, assuming that '()' is
used for a group whose element order does not matter:
- k(elt1,elt2,...,eltN) means that we need k elements in the set
to return true (aka commit confirmation).
- k[elt1,elt2,...,eltN] means that we need the first k elements in the
set to return true.

When k is not defined for a group, k = 1. Using only elements
separated by commas for the top-level group means that we wait for the
first element in the set (for backward compatibility), hence:
1(elt1,elt2,eltN) <=> elt1,elt2,eltN

We could as well mix both behaviors, aka define for a group to
wait for the first k elements and a total of j elements in
the whole set, but I don't think that we need to go that far. I
suspect that in most cases users will be satisfied with the case
where there is a group of data centers, and they want to be sure that
one or two nodes in each center have acknowledged a commit of the master
(performance is not the matter here if centers are not close). Hence
in the case above, you could get the behavior wanted with this
definition:
2(B,(C,D))
With more data centers, like 3 (wait for two nodes in the 3rd set):
3(B,(C,D),2(E,F,G))
Users could define more levels of groups, like this:
2(A,(B,(C,D)))
But that's actually something few people would do in real cases.
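
As a sanity check of these semantics, here is a minimal standalone sketch
(not a patch) of how a parsed group tree could be evaluated at commit time.
The struct layout and the ack flags are hypothetical, and a real
implementation would also need to skip standbys that are not connected:

#include <stdbool.h>
#include <stdio.h>

typedef struct SyncElem
{
    const char        *name;      /* non-NULL for a single standby */
    bool               acked;     /* standby has acknowledged the commit */
    bool               ordered;   /* true for k[...], false for k(...) */
    int                quorum;    /* k; 1 when not specified */
    int                nchildren;
    struct SyncElem  **children;  /* non-NULL for a group */
} SyncElem;

/* Returns true once the element satisfies its commit condition. */
static bool
elem_satisfied(const SyncElem *e)
{
    if (e->children == NULL)
        return e->acked;

    if (e->ordered)
    {
        /* k[...]: the first k listed elements must have acknowledged */
        for (int i = 0; i < e->quorum && i < e->nchildren; i++)
            if (!elem_satisfied(e->children[i]))
                return false;
        return true;
    }

    /* k(...): any k elements are enough */
    int     n = 0;

    for (int i = 0; i < e->nchildren; i++)
        if (elem_satisfied(e->children[i]))
            n++;
    return n >= e->quorum;
}

int
main(void)
{
    /* 2(B,(C,D)) with B and D having acknowledged */
    SyncElem    B = {"B", true};
    SyncElem    C = {"C", false};
    SyncElem    D = {"D", true};
    SyncElem   *cd_kids[] = {&C, &D};
    SyncElem    CD = {NULL, false, false, 1, 2, cd_kids};
    SyncElem   *top_kids[] = {&B, &CD};
    SyncElem    top = {NULL, false, false, 2, 2, top_kids};

    printf("commit acknowledged: %s\n",
           elem_satisfied(&top) ? "yes" : "no");
    return 0;
}

With B and D having acknowledged, elem_satisfied() on the tree for
2(B,(C,D)) returns true, since B counts for one element and the subgroup
(C,D) is satisfied by D alone.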

> I'm nervous of "it would be good ifs" because we do a ton of work only to
> find a design flaw.

That makes sense. Let's continue arguing about it then.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Amit Langote
Date:
Hi,

On 2015-06-26 AM 12:49, Sawada Masahiko wrote:
> On Thu, Jun 25, 2015 at 7:32 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> Let's start with a complex, fully described use case then work out how to
>> specify what we want.
>>
>> I'm nervous of "it would be good ifs" because we do a ton of work only to
>> find a design flaw.
>>
> 
> I'm not sure specific implementation yet, but I came up with solution
> for this case.
> 
> For example,
> - s_s_name = '1(a, b), c, d'
> The priority of both 'a' and 'b' are 1, and 'c' is 2, 'd' is 3.
> i.g, 'b' and 'c' are potential sync node, and the quorum commit is
> enable only between 'a' and 'b'.
> 
> - s_s_name = 'a, 1(b,c), d'
> priority of 'a' is 1, 'b' and 'c' are 2, 'd' is 3.
> So the quorum commit with 'b' and 'c' will be enabled after 'a' down.
> 

Do we really need to add a number like '1' in '1(a, b), c, d'?

The order of writing names already implies priorities like 2 & 3 for c & d,
respectively, like in your example. Having to write '1' for the group '(a, b)'
seems unnecessary, IMHO. Sorry if I have missed any previous discussion where
its necessity was discussed.

So, the order of writing standby names in the list should declare their
relative priorities and parentheses (possibly nested) should help inform about
the grouping (for quorum?)

Thanks,
Amit




Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Fri, Jun 26, 2015 at 2:59 PM, Amit Langote wrote:
> Do we really need to add a number like '1' in '1(a, b), c, d'?
> The order of writing names already implies priorities like 2 & 3 for c & d,
> respectively, like in your example. Having to write '1' for the group '(a, b)'
> seems unnecessary, IMHO. Sorry if I have missed any previous discussion where
> its necessity was discussed.

'1' is implied if no number is specified. That's the idea as written
here, not something decided of course :)

> So, the order of writing standby names in the list should declare their
> relative priorities and parentheses (possibly nested) should help inform about
> the grouping (for quorum?)

Yes.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Amit Langote
Date:
On 2015-06-26 PM 02:59, Amit Langote wrote:
> On 2015-06-26 AM 12:49, Sawada Masahiko wrote:
>>
>> For example,
>> - s_s_name = '1(a, b), c, d'
>> The priority of both 'a' and 'b' are 1, and 'c' is 2, 'd' is 3.
>> i.g, 'b' and 'c' are potential sync node, and the quorum commit is
>> enable only between 'a' and 'b'.
>>
>> - s_s_name = 'a, 1(b,c), d'
>> priority of 'a' is 1, 'b' and 'c' are 2, 'd' is 3.
>> So the quorum commit with 'b' and 'c' will be enabled after 'a' down.
>>
> 
> Do we really need to add a number like '1' in '1(a, b), c, d'?
> 
> The order of writing names already implies priorities like 2 & 3 for c & d,
> respectively, like in your example. Having to write '1' for the group '(a, b)'
> seems unnecessary, IMHO. Sorry if I have missed any previous discussion where
> its necessity was discussed.
> 
> So, the order of writing standby names in the list should declare their
> relative priorities and parentheses (possibly nested) should help inform about
> the grouping (for quorum?)
> 

Oh, I missed Michael's latest message that describes its necessity. So, the
number is essentially the quorum for a group.

Sorry about the noise.

Thanks,
Amit




Re: Support for N synchronous standby servers - take 2

From
Amit Langote
Date:
Hi,

On 2015-06-25 PM 01:01, Michael Paquier wrote:
> On Thu, Jun 25, 2015 at 12:57 PM, Fujii Masao wrote:
>> On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier wrote:
>>> and that's actually equivalent to that in
>>> the grammar: 1(AAA,BBB,CCC).
>>
>> I don't think that they are the same. In the case of 1(AAA,BBB,CCC), while
>> two servers AAA and BBB are running, the master server may return a success
>> of the transaction to the client just after it receives the ACK from BBB.
>> OTOH, in the case of AAA,BBB, that never happens. The master must wait for
>> the ACK from AAA to arrive before completing the transaction. And then,
>> if AAA goes down, BBB should become synchronous standby.
> 
> Ah. Right. I missed your point, that's a bad day... We could have
> multiple separators to define group types then:
> - "()" where the order of acknowledgement does not matter
> - "[]" where it does not.

For '[]', I guess you meant "where it does."

> You would find the old grammar with:
> 1[AAA,BBB,CCC]
> 

Thanks,
Amit




Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Fri, Jun 26, 2015 at 5:04 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> Hi,
>
> On 2015-06-25 PM 01:01, Michael Paquier wrote:
>> On Thu, Jun 25, 2015 at 12:57 PM, Fujii Masao wrote:
>>> On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier wrote:
>>>> and that's actually equivalent to that in
>>>> the grammar: 1(AAA,BBB,CCC).
>>>
>>> I don't think that they are the same. In the case of 1(AAA,BBB,CCC), while
>>> two servers AAA and BBB are running, the master server may return a success
>>> of the transaction to the client just after it receives the ACK from BBB.
>>> OTOH, in the case of AAA,BBB, that never happens. The master must wait for
>>> the ACK from AAA to arrive before completing the transaction. And then,
>>> if AAA goes down, BBB should become synchronous standby.
>>
>> Ah. Right. I missed your point, that's a bad day... We could have
>> multiple separators to define group types then:
>> - "()" where the order of acknowledgement does not matter
>> - "[]" where it does not.
>
> For '[]', I guess you meant "where it does."

Yes, thanks :p
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Fri, Jun 26, 2015 at 1:46 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs  wrote:
>> Let's start with a complex, fully described use case then work out how to
>> specify what we want.
>
> Well, one of the most simple cases where quorum commit and this
> feature would be useful for is that, with 2 data centers:
> - on center 1, master A and standby B
> - on center 2, standby C and standby D
> With the current synchronous_standby_names, what we can do now is
> ensuring that one node has acknowledged the commit of master. For
> example synchronous_standby_names = 'B,C,D'. But you know that :)
> What this feature would allow use to do is for example being able to
> ensure that a node on the data center 2 has acknowledged the commit of
> master, meaning that even if data center 1 completely lost for a
> reason or another we have at least one node on center 2 that has lost
> no data at transaction commit.
>
> Now, regarding the way to express that, we need to use a concept of
> node group for each element of synchronous_standby_names. A group
> contains a set of elements, each element being a group or a single
> node. And for each group we need to know three things when a commit
> needs to be acknowledged:
> - Does my group need to acknowledge the commit?
> - If yes, how many elements in my group need to acknowledge it?
> - Does the order of my elements matter?
>
> That's where the micro-language idea makes sense to use. For example,
> we can define a group using separators and like (elt1,...eltN) or
> [elt1,elt2,eltN]. Appending a number in front of a group is essential
> as well for quorum commits. Hence for example, assuming that '()' is
> used for a group whose element order does not matter, if we use that:
> - k(elt1,elt2,eltN) means that we need for the k elements in the set
> to return true (aka commit confirmation).
> - k[elt1,elt2,eltN] means that we need for the first k elements in the
> set to return true.
>
> When k is not defined for a group, k = 1. Using only elements
> separated by commas for the upper group means that we wait for the
> first element in the set (for backward compatibility), hence:
> 1(elt1,elt2,eltN) <=> elt1,elt2,eltN

Nice design.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Josh Berkus
Date:
On 06/26/2015 09:42 AM, Robert Haas wrote:
> On Fri, Jun 26, 2015 at 1:46 AM, Michael Paquier
>> That's where the micro-language idea makes sense to use. For example,
>> we can define a group using separators and like (elt1,...eltN) or
>> [elt1,elt2,eltN]. Appending a number in front of a group is essential
>> as well for quorum commits. Hence for example, assuming that '()' is
>> used for a group whose element order does not matter, if we use that:
>> - k(elt1,elt2,eltN) means that we need for the k elements in the set
>> to return true (aka commit confirmation).
>> - k[elt1,elt2,eltN] means that we need for the first k elements in the
>> set to return true.
>>
>> When k is not defined for a group, k = 1. Using only elements
>> separated by commas for the upper group means that we wait for the
>> first element in the set (for backward compatibility), hence:
>> 1(elt1,elt2,eltN) <=> elt1,elt2,eltN

This really feels like we're going way beyond what we want in a single
string GUC.  I feel that this feature, as outlined, is a terrible hack
which we will regret supporting in the future.  You're taking something
which was already a fast hack because we weren't sure if anyone would
use it, and building two levels on top of that.

If we're going to do quorum, multi-set synchrep, then we need to have a
real management interface.  Like, we really ought to have a system
catalog and some built-in functions to manage this instead, e.g.

pg_add_synch_set(set_name NAME, quorum INT, set_members VARIADIC)

pg_add_synch_set('bolivia', 1, 'bsrv-2', 'bsrv-3', 'bsrv-5')

pg_modify_synch_set(quorum INT, set_members VARIADIC)

pg_drop_synch_set(set_name NAME)

For users who want the new functionality, they would just set
synchronous_standby_names='catalog' in pg.conf.

Having a function interface for this would make it worlds easier for the
DBA to reconfigure in order to accommodate network changes as well.
Let's face it, a DBA with three synch sets in different geos is NOT
going to want to edit pg.conf by hand and reload when the link to Brazil
goes down.  That's a really sucky workflow, and near-impossible to automate.

We'll also want a new system view, pg_stat_synchrep, with columns:

standby_name | client_addr | replication_status | synch_set |
synch_quorum | synch_status

Alternatively, we could overload those columns onto pg_stat_replication,
but that seems messy.

Finally, while I'm raining on everyone's parade: the mechanism of
identifying synchronous replicas by setting the application_name on the
replica is confusing and error-prone; if we're building out synchronous
replication into a sophisticated system, we ought to think about
replacing it.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Fri, Jun 26, 2015 at 1:12 PM, Josh Berkus <josh@agliodbs.com> wrote:
> This really feels like we're going way beyond what we want a single
> string GUC.  I feel that this feature, as outlined, is a terrible hack
> which we will regret supporting in the future.  You're taking something
> which was already a fast hack because we weren't sure if anyone would
> use it, and building two levels on top of that.
>
> If we're going to do quorum, multi-set synchrep, then we need to have a
> real management interface.  Like, we really ought to have a system
> catalog and some built in functions to manage this instead, e.g.
>
> pg_add_synch_set(set_name NAME, quorum INT, set_members VARIADIC)
>
> pg_add_synch_set('bolivia', 1, 'bsrv-2,'bsrv-3','bsrv-5')
>
> pg_modify_sync_set(quorum INT, set_members VARIADIC)
>
> pg_drop_synch_set(set_name NAME)
>
> For users who want the new functionality, they just set
> synchronous_standby_names='catalog' in pg.conf.
>
> Having a function interface for this would make it worlds easier for the
> DBA to reconfigure in order to accomodate network changes as well.
> Let's face it, a DBA with three synch sets in different geos is NOT
> going to want to edit pg.conf by hand and reload when the link to Brazil
> goes down.  That's a really sucky workflow, and near-impossible to automate.

I think your proposal is worth considering, but you would need to fill
in a lot more details and explain how it works in detail, rather than
just via a set of example function calls.  The GUC-based syntax
proposal covers cases like multi-level rules and, now, prioritization,
and it's not clear how those would be reflected in what you propose.

> Finally, while I'm raining on everyone's parade: the mechanism of
> identifying synchronous replicas by setting the application_name on the
> replica is confusing and error-prone; if we're building out synchronous
> replication into a sophisticated system, we ought to think about
> replacing it.

I'm not averse to replacing it with something we all agree is better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Josh Berkus
Date:
On 06/26/2015 11:32 AM, Robert Haas wrote:
> I think your proposal is worth considering, but you would need to fill
> in a lot more details and explain how it works in detail, rather than
> just via a set of example function calls.  The GUC-based syntax
> proposal covers cases like multi-level rules and, now, prioritization,
> and it's not clear how those would be reflected in what you propose.

So what I'm seeing from the current proposal is:

1. we have several defined synchronous sets
2. each set requires a quorum of k  (defined per set)
3. within each set, replicas are arranged in priority order.

One thing which the proposal does not implement is *names* for
synchronous sets.  I would also suggest that if I lose this battle and
we decide to go with a single stringy GUC, we at least use JSON
instead of defining our own proprietary syntax.

Point 3 also seems kind of vaguely defined.  Are we still relying on
the idea that multiple servers having the same application_name makes
them equal, and that anything else is a prioritization?  That is, if we have:

replica1: appname=group1
replica2: appname=group2
replica3: appname=group1
replica4: appname=group2
replica5: appname=group1
replica6: appname=group2

And the definition:

synchset: A
        quorum: 2
        members: [ group1, group2 ]

Then the desired behavior would be: we must get acks from at least 2
servers in group1, but if group1 isn't responding, then from group2?

What if *one* server in group1 responds?  What do we do?  Do we fail the
whole group and try for 2 out of 3 in group2?  Or do we only need one in
group2?  In which case, what prioritization is there?  Who could
possibly use anything so complex?

I'm personally not convinced that quorum and prioritization are
compatible.  I suggest instead that quorum and prioritization should be
exclusive alternatives, that is that a synch set should be either a
quorum set (with all members as equals) or a prioritization set (if rep1
fails, try rep2).  I can imagine use cases for either mode, but not one
which would involve doing both together.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Support for N synchronous standby servers - take 2

From
Sawada Masahiko
Date:
On Fri, Jun 26, 2015 at 2:46 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs  wrote:
>> Let's start with a complex, fully described use case then work out how to
>> specify what we want.
>
> Well, one of the most simple cases where quorum commit and this
> feature would be useful for is that, with 2 data centers:
> - on center 1, master A and standby B
> - on center 2, standby C and standby D
> With the current synchronous_standby_names, what we can do now is
> ensuring that one node has acknowledged the commit of master. For
> example synchronous_standby_names = 'B,C,D'. But you know that :)
> What this feature would allow use to do is for example being able to
> ensure that a node on the data center 2 has acknowledged the commit of
> master, meaning that even if data center 1 completely lost for a
> reason or another we have at least one node on center 2 that has lost
> no data at transaction commit.
>
> Now, regarding the way to express that, we need to use a concept of
> node group for each element of synchronous_standby_names. A group
> contains a set of elements, each element being a group or a single
> node. And for each group we need to know three things when a commit
> needs to be acknowledged:
> - Does my group need to acknowledge the commit?
> - If yes, how many elements in my group need to acknowledge it?
> - Does the order of my elements matter?
>
> That's where the micro-language idea makes sense to use. For example,
> we can define a group using separators and like (elt1,...eltN) or
> [elt1,elt2,eltN]. Appending a number in front of a group is essential
> as well for quorum commits. Hence for example, assuming that '()' is
> used for a group whose element order does not matter, if we use that:
> - k(elt1,elt2,eltN) means that we need for the k elements in the set
> to return true (aka commit confirmation).
> - k[elt1,elt2,eltN] means that we need for the first k elements in the
> set to return true.
>
> When k is not defined for a group, k = 1. Using only elements
> separated by commas for the upper group means that we wait for the
> first element in the set (for backward compatibility), hence:
> 1(elt1,elt2,eltN) <=> elt1,elt2,eltN
>

I think that you meant "1[elt1,elt2,eltN] <=> elt1,elt2,eltN" in this
case (for backward compatibility), right?

Regards,

--
Sawada Masahiko



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Sun, Jun 28, 2015 at 5:52 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Fri, Jun 26, 2015 at 2:46 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs  wrote:
>>> Let's start with a complex, fully described use case then work out how to
>>> specify what we want.
>>
>> Well, one of the most simple cases where quorum commit and this
>> feature would be useful for is that, with 2 data centers:
>> - on center 1, master A and standby B
>> - on center 2, standby C and standby D
>> With the current synchronous_standby_names, what we can do now is
>> ensuring that one node has acknowledged the commit of master. For
>> example synchronous_standby_names = 'B,C,D'. But you know that :)
>> What this feature would allow use to do is for example being able to
>> ensure that a node on the data center 2 has acknowledged the commit of
>> master, meaning that even if data center 1 completely lost for a
>> reason or another we have at least one node on center 2 that has lost
>> no data at transaction commit.
>>
>> Now, regarding the way to express that, we need to use a concept of
>> node group for each element of synchronous_standby_names. A group
>> contains a set of elements, each element being a group or a single
>> node. And for each group we need to know three things when a commit
>> needs to be acknowledged:
>> - Does my group need to acknowledge the commit?
>> - If yes, how many elements in my group need to acknowledge it?
>> - Does the order of my elements matter?
>>
>> That's where the micro-language idea makes sense to use. For example,
>> we can define a group using separators and like (elt1,...eltN) or
>> [elt1,elt2,eltN]. Appending a number in front of a group is essential
>> as well for quorum commits. Hence for example, assuming that '()' is
>> used for a group whose element order does not matter, if we use that:
>> - k(elt1,elt2,eltN) means that we need for the k elements in the set
>> to return true (aka commit confirmation).
>> - k[elt1,elt2,eltN] means that we need for the first k elements in the
>> set to return true.
>>
>> When k is not defined for a group, k = 1. Using only elements
>> separated by commas for the upper group means that we wait for the
>> first element in the set (for backward compatibility), hence:
>> 1(elt1,elt2,eltN) <=> elt1,elt2,eltN
>>
>
> I think that you meant "1[elt1,elt2,eltN] <=> elt1,elt2,eltN" in this
> case (for backward compatibility), right?

Yes, [] is where the order of items matters. Thanks for the correction.
Still, we could do the opposite; nothing is decided here.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Sawada Masahiko
Date:
On Sat, Jun 27, 2015 at 3:53 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 06/26/2015 11:32 AM, Robert Haas wrote:
>> I think your proposal is worth considering, but you would need to fill
>> in a lot more details and explain how it works in detail, rather than
>> just via a set of example function calls.  The GUC-based syntax
>> proposal covers cases like multi-level rules and, now, prioritization,
>> and it's not clear how those would be reflected in what you propose.
>
> So what I'm seeing from the current proposal is:
>
> 1. we have several defined synchronous sets
> 2. each set requires a quorum of k  (defined per set)
> 3. within each set, replicas are arranged in priority order.
>
> One thing which the proposal does not implement is *names* for
> synchronous sets.  I would also suggest that if I lose this battle and
> we decide to go with a single stringy GUC, that we at least use JSON
> instead of defining our out, proprietary, syntax?

JSON would be more flexible for defining synchronous sets, but it would
require us to change how the configuration file is parsed so that a
value can contain a newline.

> Point 3. also seems kind of vaguely defined.  Are we still relying on
> the idea that multiple servers have the same application_name to make
> them equal, and that anything else is a proritization?  That is, if we have:

Yep, I guess that servers with the same application name have the same
priority, and the servers in the same set have the same priority.
(A set here means a bunch of application names in the GUC.)

> replica1: appname=group1
> replica2: appname=group2
> replica3: appname=group1
> replica4: appname=group2
> replica5: appname=group1
> replica6: appname=group2
>
> And the definition:
>
> synchset: A
>         quorum: 2
>         members: [ group1, group2 ]
>
> Then the desired behavior would be: we must get acks from at least 2
> servers in group1, but if group1 isn't responding, then from group2?

In this case, if we want to use quorum commit (i.e., all replicas have
the same priority),
I guess that we must get an ack from 2 *elements* as listed (both group1
and group2).
If quorum = 1, we must get an ack from either group1 or group2.

> What if *one* server in group1 responds?  What do we do?  Do we fail the
> whole group and try for 2 out of 3 in group2?  Or do we only need one in
> group2?  In which case, what prioritization is there?  Who could
> possibly use anything so complex?

If some servers have the same application name, the master server will
get a different ack (write and flush LSNs) from each of those servers.
We can use the lowest LSN of them to release backend waiters, for more
safety.
But if only one server in group1 returns an ack to the master server,
and the other two servers are not working, I guess the master server
can use it, because the other servers are invalid. That is, we must get
an ack from at least 1 server in each of group1 and group2.
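
As a toy illustration of that "lowest LSN" idea (this is not PostgreSQL
code; the LSN type and the sample values are made up), releasing waiters
only up to the minimum position acknowledged within a group could look like:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t XLogPos;       /* stand-in for XLogRecPtr */

typedef struct StandbyAck
{
    const char *appname;
    XLogPos     flush;          /* flush position acknowledged */
} StandbyAck;

/*
 * Minimum flush position acknowledged by members of one group
 * (UINT64_MAX if the group has no acks yet).
 */
static XLogPos
group_safe_lsn(const StandbyAck *acks, int n, const char *group)
{
    XLogPos     min = UINT64_MAX;

    for (int i = 0; i < n; i++)
        if (strcmp(acks[i].appname, group) == 0 && acks[i].flush < min)
            min = acks[i].flush;
    return min;
}

int
main(void)
{
    StandbyAck  acks[] = {
        {"group1", 1200}, {"group1", 1150}, {"group2", 1300},
    };

    /* Waiters whose commit LSN is <= 1150 can be released for group1. */
    printf("group1 safe up to %llu\n",
           (unsigned long long) group_safe_lsn(acks, 3, "group1"));
    return 0;
}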

> I'm personally not convinced that quorum and prioritization are
> compatible.  I suggest instead that quorum and prioritization should be
> exclusive alternatives, that is that a synch set should be either a
> quorum set (with all members as equals) or a prioritization set (if rep1
> fails, try rep2).  I can imagine use cases for either mode, but not one
> which would involve doing both together.
>

Yep, separating the GUC parameter between prioritization and quorum
could also be a good idea.

Also, I think that we must make it possible to decide which server
should be promoted when the master server is down.

Regards,

--
Sawada Masahiko



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Sat, Jun 27, 2015 at 2:12 AM, Josh Berkus <josh@agliodbs.com> wrote:
> Finally, while I'm raining on everyone's parade: the mechanism of
> identifying synchronous replicas by setting the application_name on the
> replica is confusing and error-prone; if we're building out synchronous
> replication into a sophisticated system, we ought to think about
> replacing it.

I assume that you do not refer to a new parameter in the connection
string like node_name, no? Are you referring to an extension of
START_REPLICATION in the replication protocol to pass an ID?
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Josh Berkus
Date:
On 06/28/2015 04:36 AM, Sawada Masahiko wrote:
> On Sat, Jun 27, 2015 at 3:53 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 06/26/2015 11:32 AM, Robert Haas wrote:
>>> I think your proposal is worth considering, but you would need to fill
>>> in a lot more details and explain how it works in detail, rather than
>>> just via a set of example function calls.  The GUC-based syntax
>>> proposal covers cases like multi-level rules and, now, prioritization,
>>> and it's not clear how those would be reflected in what you propose.
>>
>> So what I'm seeing from the current proposal is:
>>
>> 1. we have several defined synchronous sets
>> 2. each set requires a quorum of k  (defined per set)
>> 3. within each set, replicas are arranged in priority order.
>>
>> One thing which the proposal does not implement is *names* for
>> synchronous sets.  I would also suggest that if I lose this battle and
>> we decide to go with a single stringy GUC, that we at least use JSON
>> instead of defining our out, proprietary, syntax?
> 
> JSON would be more flexible for making synchronous set, but it will
> make us to change how to parse configuration file to enable a value
> contains newline.

Right.  Well, another reason we should be using a system catalog and not
a single GUC ...

> In this case, If we want to use quorum commit (i.g., all replica have
> same priority),
> I guess that we must get ack from 2 *elements* in listed (both group1
> and group2).
> If quorumm = 1, we must get ack from either group1 or group2.

In that case, priority among quorum groups is pretty meaningless,
isn't it?

>> I'm personally not convinced that quorum and prioritization are
>> compatible.  I suggest instead that quorum and prioritization should be
>> exclusive alternatives, that is that a synch set should be either a
>> quorum set (with all members as equals) or a prioritization set (if rep1
>> fails, try rep2).  I can imagine use cases for either mode, but not one
>> which would involve doing both together.
>>
> 
> Yep, separating the GUC parameter between prioritization and quorum
> could be also good idea.

We're agreed, then ...

> Also I think that we must enable us to decide which server we should
> promote when the master server is down.

Yes, and probably my biggest issue with this patch is that it makes
deciding which server to fail over to *more* difficult (by adding more
synchronous options) without giving the DBA any more tools to decide how
to fail over.  Aside from "because we said we'd eventually do it", what
real-world problem are we solving with this patch?

I'm serious.  Only if we define the real reliability/availability
problem we want to solve can we decide if the new feature solves it.
I've seen a lot of technical discussion about the syntax for the
proposed GUC, and zilch about what's going to happen when the master
fails, or who the target audience for this feature is.

On 06/28/2015 05:11 AM, Michael Paquier wrote:
> On Sat, Jun 27, 2015 at 2:12 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> Finally, while I'm raining on everyone's parade: the mechanism of
>> identifying synchronous replicas by setting the application_name on the
>> replica is confusing and error-prone; if we're building out synchronous
>> replication into a sophisticated system, we ought to think about
>> replacing it.
>
> I assume that you do not refer to a new parameter in the connection
> string like node_name, no? Are you referring to an extension of
> START_REPLICATION in the replication protocol to pass an ID?

Well, if I had my druthers, we'd have a way to map client_addr (or
replica IDs, which would be better, in case of network proxying) *on the
master* to synchronous standby roles.  Synch roles should be defined on
the master, not on the replica, because it's the master which is going
to stop accepting writes if they've been defined incorrectly.

It's always been a problem that one can accomplish a de-facto
denial-of-service by joining a cluster using the same application_name
as the synch standby, more so because it's far too easy to do that
accidentally.  One needs to simply make the mistake of copying
recovery.conf from the synch replica instead of the async replica, and
you've created a reliability problem.
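
To illustrate how easy that mistake is: a standby's sync identity is just
the application_name buried in primary_conninfo, so a copied recovery.conf
silently reuses the sync replica's name (a hypothetical example; host and
user are made up):

# recovery.conf copied from the sync replica onto a new standby
standby_mode = 'on'
primary_conninfo = 'host=master port=5432 user=repl application_name=sync_a'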

Also, the fact that we use application_name for synch_standby groups
prevents us from giving the standbys in the group their own names for
identification purposes.  It's only the fact that synchronous groups are
relatively useless in the current feature set that's prevented this from
being a real operational problem; if we implement quorum commit, then
users are going to want to use groups more often and will want to
identify the members of the group, and not just by IP address.

We *really* should have discussed this feature at PGCon.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 06/28/2015 04:36 AM, Sawada Masahiko wrote:
>> On Sat, Jun 27, 2015 at 3:53 AM, Josh Berkus <josh@agliodbs.com> wrote:
>>> On 06/26/2015 11:32 AM, Robert Haas wrote:
>>>> I think your proposal is worth considering, but you would need to fill
>>>> in a lot more details and explain how it works in detail, rather than
>>>> just via a set of example function calls.  The GUC-based syntax
>>>> proposal covers cases like multi-level rules and, now, prioritization,
>>>> and it's not clear how those would be reflected in what you propose.
>>>
>>> So what I'm seeing from the current proposal is:
>>>
>>> 1. we have several defined synchronous sets
>>> 2. each set requires a quorum of k  (defined per set)
>>> 3. within each set, replicas are arranged in priority order.
>>>
>>> One thing which the proposal does not implement is *names* for
>>> synchronous sets.  I would also suggest that if I lose this battle and
>>> we decide to go with a single stringy GUC, that we at least use JSON
>>> instead of defining our own, proprietary, syntax?
>>
>> JSON would be more flexible for defining synchronous sets, but it would
>> force us to change how the configuration file is parsed so that a value
>> can contain newlines.
>
> Right.  Well, another reason we should be using a system catalog and not
> a single GUC ...

I assume that this takes into account the fact that you will still
need a SIGHUP to properly reload the new node information from those
catalogs and to track if some information has been modified or not.
And the fact that a connection to those catalogs will be needed as
well, something that we don't have now. Another barrier to the catalog
approach is that catalogs get replicated to the standbys, and I think
that we want to avoid that. But perhaps you simply meant having an SQL
interface with some metadata, right? Perhaps I got confused by the
word 'catalog'.

>>> I'm personally not convinced that quorum and prioritization are
>>> compatible.  I suggest instead that quorum and prioritization should be
>>> exclusive alternatives, that is that a synch set should be either a
>>> quorum set (with all members as equals) or a prioritization set (if rep1
>>> fails, try rep2).  I can imagine use cases for either mode, but not one
>>> which would involve doing both together.
>>>
>>
>> Yep, separating the GUC parameter between prioritization and quorum
>> could also be a good idea.
>
> We're agreed, then ...

Er, I disagree here. Being able to get prioritization and quorum
working together is a requirement of this feature in my opinion. Using
the example above with 2 data centers again, being able to define a
prioritization set on the nodes of data center 1, and a quorum
set in data center 2, would reduce failure probability by preventing
problems where, for example, one or more nodes lag behind
(improving performance at the same time).
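
To make that concrete, in the grammar sketched upthread (k(...) waits for
any k members of a set, k[...] for the first k in priority order), such a
setup might look roughly like this; the node names are invented:

# any 1 of DC1 by priority, plus any 1 of DC2 by quorum
synchronous_standby_names = '2(1[dc1_node_a, dc1_node_b], 1(dc2_node_a, dc2_node_b))'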

>> Also, I think that we must make it possible to decide which server we
>> should promote when the master server is down.
>
> Yes, and probably my biggest issue with this patch is that it makes
> deciding which server to fail over to *more* difficult (by adding more
> synchronous options) without giving the DBA any more tools to decide how
> to fail over.  Aside from "because we said we'd eventually do it", what
> real-world problem are we solving with this patch?

Hm. This patch needs to be coupled with improvements to
pg_stat_replication to be able to represent a node tree, basically by
showing the group to which each node is assigned. I can draft that if
needed, I am just a bit too lazy now...

Honestly, this is not a matter of tooling. Even today, if a DBA wants
to change s_s_names without touching postgresql.conf, they can just
run ALTER SYSTEM and then reload parameters.
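
For instance (a minimal sketch; the standby names are made up):

ALTER SYSTEM SET synchronous_standby_names = 'standby1, standby2';
SELECT pg_reload_conf();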

> It's always been a problem that one can accomplish a de-facto
> denial-of-service by joining a cluster using the same application_name
> as the synch standby, more so because it's far too easy to do that
> accidentally.  One needs to simply make the mistake of copying
> recovery.conf from the synch replica instead of the async replica, and
> you've created a reliability problem.

That's a scripting problem then. There are many ways to make a mistake
in this area when setting up a standby. The application_name value is
one; you can do worse by pointing at an incorrect IP, missing a
firewall filter, or pointing at an incorrect port.

> Also, the fact that we use application_name for synch_standby groups
> prevents us from giving the standbys in the group their own names for
> identification purposes.  It's only the fact that synchronous groups are
> relatively useless in the current feature set that's prevented this from
> being a real operational problem; if we implement quorum commit, then
> users are going to want to use groups more often and will want to
> identify the members of the group, and not just by IP address.

Managing groups in the synchronous protocol adds one level of
complexity for the operator, while what I had in mind first was to
allow a user to pass to the server a formula that decides
whether synchronous_commit is validated or not. In any case, thinking
about it now, this feels like a different feature.

> We *really* should have discussed this feature at PGCon.

What is done is done. Sawada-san and I met last weekend, and we
agreed to get a clear image of a spec for this feature on this thread
before doing any coding. So let's continue the discussion.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Josh Berkus
Date:
On 06/29/2015 01:01 AM, Michael Paquier wrote:
> On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus <josh@agliodbs.com> wrote:

>> Right.  Well, another reason we should be using a system catalog and not
>> a single GUC ...
> 
> I assume that this takes into account the fact that you will still
> need a SIGHUP to reload properly the new node information from those
> catalogs and to track if some information has been modified or not.

Well, my hope was NOT to need a sighup, which is something I see as a
failing of the current system.

> And the fact that a connection to those catalogs will be needed as
> well, something that we don't have now. 

Hmmm?  I was envisioning the catalog being used as one on the master.
Why do we need an additional connection for that?  Don't we already need
a connection in order to update pg_stat_replication?

> Another barrier to the catalog
> approach is that catalogs get replicated to the standbys, and I think
> that we want to avoid that. 

Yeah, it occurred to me that that approach has its downside as well as
an upside.  For example, you wouldn't want a failed-over new master to
synchrep to itself.  Mostly, I was looking for something reactive,
relational, and validated, instead of passing an unvalidated string to
pg.conf and hoping that it's accepted on reload.  Also some kind of
catalog approach would permit incremental changes to the config instead
of wholesale replacement.

> But perhaps you simply meant having an SQL
> interface with some metadata, right? Perhaps I got confused by the
> word 'catalog'.

No, that doesn't make any sense.

>>>> I'm personally not convinced that quorum and prioritization are
>>>> compatible.  I suggest instead that quorum and prioritization should be
>>>> exclusive alternatives, that is that a synch set should be either a
>>>> quorum set (with all members as equals) or a prioritization set (if rep1
>>>> fails, try rep2).  I can imagine use cases for either mode, but not one
>>>> which would involve doing both together.
>>>>
>>>
>>> Yep, separating the GUC parameter between prioritization and quorum
>>> could also be a good idea.
>>
>> We're agreed, then ...
> 
> Er, I disagree here. Being able to get prioritization and quorum
> working together is a requirement of this feature in my opinion. Using
> the example above with 2 data centers again, being able to define a
> prioritization set on the nodes of data center 1, and a quorum
> set in data center 2, would reduce failure probability by preventing
> problems where, for example, one or more nodes lag behind
> (improving performance at the same time).

Well, then *someone* needs to define the desired behavior for all
permutations of prioritized synch sets.  If it's undefined, then we're
far worse off than we are now.

>>> Also, I think that we must make it possible to decide which server we
>>> should promote when the master server is down.
>>
>> Yes, and probably my biggest issue with this patch is that it makes
>> deciding which server to fail over to *more* difficult (by adding more
>> synchronous options) without giving the DBA any more tools to decide how
>> to fail over.  Aside from "because we said we'd eventually do it", what
>> real-world problem are we solving with this patch?
> 
> Hm. This patch needs to be coupled with improvements to
> pg_stat_replication to be able to represent a node tree, basically by
> showing the group to which each node is assigned. I can draft that if
> needed, I am just a bit too lazy now...
> 
> Honestly, this is not a matter of tooling. Even today, if a DBA wants
> to change s_s_names without touching postgresql.conf, they can just
> run ALTER SYSTEM and then reload parameters.

You're confusing two separate things.  The primary manageability problem
has nothing to do with altering the parameter.  The main problem is: if
there is more than one synch candidate, how do we determine *after the
master dies* which candidate replica was in synch at the time of
failure?  Currently there is no way to do that.  This proposal plans to,
effectively, add more synch candidate configurations without addressing
that core design failure *at all*.  That's why I say that this patch
decreases overall reliability of the system instead of increasing it.

When I set up synch rep today, I never use more than two candidate synch
servers because of that very problem.  And even with two I have to check
replay point because I have no way to tell which replica was in-sync at
the time of failure.  Even in the current limited feature, this
significantly reduces the utility of synch rep.  In your proposal, where
I could have multiple synch rep groups in multiple geos, how on Earth
could I figure out what to do when the master datacenter dies?

BTW, ALTER SYSTEM is a strong reason to use JSON for the synch rep GUC
(assuming it's one parameter) instead of some custom syntax.  If it's
JSON, we can validate it in psql, whereas if it's some custom syntax we
have to wait for the db to reload and fail to figure out that we forgot
a comma.  Using JSON would also permit us to use jsonb_set and
jsonb_delete to incrementally change the configuration.
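
As a rough sketch of both points (the JSON shape here is purely
illustrative, not a proposed schema):

-- validation: the cast fails immediately on malformed input
SELECT '{"quorum": 2, "standbys": ["london", "nyc"]}'::jsonb;

-- incremental change: bump the quorum without rewriting the whole value
SELECT jsonb_set('{"quorum": 2, "standbys": ["london", "nyc"]}'::jsonb,
                 '{quorum}', '3');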

Question: what happens *today* if we have two different synch rep
strings in two different *.conf files?  I wouldn't assume that anyone
has tested this ...

>> It's always been a problem that one can accomplish a de-facto
>> denial-of-service by joining a cluster using the same application_name
>> as the synch standby, more so because it's far too easy to do that
>> accidentally.  One needs to simply make the mistake of copying
>> recovery.conf from the synch replica instead of the async replica, and
>> you've created a reliability problem.
> 
> That's a scripting problem then. There are many ways to do a false
> manipulation in this area when setting up a standby. application_name
> value is one, you can do worse by pointing to an incorrect IP as well,
> miss a firewall filter or point to an incorrect port.

You're missing the point.  We've created something unmanageable because
we piggy-backed it onto features intended for something else entirely.
Now you're proposing to piggy-back additional features on top of the
already teetering Beijing-acrobat-stack of piggy-backs we already have.
I'm saying that if you want synch rep to actually be a sophisticated,
high-availability system, you need it to actually be high-availability,
not just pile on additional configuration options.

I'm in favor of a more robust and sophisticated synch rep.  But not if
nobody outside this mailing list can configure it, and not if even we
don't know what it will do in an actual failure situation.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 06/29/2015 01:01 AM, Michael Paquier wrote:
>> On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus <josh@agliodbs.com> wrote:
>
>>> Right.  Well, another reason we should be using a system catalog and not
>>> a single GUC ...

The problem with using a system catalog to configure synchronous replication
is that even a configuration change needs to wait for its WAL record (i.e., the
one caused by the change to the system catalog) to be replicated. Imagine the
case where you have one synchronous standby but it goes down. To keep the
system up, you'd like to switch the replication mode to asynchronous by
changing the corresponding system catalog. But that change may need to wait
until the synchronous standby starts up again and its WAL record is
successfully replicated. This means that you may need to wait forever...

One approach to address this problem is to introduce something like an
unlogged system catalog. I'm not sure if that causes another big problem,
though...

> You're confusing two separate things.  The primary manageability problem
> has nothing to do with altering the parameter.  The main problem is: if
> there is more than one synch candidate, how do we determine *after the
> master dies* which candidate replica was in synch at the time of
> failure?  Currently there is no way to do that.  This proposal plans to,
> effectively, add more synch candidate configurations without addressing
> that core design failure *at all*.  That's why I say that this patch
> decreases overall reliability of the system instead of increasing it.

I agree this is a problem even today, but it's basically independent from
the proposed feature *itself*. So I think that it's better to discuss and
work on the problem separately. If so, we might be able to provide
a good way to find the new master even if the proposed feature finally fails
to be adopted.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Peter Eisentraut
Date:
On 6/26/15 1:46 AM, Michael Paquier wrote:
> - k(elt1,elt2,eltN) means that we need k elements in the set
> to return true (aka commit confirmation).
> - k[elt1,elt2,eltN] means that we need the first k elements in the
> set to return true.

I think the difference between (...) and [...] is not intuitive.  To me,
{...} would be more intuitive to indicate order does not matter.

> When k is not defined for a group, k = 1.

How about putting it at the end?  Like

[foo,bar,baz](2)




Re: Support for N synchronous standby servers - take 2

From
Peter Eisentraut
Date:
On 6/26/15 2:53 PM, Josh Berkus wrote:
> I would also suggest that if I lose this battle and
> we decide to go with a single stringy GUC, that we at least use JSON
> instead of defining our own, proprietary, syntax?

Does JSON have a natural syntax for a set without order?




Re: Support for N synchronous standby servers - take 2

From
Peter Eisentraut
Date:
On 7/1/15 10:15 AM, Fujii Masao wrote:
> One approach to address this problem is to introduce something like an
> unlogged system catalog. I'm not sure if that causes another big problem,
> though...

Yeah, like the data disappearing after a crash. ;-)




Re: Support for N synchronous standby servers - take 2

From
Peter Eisentraut
Date:
On 6/26/15 1:12 PM, Josh Berkus wrote:
> If we're going to do quorum, multi-set synchrep, then we need to have a
> real management interface.  Like, we really ought to have a system
> catalog and some built in functions to manage this instead, e.g.
> 
> pg_add_synch_set(set_name NAME, quorum INT, set_members VARIADIC)
> 
> pg_add_synch_set('bolivia', 1, 'bsrv-2','bsrv-3','bsrv-5')
> 
> pg_modify_sync_set(quorum INT, set_members VARIADIC)
> 
> pg_drop_synch_set(set_name NAME)

I respect that some people might like this, but I don't really see this
as an improvement.  It's much easier for an administration person or
program to type out a list of standbys in a text file than having to go
through these interfaces that are non-idempotent, verbose, and only
available when the database server is up.  The nice thing about a plain
and simple system is that you can build a complicated system on top of
it, if desired.




Re: Support for N synchronous standby servers - take 2

From
Sawada Masahiko
Date:
On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 06/29/2015 01:01 AM, Michael Paquier wrote:
>
> You're confusing two separate things.  The primary manageability problem
> has nothing to do with altering the parameter.  The main problem is: if
> there is more than one synch candidate, how do we determine *after the
> master dies* which candidate replica was in synch at the time of
> failure?  Currently there is no way to do that.  This proposal plans to,
> effectively, add more synch candidate configurations without addressing
> that core design failure *at all*.  That's why I say that this patch
> decreases overall reliability of the system instead of increasing it.
>
> When I set up synch rep today, I never use more than two candidate synch
> servers because of that very problem.  And even with two I have to check
> replay point because I have no way to tell which replica was in-sync at
> the time of failure.  Even in the current limited feature, this
> significantly reduces the utility of synch rep.  In your proposal, where
> I could have multiple synch rep groups in multiple geos, how on Earth
> could I figure out what to do when the master datacenter dies?

We can have servers with the same application_name today; that is
effectively a group.
So there are two problems regarding fail-over:
1. How can we know which group (set) we should use? (group means
application_name here)
2. And how can we decide which server of that group we should
promote to the next master server?

#1 is one of the big problems, I think.
I haven't come up with a correct solution yet, but we would need to know
which server (group) is the best one to promote
without the old master server running.
For example, by improving the pg_stat_replication view, or by having a
mediation process that always checks the progress of each standby.

For #2, I guess the best solution is that the DBA can promote any server
of the group.
That is, the DBA can always promote a server without considering the
state of the other servers of that group.
It's not difficult: always use the lowest LSN of a group as the group LSN.

>
> BTW, ALTER SYSTEM is a strong reason to use JSON for the synch rep GUC
> (assuming it's one parameter) instead of some custom syntax.  If it's
> JSON, we can validate it in psql, whereas if it's some custom syntax we
> have to wait for the db to reload and fail to figure out that we forgot
> a comma.  Using JSON would also permit us to use jsonb_set and
> jsonb_delete to incrementally change the configuration.

Sounds convenient and flexible. I agree with this JSON-format parameter
only if we don't combine both quorum and prioritization, because of
backward compatibility.
I tend to prefer a JSON-format value in a new, separate GUC parameter.
Anyway, if we use JSON, I'm imagining parameter values like the one below.
{   "group1" : {       "quorum" : 1,       "standbys" : [           {               "a" : {                   "quorum"
:2,                   "standbys" : [                       "c", "d"                   ]               }           },
      "b"       ]   }
 
}
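
For what it's worth, once cast to jsonb a value shaped like that is easy
to interrogate, e.g. to pull out a group's quorum (a sketch against a
trimmed-down sample value):

SELECT ('{"group1": {"quorum": 1, "standbys": ["b"]}}'::jsonb)
        -> 'group1' ->> 'quorum';   -- returns 1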


> Question: what happens *today* if we have two different synch rep
> strings in two different *.conf files?  I wouldn't assume that anyone
> has tested this ...

We use the last defined parameter even if sync rep strings are in several files, right?

Regards,

--
Sawada Masahiko



Re: Support for N synchronous standby servers - take 2

From
Josh Berkus
Date:
All:

Replying to multiple people below.

On 07/01/2015 07:15 AM, Fujii Masao wrote:
> On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> You're confusing two separate things.  The primary manageability problem
>> has nothing to do with altering the parameter.  The main problem is: if
>> there is more than one synch candidate, how do we determine *after the
>> master dies* which candidate replica was in synch at the time of
>> failure?  Currently there is no way to do that.  This proposal plans to,
>> effectively, add more synch candidate configurations without addressing
>> that core design failure *at all*.  That's why I say that this patch
>> decreases overall reliability of the system instead of increasing it.
> 
> I agree this is a problem even today, but it's basically independent from
> the proposed feature *itself*. So I think that it's better to discuss and
> work on the problem separately. If so, we might be able to provide
> a good way to find the new master even if the proposed feature finally fails
> to be adopted.

I agree that they're separate features.  My argument is that the quorum
synch feature isn't materially useful if we don't create some feature to
identify which server(s) were in synch at the time the master died.

The main reason I'm arguing on this thread is that discussion of this
feature went straight into GUC syntax, without ever discussing:

* what use cases are we serving?
* what features do those use cases need?

I'm saying that we need to have that discussion first before we go into
syntax.  We gave up on quorum commit in 9.1 partly because nobody was
convinced that it was actually useful; that case still needs to be
established, and if we can determine *under what circumstances* it's
useful, then we can know if the proposed feature we have is what we want
or not.

Myself, I have two use cases for changes to sync rep:

1. the ability to specify a group of three replicas in the same data
center, and have commit succeed if it succeeds on two of them.  The
purpose of this is to avoid data loss even if we lose the master and one
replica.

2. the ability to specify that synch needs to succeed on two replicas in
two different data centers.  The idea here is to be able to ensure
consistency between all data centers.
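
In the quorum grammar being discussed, those two cases would presumably be
written something like this (names invented):

# use case 1: any two of three replicas in the local data center
synchronous_standby_names = '2(local_a, local_b, local_c)'

# use case 2: one replica in each of two data centers
synchronous_standby_names = '2(1(dc1_a, dc1_b), 1(dc2_a, dc2_b))'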

Speaking of which: how does the proposed patch roll back the commit on
one replica if it fails to get quorum?

On 07/01/2015 07:55 AM, Peter Eisentraut wrote:
> I respect that some people might like this, but I don't really see this
> as an improvement.  It's much easier for an administration person or
> program to type out a list of standbys in a text file than having to go
> through these interfaces that are non-idempotent, verbose, and only
> available when the database server is up.  The nice thing about a plain
> and simple system is that you can build a complicated system on top of
> it, if desired.

I'm disagreeing that the proposed system is "plain and simple".  What we
have now is simple; anything we try to add on top of it is going to be
much less so.  Frankly, given the proposed feature, I'm not sure that a
"plain and simple" implementation is *possible*; it's not a simple problem.

On 07/01/2015 07:58 AM, Sawada Masahiko wrote:
> We can have servers with the same application_name today; that is
> effectively a group.
> So there are two problems regarding fail-over:
> 1. How can we know which group (set) we should use? (group means
> application_name here)
> 2. And how can we decide which server of that group we should
> promote to the next master server?

Well, one possibility is to have each replica keep a flag which
indicates whether it thinks it's in sync or not.  This flag would be
updated every time the replica sends a sync-ack to the master. There are a
couple of issues with that, though:

Synch Flag: the flag would need to be WAL-logged or written to disk
somehow on the replica, in case of the situation where the whole data
center shuts down, comes back up, and the master fails on restart.  In
order for the replica to WAL-log this, we'd need to add special .sync
files to pg_xlog, like we currently have .history. Such a file could be
getting updated thousands of times per second, which is potentially an
issue.  We could reduce writes by either synching to disk periodically,
or having the master write the sync state to a catalog, and replicate
it, but ...

Race Condition: there's a bit of a race condition during adverse
shutdown situations which could result in uncertainty, especially in
general data center failures and network failures which might not hit
all servers at the same time. If the master is wal-logging sync state,
this race condition is much worse, because it's pretty much certain that
one message updating sync state would be lost in the event of a master
crash.  Likewise, if we don't log every synch state change, we've
widened the opportunity for a race condition.

> #1 is one of the big problems, I think.
> I haven't come up with a correct solution yet, but we would need to know
> which server (group) is the best one to promote
> without the old master server running.
> For example, by improving the pg_stat_replication view, or by having a
> mediation process that always checks the progress of each standby.

Well, pg_stat_replication is useless for promotion, because if you need
to do an emergency promotion, you don't have access to that view.

Mind you, adding any additional synch configurations will require either
extra columns in pg_stat_replication, or a new system view, but that
doesn't help us with the failover issue.
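
Something like this is what I have in mind, where sync_group and
group_quorum are hypothetical new columns shown purely for illustration:

-- sync_group / group_quorum do not exist today; illustrative only
SELECT application_name, client_addr, sync_state, sync_group, group_quorum
FROM pg_stat_replication;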

> For #2, I guess the best solution is that the DBA can promote any server
> of the group.
> That is, the DBA can always promote a server without considering the
> state of the other servers of that group.
> It's not difficult: always use the lowest LSN of a group as the group LSN.

Sure, but if we're going to do that, why use synch rep at all?  Let
alone quorum commit?

> Sounds convenient and flexible. I agree with this JSON-format parameter
> only if we don't combine both quorum and prioritization, because of
> backward compatibility.
> I tend to prefer a JSON-format value in a new, separate GUC parameter.

Well, we could just detect if the parameter begins with { or not. ;-)

We could also do an end-run around the current GUC code by not
permitting line breaks in the JSON.
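
In other words, a single-line entry like this would pass through the
current parser untouched (the JSON shape is again just illustrative):

synchronous_standby_names = '{"group1": {"quorum": 2, "standbys": ["a", "b", "c"]}}'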

>> Question: what happens *today* if we have two different synch rep
>> strings in two different *.conf files?  I wouldn't assume that anyone
>> has tested this ...
>
> We use the last defined parameter even if sync rep strings are in several
> files, right?

Yeah, I was just wondering if anyone had tested that.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Jul 1, 2015 at 11:45 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 6/26/15 1:46 AM, Michael Paquier wrote:
>> - k(elt1,elt2,eltN) means that we need k elements in the set
>> to return true (aka commit confirmation).
>> - k[elt1,elt2,eltN] means that we need the first k elements in the
>> set to return true.
>
> I think the difference between (...) and [...] is not intuitive.  To me,
> {...} would be more intuitive to indicate order does not matter.

When defining a set of elements, {} defines elements one by one, while ()
and [] are used for ranges. Perhaps the difference is clearer that way.

>> When k is not defined for a group, k = 1.
>
> How about putting it at the end?  Like
>
> [foo,bar,baz](2)

I am less convinced by that, but I won't argue against it either.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Jul 1, 2015 at 11:58 PM, Sawada Masahiko wrote:
> On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus wrote:
>>
>> BTW, ALTER SYSTEM is a strong reason to use JSON for the synch rep GUC
>> (assuming it's one parameter) instead of some custom syntax.  If it's
>> JSON, we can validate it in psql, whereas if it's some custom syntax we
>> have to wait for the db to reload and fail to figure out that we forgot
>> a comma.  Using JSON would also permit us to use jsonb_set and
>> jsonb_delete to incrementally change the configuration.
>
> Sounds convenient and flexible. I agree with this JSON-format parameter
> only if we don't combine both quorum and prioritization, because of
> backward compatibility.
> I tend to prefer a JSON-format value in a new, separate GUC parameter.

This is going to make postgresql.conf unreadable. That does not look
very user-friendly, and a JSON object is actually longer in characters
than the formula spec proposed upthread.

> Anyway, if we use JSON, I'm imagining parameter values like the one below.
> [JSON]
>> Question: what happens *today* if we have two different synch rep
>> strings in two different *.conf files?  I wouldn't assume that anyone
>> has tested this ...
> We use the last defined parameter even if sync rep strings are in several files, right?

The last one wins, that's the rule in GUCs. Note that
postgresql.auto.conf takes priority over the rest, and that
files included in postgresql.conf have their values considered when
they are opened by the parser.

Well, the JSON format has merit if stored as metadata in PGDATA, in
something like pg_syncdata/, such that it is independent of WAL, and if
it can be modified with a useful interface, which is where Josh's first
idea could prove to be useful. We just need a clear representation of
the JSON schema we would use, and of the kind of functions we could
use to manipulate it on top of a get/set that can retrieve and
update the metadata as wanted.

In order to preserve backward compatibility, s_s_names could be set to
a special value to switch to the old interface. We could consider
dropping it after a couple of releases, once we are sure that the new
system is stable.

Also, I think that we should rely on SIGHUP as a first step of the
implementation to update the status of sync nodes in backend
processes. As a future improvement we could perhaps get rid of it. Still,
it seems safer to me to rely on a signal to update the in-memory status
as a first step, as this is what we have now.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Thu, Jul 2, 2015 at 3:21 AM, Josh Berkus <josh@agliodbs.com> wrote:
> All:
>
> Replying to multiple people below.
>
> On 07/01/2015 07:15 AM, Fujii Masao wrote:
>> On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh@agliodbs.com> wrote:
>>> You're confusing two separate things.  The primary manageability problem
>>> has nothing to do with altering the parameter.  The main problem is: if
>>> there is more than one synch candidate, how do we determine *after the
>>> master dies* which candidate replica was in synch at the time of
>>> failure?  Currently there is no way to do that.  This proposal plans to,
>>> effectively, add more synch candidate configurations without addressing
>>> that core design failure *at all*.  That's why I say that this patch
>>> decreases overall reliability of the system instead of increasing it.
>>
>> I agree this is a problem even today, but it's basically independent from
>> the proposed feature *itself*. So I think that it's better to discuss and
>> work on the problem separately. If so, we might be able to provide
>> a good way to find the new master even if the proposed feature finally fails
>> to be adopted.
>
> I agree that they're separate features.  My argument is that the quorum
> synch feature isn't materially useful if we don't create some feature to
> identify which server(s) were in synch at the time the master died.
>
> The main reason I'm arguing on this thread is that discussion of this
> feature went straight into GUC syntax, without ever discussing:
>
> * what use cases are we serving?
> * what features do those use cases need?
>
> I'm saying that we need to have that discussion first before we go into
> syntax.  We gave up on quorum commit in 9.1 partly because nobody was
> convinced that it was actually useful; that case still needs to be
> established, and if we can determine *under what circumstances* it's
> useful, then we can know if the proposed feature we have is what we want
> or not.
>
> Myself, I have two use cases for changes to sync rep:
>
> 1. the ability to specify a group of three replicas in the same data
> center, and have commit succeed if it succeeds on two of them.  The
> purpose of this is to avoid data loss even if we lose the master and one
> replica.
>
> 2. the ability to specify that synch needs to succeed on two replicas in
> two different data centers.  The idea here is to be able to ensure
> consistency between all data centers.

Yeah, I'm also thinking of those *simple* use cases. I'm not sure
how many people really want to have a very complicated quorum
commit setting.

> Speaking of which: how does the proposed patch roll back the commit on
> one replica if it fails to get quorum?

You mean the case where there are two sync replicas and the master
needs to wait until both send the ACK, and then one replica goes down?
In this case, the master receives the ACK from only one replica and
must keep waiting until a new sync replica appears and sends back
the ACK. So the committed transaction (the written WAL record) would not
be rolled back.

> Well, one possibility is to have each replica keep a flag which
> indicates whether it thinks it's in sync or not.  This flag would be
> updated every time the replica sends a sync-ack to the master. There are a
> couple of issues with that, though:

I don't think this is a good approach because there can be a case where
you need to promote even a standby server that does not have the sync flag.
Please imagine the case where you have sync and async standby servers.
When the master goes down, the async standby might be ahead of the
sync one. This is possible in practice. In this case, it might be better to
promote the async standby instead of the sync one, because the remaining
sync standby, which is behind, can easily catch up with the new master.

We can promote the sync standby in this case. But since the remaining
async standby is ahead, it's not easy for it to catch up with the new master.
Probably a new base backup needs to be taken on the async standby from
the new master, or pg_rewind needs to be executed. That is, the async
standby basically needs to be set up again.
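
(For reference, that resynchronization would look something like this
sketch; the data directory and connection string are made up:

pg_rewind --target-pgdata=/var/lib/pgsql/data \
          --source-server='host=new_master port=5432 user=postgres'

after which the old async standby can be restarted as a standby of the
new master.)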

So I'm thinking that we basically need to check the progress on each
standby to choose the new master.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Amit Langote
Date:
On 2015-07-02 PM 03:12, Fujii Masao wrote:
> 
> So I'm thinking that we basically need to check the progress on each
> standby to choose the new master.
> 

Does HA software determine a standby to promote based on replication progress,
or would things be reliable enough for it to infer one from the quorum setting
specified in the GUC (or wherever)? Is part of the job of this patch to make
the latter possible? Just wondering, or perhaps I am completely missing the
point.

Thanks,
Amit




Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Amit wrote:

> Does HA software determine a standby to promote based on replication
> progress, or would things be reliable enough for it to infer one from the
> quorum setting specified in the GUC (or wherever)? Is part of the job of
> this patch to make the latter possible? Just wondering, or perhaps I am
> completely missing the point.

Deciding the failover standby is not exactly part of this patch, but we
should be able to set up a mechanism to decide which is the best standby
to promote.

We might not be able to conclude this from the sync parameter alone.

As specified before, in some cases an async standby could also be the
most eligible for promotion.



--

Beena Emerson




Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Thu, Jul 2, 2015 at 3:29 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2015-07-02 PM 03:12, Fujii Masao wrote:
>>
>> So I'm thinking that we basically need to check the progress on each
>> standby to choose the new master.
>>
>
> Does HA software determine a standby to promote based on replication progress,
> or would things be reliable enough for it to infer one from the quorum setting
> specified in the GUC (or wherever)? Is part of the job of this patch to make
> the latter possible? Just wondering, or perhaps I am completely missing the point.

Replication progress is a factor in the choice, but not the only one. The
sole role of this patch is just to allow us to have a more advanced
policy for defining how synchronous replication works, aka how we want
to let the master acknowledge a commit synchronously from a set of N
standbys. In any case, this is something unrelated to the discussion
happening here.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Amit Langote
Date:
On 2015-07-02 PM 03:52, Michael Paquier wrote:
> On Thu, Jul 2, 2015 at 3:29 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2015-07-02 PM 03:12, Fujii Masao wrote:
>>>
>>> So I'm thinking that we basically need to check the progress on each
>>> standby to choose the new master.
>>>
>>
>> Does HA software determine a standby to promote based on replication progress,
>> or would things be reliable enough for it to infer one from the quorum setting
>> specified in the GUC (or wherever)? Is part of the job of this patch to make
>> the latter possible? Just wondering, or perhaps I am completely missing the point.
> 
> Replication progress is a factor in the choice, but not the only one. The
> sole role of this patch is just to allow us to have a more advanced
> policy for defining how synchronous replication works, aka how we want
> to let the master acknowledge a commit synchronously from a set of N
> standbys. In any case, this is something unrelated to the discussion
> happening here.
> 

Got it, thanks!

Regards,
Amit




Re: Support for N synchronous standby servers - take 2

From
Amit Langote
Date:
On 2015-07-02 PM 03:43, Beena Emerson wrote:
> Amit wrote:
> 
>> Does HA software determine a standby to promote based on replication
>> progress, or would things be reliable enough for it to infer one from the
>> quorum setting specified in the GUC (or wherever)? Is part of the job of
>> this patch to make the latter possible? Just wondering, or perhaps I am
>> completely missing the point.
> 
> Deciding the failover standby is not exactly part of this patch, but we
> should be able to set up a mechanism to decide which is the best standby
> to promote.
>
> We might not be able to conclude this from the sync parameter alone.
>
> As specified before, in some cases an async standby could also be the
> most eligible for promotion.
> 

Thanks for the explanation.

Regards,
Amit





Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Hello,
There has been a lot of discussion. It has become a bit confusing. 
I am summarizing my understanding of the discussion till now.
Kindly let me know if I missed anything important.

Backward compatibility:
We have to provide support for the current format and behavior for
synchronous replication (the first running standby from the s_s_names list).
In case the new format does not use the GUC, a special value would be
specified for s_s_names to indicate that.

Priority and quorum:
Quorum treats all the standbys with the same priority, while in priority
behavior each one has a different priority and the ACK must be received
from the specified k lowest-priority servers.
I am not sure how combining both will work out.
Mostly we would like to have some standbys from each data center in
sync. Can that not be achieved by quorum alone?

GUC parameter:
There are some arguments over the text format. However, if we continue
using it, specifying the number before the group is more readable than
specifying it after:
s_s_names = 3(A, (P,Q), 2(X,Y,Z)) is better compared to
s_s_names = (A, (P,Q), (X,Y,Z)(2))(3)

Catalog Method:
Is it safe to assume we are not going ahead with the catalog approach?
A system catalog and some built-in functions to set the sync parameters are
not viable because:
- a promoted master could sync rep to itself
- changes to the catalog may continuously wait for ACK from a down server

The main problem with an unlogged system catalog is data loss during a crash.

JSON:
I agree it would make the GUC very complex and unreadable. We can consider
using it as metadata.
I think the only point in favor of JSON is being able to set it using
functions instead of having to edit and reload, right?

Identifying standby:
The main concern with the current use of application_name seems to be that
multiple standbys with the same name would form an unintended group (maybe
across data clusters too?).
I agree it would be better to have a mechanism to uniquely identify a
standby; groups can then be made using whatever method we use to set the
sync requirements.

The concern about deciding which standby is to be promoted seems to be a
separate issue altogether.




--

Beena Emerson




Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Thu, Jul 2, 2015 at 5:44 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello,
> There has been a lot of discussion. It has become a bit confusing.
> I am summarizing my understanding of the discussion till now.
> Kindly let me know if I missed anything important.
>
> Backward compatibility:
> We have to provide support for the current format and behavior for
> synchronous replication (the first running standby from the s_s_names list).
> In case the new format does not use the GUC, a special value would be
> specified for s_s_names to indicate that.
>
> Priority and quorum:
> Quorum treats all the standbys with the same priority, while in priority
> behavior each one has a different priority and the ACK must be received
> from the specified k lowest-priority servers.
> I am not sure how combining both will work out.
> Mostly we would like to have some standbys from each data center in
> sync. Can that not be achieved by quorum alone?

So you're wondering if there is a use case where both quorum and priority are
used together?

For example, please imagine the case where you have two standby servers
(say A and B) in the local site, and one standby server (say C) in a remote
disaster recovery site. You want to set up sync replication so that the master
waits for the ACK from either A or B, i.e., the setting of 1(A, B). Also, only
when either A or B crashes, you want to make the master wait for the ACK from
either the remaining local standby or C. On the other hand, you don't want to
use a setting like 1(A, B, C). Because in this setting, C can be the sync
standby when the master crashes, and both A and B might be far behind C. In
this case, you would need to promote the remote standby server C to new
master... this is what you'd like to avoid.

The setting that you need is 1(1[A, C], 1[B, C]) in Michael's proposed grammar.
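
Spelled out as a configuration value, assuming that grammar is what gets
committed, it would be:

# the master commits once either priority list 1[A, C] or 1[B, C] is satisfied
synchronous_standby_names = '1(1[A, C], 1[B, C])'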

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 2 July 2015 at 09:44, Beena Emerson <memissemerson@gmail.com> wrote:

> I am not sure how combining both will work out.

Use cases needed.

> Catalog Method:
> Is it safe to assume we are not going ahead with the catalog approach?

Yes

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Josh Berkus
Date:
On 07/01/2015 11:12 PM, Fujii Masao wrote:
> I don't think this is a good approach because there can be a case where
> you need to promote even a standby server that does not have the sync flag.
> Please imagine the case where you have sync and async standby servers.
> When the master goes down, the async standby might be ahead of the
> sync one. This is possible in practice. In this case, it might be better to
> promote the async standby instead of the sync one, because the remaining
> sync standby, which is behind, can easily catch up with the new master.

If we're always going to be polling the replicas for furthest ahead,
then why bother implementing quorum synch at all? That's the basic
question I'm asking.  What does it buy us that we don't already have?

I'm serious, here.  Without any additional information on synch state at
failure time, I would never use quorum synch.  If there's someone on
this thread who *would*, let's speak to their use case and then we can
actually get the feature right.  Anyone?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Support for N synchronous standby servers - take 2

From
Andres Freund
Date:
On 2015-07-02 11:10:27 -0700, Josh Berkus wrote:
> If we're always going to be polling the replicas for furthest ahead,
> then why bother implementing quorum synch at all? That's the basic
> question I'm asking.  What does it buy us that we don't already have?

What do those topics have to do with each other? A standby fundamentally
can be further ahead than what the primary knows about. So you can't do
very much with that knowledge on the master anyway?

> I'm serious, here.  Without any additional information on synch state at
> failure time, I would never use quorum synch.  If there's someone on
> this thread who *would*, let's speak to their use case and then we can
> actually get the feature right.  Anyone?

How would you otherwise ensure that your data is both on a second server
in the same DC and in another DC? Which is a pretty darn common desire?

Greetings,

Andres Freund



Re: Support for N synchronous standby servers - take 2

From
Josh Berkus
Date:
On 07/02/2015 11:31 AM, Andres Freund wrote:
> On 2015-07-02 11:10:27 -0700, Josh Berkus wrote:
>> If we're always going to be polling the replicas for furthest ahead,
>> then why bother implementing quorum synch at all? That's the basic
>> question I'm asking.  What does it buy us that we don't already have?
> 
> What do those topics have to do with each other? A standby fundamentally
> can be further ahead than what the primary knows about. So you can't do
> very much with that knowledge on the master anyway?
> 
>> I'm serious, here.  Without any additional information on synch state at
>> failure time, I would never use quorum synch.  If there's someone on
>> this thread who *would*, let's speak to their use case and then we can
>> actually get the feature right.  Anyone?
> 
> How would you otherwise ensure that your data is both on a second server
> in the same DC and in another DC? Which is a pretty darn common desire?

So there's two parts to this:

1. I need to ensure that data is replicated to X places.

2. I need to *know* which places data was synchronously replicated to
when the master goes down.

My entire point is that (1) alone is useless unless you also have (2).
And do note that I'm talking about information on the replica, not on
the master, since in any failure situation we don't have the old master
around to check.

Say you take this case:

"2" : { "local_replica", "london_server", "nyc_server" }

... which should ensure that any data which is replicated is replicated
to at least two places, so that even if you lose the entire local
datacenter, you have the data on at least one remote data center.

EXCEPT: say you lose both the local datacenter and communication with
the london server at the same time (due to transatlantic cable issues, a
huge DDOS, or whatever).  You'd like to promote the NYC server to be the
new master, but only if it was in sync at the time its communication
with the original master was lost ... except that you have no way of
knowing that.

Given that, we haven't really reduced our data loss potential or
improved availability from the current 1-redundant synch rep.  We still
need to wait to get the London server back to figure out if we want to
promote or not.

Now, this configuration would reduce the data loss window:

"3" : { "local_replica", "london_server", "nyc_server" }

As would this one:

"2" : { "local_replica", "nyc_server" }

... because we would know definitively which servers were in sync.  So
maybe that's the use case we should be supporting?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Support for N synchronous standby servers - take 2

From
Andres Freund
Date:
On 2015-07-02 11:50:44 -0700, Josh Berkus wrote:
> So there's two parts to this:
> 
> 1. I need to ensure that data is replicated to X places.
> 
> 2. I need to *know* which places data was synchronously replicated to
> when the master goes down.
> 
> My entire point is that (1) alone is useless unless you also have (2).

I think there's a good set of usecases where that's really not the case.

> And do note that I'm talking about information on the replica, not on
> the master, since in any failure situation we don't have the old
> master around to check.

How would you, even theoretically, synchronize that knowledge to all the
replicas? Even when they're temporarily disconnected?

> Say you take this case:
> 
> "2" : { "local_replica", "london_server", "nyc_server" }
> 
> ... which should ensure that any data which is replicated is replicated
> to at least two places, so that even if you lose the entire local
> datacenter, you have the data on at least one remote data center.

> EXCEPT: say you lose both the local datacenter and communication with
> the london server at the same time (due to transatlantic cable issues, a
> huge DDOS, or whatever).  You'd like to promote the NYC server to be the
> new master, but only if it was in sync at the time its communication
> with the original master was lost ... except that you have no way of
> knowing that.

Pick up the phone, compare the lsns, done.
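
For the record, "comparing the lsns" is a one-liner on each surviving
standby (function names as of 9.4):

-- run on each candidate; the node reporting the largest location wins
SELECT pg_last_xlog_receive_location();

-- pg_xlog_location_diff() makes the ordering explicit if needed
SELECT pg_xlog_location_diff('0/5000060', '0/5000000');  -- > 0: first is ahead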

> Given that, we haven't really reduced our data loss potential or
> improved availabilty from the current 1-redundant synch rep.  We still
> need to wait to get the London server back to figure out if we want to
> promote or not.
> 
> Now, this configuration would reduce the data loss window:
> 
> "3" : { "local_replica", "london_server", "nyc_server" }
> 
> As would this one:
> 
> "2" : { "local_replica", "nyc_server" }
> 
> ... because we would know definitively which servers were in sync.  So
> maybe that's the use case we should be supporting?

If you want automated failover you need a leader election amongst the
surviving nodes. The replay position is all they need to elect the node
that's furthest ahead, and that information exists today.

Greetings,

Andres Freund



On 07/02/2015 12:44 PM, Andres Freund wrote:
> On 2015-07-02 11:50:44 -0700, Josh Berkus wrote:
>> So there's two parts to this:
>>
>> 1. I need to ensure that data is replicated to X places.
>>
>> 2. I need to *know* which places data was synchronously replicated to
>> when the master goes down.
>>
>> My entire point is that (1) alone is useless unless you also have (2).
> 
> I think there's a good set of usecases where that's really not the case.

Please share!  My plea for use cases was sincere.  I can't think of any.

>> And do note that I'm talking about information on the replica, not on
>> the master, since in any failure situation we don't have the old
>> master around to check.
> 
> How would you, even theoretically, synchronize that knowledge to all the
> replicas? Even when they're temporarily disconnected?

You can't, which is why what we need to know is when the replica thinks
it was last synced from the replica side.  That is, a sync timestamp and
lsn from the last time the replica ack'd a sync commit back to the
master successfully.  Based on that information, I can make an informed
decision, even if I'm down to one replica.
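
Today the closest thing to that available on a replica is its replay
progress, which is necessary but not sufficient for the decision I'm
describing; the ack'd-sync-commit timestamp/LSN pair would be new state:

-- what a standby can already report about itself
SELECT pg_last_xlog_replay_location(), pg_last_xact_replay_timestamp();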

>> ... because we would know definitively which servers were in sync.  So
>> maybe that's the use case we should be supporting?
> 
> If you want automated failover you need a leader election amongst the
> surviving nodes. The replay position is all they need to elect the node
> that's furthest ahead, and that information exists today.

I can do that already.  If quorum synch commit doesn't help us minimize
data loss any better than async replication or the current 1-redundant,
why would we want it?  If it does help us minimize data loss, how?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



On Fri, Jul 3, 2015 at 6:54 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 07/02/2015 12:44 PM, Andres Freund wrote:
>> On 2015-07-02 11:50:44 -0700, Josh Berkus wrote:
>>> So there's two parts to this:
>>>
>>> 1. I need to ensure that data is replicated to X places.
>>>
>>> 2. I need to *know* which places data was synchronously replicated to
>>> when the master goes down.
>>>
>>> My entire point is that (1) alone is useless unless you also have (2).
>>
>> I think there's a good set of usecases where that's really not the case.
>
> Please share!  My plea for use cases was sincere.  I can't think of any.
>
>>> And do note that I'm talking about information on the replica, not on
>>> the master, since in any failure situation we don't have the old
>>> master around to check.
>>
>> How would you, even theoretically, synchronize that knowledge to all the
>> replicas? Even when they're temporarily disconnected?
>
> You can't, which is why what we need to know is when the replica thinks
> it was last synced from the replica side.  That is, a sync timestamp and
> lsn from the last time the replica ack'd a sync commit back to the
> master successfully.  Based on that information, I can make an informed
> decision, even if I'm down to one replica.
>
>>> ... because we would know definitively which servers were in sync.  So
>>> maybe that's the use case we should be supporting?
>>
>> If you want automated failover you need a leader election amongst the
>> surviving nodes. The replay position is all they need to elect the node
>> that's furthest ahead, and that information exists today.
>
> I can do that already.  If quorum synch commit doesn't help us minimize
> data loss any better than async replication or the current 1-redundant,
> why would we want it?  If it does help us minimize data loss, how?

In your example of "2" : { "local_replica", "london_server", "nyc_server" },
if there is nothing like quorum commit, only local_replica is synch
and the other two are async. In this case, if the local data center gets
destroyed, you need to promote either london_server or nyc_server. But
since they are async, they might not have the data which have been already
committed in the master. So data loss! Of course, as I said yesterday,
they might have all the data and no data loss happens at the promotion.
But the point is that there is no guarantee that no data loss happens.
OTOH, if we use quorum commit, we can guarantee that either london_server
or nyc_server has all the data which have been committed in the master.

So I think that quorum commit is helpful for minimizing the data loss.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Josh Berkus wrote:
> 
> Say you take this case:
> 
> "2" : { "local_replica", "london_server", "nyc_server" }
> 
> ... which should ensure that any data which is replicated is replicated
> to at least two places, so that even if you lose the entire local
> datacenter, you have the data on at least one remote data center.

> EXCEPT: say you lose both the local datacenter and communication with
> the london server at the same time (due to transatlantic cable issues, a
> huge DDOS, or whatever).  You'd like to promote the NYC server to be the
> new master, but only if it was in sync at the time its communication
> with the original master was lost ... except that you have no way of
> knowing that.

Please consider the following:

If we have multiple replica on each DC, we can use the following:

3(local1, 1(london1, london2), 1(nyc1, nyc2))

In this case, at least 1 standby from each DC is a sync rep. When the local
and London data centers are lost, NYC promotion can be done by comparing the LSNs.

Quorum would also ensure that even if one of the standbys in a data
center goes down, another can take over, preventing data loss.

In the case 3(local1, london1, nyc1)

If nyc1 is down, the transaction would wait indefinitely. This can be
avoided.









--
Beena Emerson



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Hello,

This has been registered in the next 2015-09 CF since the majority are in
favor of adding this multiple sync replication feature (with quorum/priority).

A new patch will be submitted once we have reached a consensus on the design.

--
Beena Emerson

Re: Synch failover WAS: Support for N synchronous standby servers - take 2

From
Sawada Masahiko
Date:
On Fri, Jul 3, 2015 at 12:18 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Jul 3, 2015 at 6:54 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 07/02/2015 12:44 PM, Andres Freund wrote:
>>> On 2015-07-02 11:50:44 -0700, Josh Berkus wrote:
>>>> So there's two parts to this:
>>>>
>>>> 1. I need to ensure that data is replicated to X places.
>>>>
>>>> 2. I need to *know* which places data was synchronously replicated to
>>>> when the master goes down.
>>>>
>>>> My entire point is that (1) alone is useless unless you also have (2).
>>>
>>> I think there's a good set of usecases where that's really not the case.
>>
>> Please share!  My plea for usecases was sincere.  I can't think of any.
>>
>>>> And do note that I'm talking about information on the replica, not on
>>>> the master, since in any failure situation we don't have the old
>>>> master around to check.
>>>
>>> How would you, even theoretically, synchronize that knowledge to all the
>>> replicas? Even when they're temporarily disconnected?
>>
>> You can't, which is why what we need to know is when the replica thinks
>> it was last synced from the replica side.  That is, a sync timestamp and
>> lsn from the last time the replica ack'd a sync commit back to the
>> master successfully.  Based on that information, I can make an informed
>> decision, even if I'm down to one replica.
>>
>>>> ... because we would know definitively which servers were in sync.  So
>>>> maybe that's the use case we should be supporting?
>>>
>>> If you want automated failover you need a leader election amongst the
>>> surviving nodes. The replay position is all they need to elect the node
>>> that's furthest ahead, and that information exists today.
>>
>> I can do that already.  If quorum synch commit doesn't help us minimize
>> data loss any better than async replication or the current 1-redundant,
>> why would we want it?  If it does help us minimize data loss, how?
>
> In your example of "2" : { "local_replica", "london_server", "nyc_server" },
> if there is not something like quorum commit, only local_replica is synch
> and the other two are async. In this case, if the local data center gets
> destroyed, you need to promote either london_server or nyc_server. But
> since they are async, they might not have the data which have been already
> committed in the master. So data loss! Of course, as I said yesterday,
> they might have all the data and no data loss happens at the promotion.
> But the point is that there is no guarantee that no data loss happens.
> OTOH, if we use quorum commit, we can guarantee that either london_server
> or nyc_server has all the data which have been committed in the master.
>
> So I think that quorum commit is helpful for minimizing the data loss.
>

Yeah, quorum commit is helpful for minimizing data loss in comparison
with today's replication.
But in your case, how can we know which server we should use as
the next master after the local data center goes down?
If we choose the wrong one, we would get data loss.

Regards,

--
Sawada Masahiko



On Fri, Jul 3, 2015 at 5:59 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Fri, Jul 3, 2015 at 12:18 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Fri, Jul 3, 2015 at 6:54 AM, Josh Berkus <josh@agliodbs.com> wrote:
>>> On 07/02/2015 12:44 PM, Andres Freund wrote:
>>>> On 2015-07-02 11:50:44 -0700, Josh Berkus wrote:
>>>>> So there's two parts to this:
>>>>>
>>>>> 1. I need to ensure that data is replicated to X places.
>>>>>
>>>>> 2. I need to *know* which places data was synchronously replicated to
>>>>> when the master goes down.
>>>>>
>>>>> My entire point is that (1) alone is useless unless you also have (2).
>>>>
>>>> I think there's a good set of usecases where that's really not the case.
>>>
>>> Please share!  My plea for usecases was sincere.  I can't think of any.
>>>
>>>>> And do note that I'm talking about information on the replica, not on
>>>>> the master, since in any failure situation we don't have the old
>>>>> master around to check.
>>>>
>>>> How would you, even theoretically, synchronize that knowledge to all the
>>>> replicas? Even when they're temporarily disconnected?
>>>
>>> You can't, which is why what we need to know is when the replica thinks
>>> it was last synced from the replica side.  That is, a sync timestamp and
>>> lsn from the last time the replica ack'd a sync commit back to the
>>> master successfully.  Based on that information, I can make an informed
>>> decision, even if I'm down to one replica.
>>>
>>>>> ... because we would know definitively which servers were in sync.  So
>>>>> maybe that's the use case we should be supporting?
>>>>
>>>> If you want automated failover you need a leader election amongst the
>>>> surviving nodes. The replay position is all they need to elect the node
>>>> that's furthest ahead, and that information exists today.
>>>
>>> I can do that already.  If quorum synch commit doesn't help us minimize
>>> data loss any better than async replication or the current 1-redundant,
>>> why would we want it?  If it does help us minimize data loss, how?
>>
>> In your example of "2" : { "local_replica", "london_server", "nyc_server" },
>> if there is not something like quorum commit, only local_replica is synch
>> and the other two are async. In this case, if the local data center gets
>> destroyed, you need to promote either london_server or nyc_server. But
>> since they are async, they might not have the data which have been already
>> committed in the master. So data loss! Of course, as I said yesterday,
>> they might have all the data and no data loss happens at the promotion.
>> But the point is that there is no guarantee that no data loss happens.
>> OTOH, if we use quorum commit, we can guarantee that either london_server
>> or nyc_server has all the data which have been committed in the master.
>>
>> So I think that quorum commit is helpful for minimizing the data loss.
>>
>
> Yeah, quorum commit is helpful for minimizing data loss in comparison
> with today replication.
> But in this your case, how can we know which server we should use as
> the next master server, after local data center got down?
> If we choose a wrong one, we would get the data loss.

Check the progress of each server, e.g., by using
pg_last_xlog_replay_location(),
and choose the server which is furthest ahead as the new master.
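
For example, a minimal sketch (both functions exist today; run this on each
candidate standby and compare the results):

    -- On each candidate standby, check how far WAL replay has progressed.
    SELECT pg_last_xlog_replay_location();

    -- pg_xlog_location_diff() gives the byte difference between two LSNs;
    -- promote the standby whose replay location is furthest ahead.
    SELECT pg_xlog_location_diff('0/5000060', '0/5000000');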

Regards,

-- 
Fujii Masao



Re: Synch failover WAS: Support for N synchronous standby servers - take 2

From
Sawada Masahiko
Date:
On Fri, Jul 3, 2015 at 6:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Jul 3, 2015 at 5:59 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> Yeah, quorum commit is helpful for minimizing data loss in comparison
>> with today replication.
>> But in this your case, how can we know which server we should use as
>> the next master server, after local data center got down?
>> If we choose a wrong one, we would get the data loss.
>
> Check the progress of each server, e.g., by using
> pg_last_xlog_replay_location(),
> and choose the server which is ahead of as new master.
>

Thanks. So we can choose the next master server by checking the
progress of each server, if hot standby is enabled.
And such a procedure is needed even with today's replication.

I think that the #2 problem which Josh pointed out seems to be solved:
    1. I need to ensure that data is replicated to X places.
    2. I need to *know* which places data was synchronously replicated
to when the master goes down.
And we can address the #1 problem using quorum commit.

Thought?

Regards,

--
Sawada Masahiko



Re: Synch failover WAS: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Sawada Masahiko wrote:
>
> I think that the #2 problem which is Josh pointed out seems to be solved;
>     1. I need to ensure that data is replicated to X places.
>     2. I need to *know* which places data was synchronously replicated
> to when the master goes down.
> And we can address #1 problem using quorum commit.
> 
> Thought?

I agree. The knowledge of which servers were in sync (#2) would not actually
help us determine the new master, and quorum solves #1.





--
Beena Emerson



Re: Synch failover WAS: Support for N synchronous standby servers - take 2

From
Andres Freund
Date:
On 2015-07-02 14:54:19 -0700, Josh Berkus wrote:
> On 07/02/2015 12:44 PM, Andres Freund wrote:
> > On 2015-07-02 11:50:44 -0700, Josh Berkus wrote:
> >> So there's two parts to this:
> >>
> >> 1. I need to ensure that data is replicated to X places.
> >>
> >> 2. I need to *know* which places data was synchronously replicated to
> >> when the master goes down.
> >>
> >> My entire point is that (1) alone is useless unless you also have (2).
> > 
> > I think there's a good set of usecases where that's really not the case.
> 
> Please share!  My plea for usecases was sincere.  I can't think of any.

"I have important data. I want to survive both a local hardware failure
(it's faster to continue using the local standby) and I want to protect
myself against actual disaster striking the primary datacenter". Pretty
common.

> >> And do note that I'm talking about information on the replica, not on
> >> the master, since in any failure situation we don't have the old
> >> master around to check.
> > 
> > How would you, even theoretically, synchronize that knowledge to all the
> > replicas? Even when they're temporarily disconnected?
> 
> You can't, which is why what we need to know is when the replica thinks
> it was last synced from the replica side.  That is, a sync timestamp and
> lsn from the last time the replica ack'd a sync commit back to the
> master successfully.  Based on that information, I can make an informed
> decision, even if I'm down to one replica.

I think you're mashing together nearly unrelated topics.

Note that we already have the last replayed lsn, and we have the
timestamp of the last replayed transaction.

> > If you want automated failover you need a leader election amongst the
> > surviving nodes. The replay position is all they need to elect the node
> > that's furthest ahead, and that information exists today.
> 
> I can do that already.  If quorum synch commit doesn't help us minimize
> data loss any better than async replication or the current 1-redundant,
> why would we want it?  If it does help us minimize data loss, how?

But it does make us safer against data loss? If your app gets back the
commit you know that the data has made it both to the local replica and
one other datacenter. And you're now safe against both the loss of the
master's hardware (the most likely scenario) and the loss of the
entire primary datacenter. That you need additional logic to know to
which other datacenter to fail over is just yet another piece (which you
*can* build today).



On 07/03/2015 03:12 AM, Sawada Masahiko wrote:
> Thanks. So we can choice the next master server using by checking the
> progress of each server, if hot standby is enabled.
> And a such procedure is needed even today replication.
> 
> I think that the #2 problem which is Josh pointed out seems to be solved;
>     1. I need to ensure that data is replicated to X places.
>     2. I need to *know* which places data was synchronously replicated
> to when the master goes down.
> And we can address #1 problem using quorum commit.

It's not solved. I still have zero ways of knowing if a replica was in
sync or not at the time the master went down.

Now, you and others have argued persuasively that there are valuable use
cases for quorum commit even without solving that particular issue, but
there's a big difference between "we can work around this problem" and
"the problem is solved".  I forked the subject line because I think that
the inability to identify synch replicas under failover conditions is a
serious problem with synch rep *today*, and pretending that it doesn't
exist doesn't help us even if we don't fix it in 9.6.

Let me give you three cases where our lack of information on the replica
side about whether it thinks it's in sync or not causes synch rep to
fail to protect data.  The first case is one I've actually seen in
production, and the other two are hypothetical but entirely plausible.

Case #1: two synchronous replica servers have the application name
"synchreplica".  An admin uses the wrong Chef template, and deploys a
server which was supposed to be an async replica with the same
recovery.conf template, and it ends up in the "synchreplica" group as
well. Due to restarts (pushing out an update release), the new server
ends up seizing and keeping sync. Then the master dies.  Because the new
server wasn't supposed to be a sync replica in the first place, it is
not checked; they just fail over to the furthest ahead of the two
original synch replicas, neither of which was actually in synch.

Case #2: "2 { local, london, nyc }" setup.  At 2am, the links between
data centers become unreliable, such that the on-call sysadmin disables
synch rep because commits on the master are intolerably slow.  Then, at
10am, the links between data centers fail entirely.  The day shift, not
knowing that the night shift disabled sync, fail over to London thinking
that they can do so with zero data loss.

Case #3 "1 { london, frankfurt }, 1 { sydney, tokyo }" multi-group
priority setup.  We lose communication with everything but Europe.  How
can we decide whether to wait to get sydney back, or to promote London
immediately?

I could come up with numerous other situations, but all three of the
above completely reasonable cases show how having the knowledge of what
time a replica thought it was last in sync is vital to preventing bad
failovers and data loss, and to knowing the quantity of data loss when
it can't be prevented.

It's an issue *now* that the only data we have about the state of sync
rep is on the master, and it dies with the master.  And it severely limits
the actual utility of our synch rep.  People implement synch rep in the
first place because the "best effort" of asynch rep isn't good enough
for them, and yet when it comes to failover we're just telling them
"give it your best effort".

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Synch failover WAS: Support for N synchronous standby servers - take 2

From
Andres Freund
Date:
On 2015-07-03 10:27:05 -0700, Josh Berkus wrote:
> On 07/03/2015 03:12 AM, Sawada Masahiko wrote:
> > Thanks. So we can choice the next master server using by checking the
> > progress of each server, if hot standby is enabled.
> > And a such procedure is needed even today replication.
> > 
> > I think that the #2 problem which is Josh pointed out seems to be solved;
> >     1. I need to ensure that data is replicated to X places.
> >     2. I need to *know* which places data was synchronously replicated
> > to when the master goes down.
> > And we can address #1 problem using quorum commit.
> 
> It's not solved. I still have zero ways of knowing if a replica was in
> sync or not at the time the master went down.

What?

You pick the standby that's furthest ahead. And you use a high enough
quorum so that given your tolerance for failures you'll always be able
to reach at least one of the synchronous replicas. Then you promote the
one with the highest LSN. Done.

This is something that gets *easier* by quorum, not harder.

> I forked the subject line because I think that the inability to
> identify synch replicas under failover conditions is a serious problem
> with synch rep *today*, and pretending that it doesn't exist doesn't
> help us even if we don't fix it in 9.6.

That's just not how failovers can sanely work. And again, you already *have*
the information you can have on the standbys. You *know* what the last
replayed xact is and from when.

> Let me give you three cases where our lack of information on the replica
> side about whether it thinks it's in sync or not causes synch rep to
> fail to protect data.  The first case is one I've actually seen in
> production, and the other two are hypothetical but entirely plausible.
> 
> Case #1: two synchronous replica servers have the application name
> "synchreplica".  An admin uses the wrong Chef template, and deploys a
> server which was supposed to be an async replica with the same
> recovery.conf template, and it ends up in the "synchreplica" group as
> well. Due to restarts (pushing out an update release), the new server
> ends up seizing and keeping sync. Then the master dies.  Because the new
> server wasn't supposed to be a sync replica in the first place, it is
> not checked; they just fail over to the furthest ahead of the two
> original synch replicas, neither of which was actually in synch.

Nobody can protect you against such configuration errors. We can make it
harder to misconfigure, sure, but it doesn't have anything to do with
the topic at hand.

> Case #2: "2 { local, london, nyc }" setup.  At 2am, the links between
> data centers become unreliable, such that the on-call sysadmin disables
> synch rep because commits on the master are intolerably slow.  Then, at
> 10am, the links between data centers fail entirely.  The day shift, not
> knowing that the night shift disabled sync, fail over to London thinking
> that they can do so with zero data loss.

As I said earlier, you can check against that today by checking the last
replayed timestamp. SELECT pg_last_xact_replay_timestamp();

You don't have to pick the one that used to be a sync replica. You pick
the one with the most data received.
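
Concretely, a sketch of what could be checked on each surviving standby
before promoting (both functions exist today):

    -- Commit timestamp of the last transaction replayed from the old master.
    SELECT pg_last_xact_replay_timestamp();

    -- How much WAL this standby has actually received and replayed.
    SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();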


If the day shift doesn't bother to check the standbys now, they'd not
check either if they had some way to check whether a node was the chosen
sync replica.

> Case #3 "1 { london, frankfurt }, 1 { sydney, tokyo }" multi-group
> priority setup.  We lose communication with everything but Europe.  How
> can we decide whether to wait to get sydney back, or to promote London
> immedately?

You normally don't continue automatically at all in that situation. To
avoid/minimize data loss you want to have a majority election system to
select the new primary. That requires reaching the majority of the
nodes. This isn't something specific to postgres; if you look at any
solution out there, they're all doing it that way.

Statically choosing which of the replicas in a group is the current sync
one is a *bad* idea. You want to ensure that at least one node in a group
has received the data, and stop waiting as soon as that's the case.

> It's an issue *now* that the only data we have about the state of sync
> rep is on the master, and dies with the master.   And it severely limits
> the actual utility of our synch rep.  People implement synch rep in the
> first place because the "best effort" of asynch rep isn't good enough
> for them, and yet when it comes to failover we're just telling them
> "give it your best effort".

We don't tell them that, but apparently you do.


This subthread is getting absurd, stopping here.



Re: Synch failover WAS: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Sat, Jul 4, 2015 at 2:44 AM, Andres Freund wrote:
> This subthread is getting absurd, stopping here.

Yeah, I agree with Andres here, we are making a mountain of nothing
(Frenglish?). I'll send some additional ideas to the other thread soon,
using a JSON structure.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Sawada Masahiko
Date:
On Thu, Jul 2, 2015 at 9:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Jul 2, 2015 at 5:44 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>> Hello,
>> There has been a lot of discussion. It has become a bit confusing.
>> I am summarizing my understanding of the discussion till now.
>> Kindly let me know if I missed anything important.
>>
>> Backward compatibility:
>> We have to provide support for the current format and behavior for
>> synchronous replication (The first running standby from list s_s_names)
>> In case the new format does not include GUC, then a special value to be
>> specified for s_s_names to indicate that.
>>
>> Priority and quorum:
>> Quorum treats all the standby with same priority while in priority behavior,
>> each one has a different priority and ACK must be received from the
>> specified k lowest priority servers.
>> I am not sure how combining both will work out.
>> Mostly we would like to have some standbys from each data center to be in
>> sync. Can it not be achieved by quorum only?
>
> So you're wondering if there is the use case where both quorum and priority are
> used together?
>
> For example, please imagine the case where you have two standby servers
> (say A and B) in local site, and one standby server (say C) in remote disaster
> recovery site. You want to set up sync replication so that the master waits for
> ACK from either A or B, i.e., the setting of 1(A, B). Also only when either A
> or B crashes, you want to make the master wait for ACK from either the
> remaining local standby or C. On the other hand, you don't want to use the
> setting like 1(A, B, C). Because in this setting, C can be sync standby when
> the master crashes, and both A and B might be far behind C. In this case,
> you need to promote the remote standby server C to new master,,, this is what
> you'd like to avoid.
>
> The setting that you need is 1(1[A, C], 1[B, C]) in Michael's proposed grammar.
>

If we set the remote disaster recovery site up as a synch replica, we
would get some big latencies even though we use quorum commit.
So I think the case Fujii-san suggested is a good configuration, and
many users would want to use it.
I tend to agree with combining quorum and prioritization into one GUC
parameter while keeping backward compatibility.

Regards,

--
Sawada Masahiko



Re: Support for N synchronous standby servers - take 2

From
Josh Berkus
Date:
On 07/06/2015 10:03 AM, Sawada Masahiko wrote:
>> > The setting that you need is 1(1[A, C], 1[B, C]) in Michael's proposed grammar.
>> >
> If we set the remote disaster recovery site up as synch replica, we
> would get some big latencies even though we use quorum commit.
> So I think this case Fujii-san suggested is a good configuration, and
> many users would want to use it.
> I tend to agree with combine quorum and prioritization into one GUC
> parameter while keeping backward compatibility.

OK, so here's the arguments pro-JSON and anti-JSON:

pro-JSON:

* standard syntax which is recognizable to sysadmins and devops.
* can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
additions/deletions from the synch rep config.
* can add group labels (see below)

anti-JSON:

* more verbose
* syntax is not backwards-compatible, we'd need a switch
* people will want to use line breaks, which we can't support

Re: group labels: I see a lot of value in being able to add names to
quorum groups.  Think about how this will be represented in system
views; it will be difficult to show sync status of any quorum group in
any meaningful way if the group has no label, and any system-assigned
label would change unpredictably from the user's perspective.

To give a JSON example, let's take the case of needing to sync to two of
the servers in either London or NC:

'{ "remotes" : { "london_servers" : { "quorum" : 2, "servers" : [
"london1", "london2", "london3" ] }, "nc_servers" : { "quorum" : 1,
"servers" [ "nc1", "nc2" ] } }'

This says: as the "remotes" group, synch with a quorum of 2 servers in
london and a quorum of 1 server in NC.  This assumes for
backwards-compatibility reasons that we support a priority list of
groups of quorums, and not some other combination (see below for more on
this).
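
To illustrate the ALTER SYSTEM angle, here's a sketch (hypothetical in that
it assumes the whole definition lived in a single value; jsonb_set() is from
the 9.5 development tree) of how an edit becomes programmatic rather than
string surgery:

    -- Bump the london_servers quorum from 2 to 3 in the stored definition.
    SELECT jsonb_set(
        '{ "remotes" : { "london_servers" : { "quorum" : 2,
             "servers" : [ "london1", "london2", "london3" ] },
           "nc_servers" : { "quorum" : 1,
             "servers" : [ "nc1", "nc2" ] } } }'::jsonb,
        '{remotes,london_servers,quorum}',
        '3'::jsonb);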

The advantage of having these labels is that it becomes easy to
represent statuses for them:

sync_group      state           definition
remotes         waiting         { "london_servers" : { "quorum" ...
london_servers  synced          { "quorum" : 2, "servers" : ...
nc_servers      waiting         { "quorum" : 1, "servers" : [ ...

Without labels, we force the DBA to track groups by raw definitions,
which would be difficult.  Also, there's the question of what we do on
reload with any statuses of synch groups which are currently in-process,
if we don't have a stable key with which to identify groups.

The other grammar issue has to do with the nesting nature of quorums and
priorities.  A theoretical user could want:

* a priority list of quorum groups
* a quorum group of priority lists
* a quorum group of quorum groups
* a priority list of quorum groups of quorum groups
* a quorum group of quorum groups of priority lists
... etc.

I don't really see any possible end to the possible permutations, which
is why it would be good to establish some real use cases now in
order to figure out what we really want to support.  Absent that, my
inclination is that we should implement the simplest possible thing
(i.e. no nesting) for 9.5.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Support for N synchronous standby servers - take 2

From
Amit Langote
Date:
On 2015-07-07 AM 02:56, Josh Berkus wrote:
> 
> Re: group labels: I see a lot of value in being able to add names to
> quorum groups.  Think about how this will be represented in system
> views; it will be difficult to show sync status of any quorum group in
> any meaningful way if the group has no label, and any system-assigned
> label would change unpredictably from the user's perspective.
> 
> To give a JSON example, let's take the case of needing to sync to two of
> the servers in either London or NC:
> 
> '{ "remotes" : { "london_servers" : { "quorum" : 2, "servers" : [
> "london1", "london2", "london3" ] }, "nc_servers" : { "quorum" : 1,
> "servers" [ "nc1", "nc2" ] } }'
> 

What if we write the above as:

remotes-1 (london_servers-2 [london1, london2, london3], nc_servers-1 [nc1, nc2])

That requires only slightly altering the proposed format, that is, prepending
the sync group label string to the quorum number. The monitoring view can be
made to internally generate JSON output (if needed) from it. It does not seem
very ALTER SYSTEM SET friendly, but there are trade-offs either way.

Just my 2c.

Thanks,
Amit




Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh@agliodbs.com> wrote:
> pro-JSON:
>
> * standard syntax which is recognizable to sysadmins and devops.
> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
> additions/deletions from the synch rep config.
> * can add group labels (see below)

If we go this way, I think that managing a JSON blob with a GUC
parameter is crazy; it is way longer in character size than a simple
formula because of the key names. Hence, this JSON blob should live in
a separate place from postgresql.conf, not within the catalog tables,
be manageable using an SQL interface, and be reloaded in backends using
SIGHUP.

> anti-JSON:
> * more verbose
> * syntax is not backwards-compatible, we'd need a switch

This point is valid as well in the pro-JSON portion.

> * people will want to use line breaks, which we can't support

Yes, this is caused by the fact of using a GUC. For a simple formula
this seems fine to me though; that's what we have today for s_s_names,
and using a formula is not much longer in character size than what we
have now.

> Re: group labels: I see a lot of value in being able to add names to
> quorum groups.  Think about how this will be represented in system
> views; it will be difficult to show sync status of any quorum group in
> any meaningful way if the group has no label, and any system-assigned
> label would change unpredictably from the user's perspective.
> To give a JSON example, let's take the case of needing to sync to two of
> the servers in either London or NC:
>
> '{ "remotes" : { "london_servers" : { "quorum" : 2, "servers" : [
> "london1", "london2", "london3" ] }, "nc_servers" : { "quorum" : 1,
> "servers" [ "nc1", "nc2" ] } }'

The JSON blob managing sync node information could contain additional
JSON objects that register a set of nodes as a given group. More
simply, you could use, say, the following structure to store the
blobs:
- pg_syncinfo/global, to store the root of the formula, which could use groups.
- pg_syncinfo/groups/$GROUP_NAME, to store a set of JSON blobs representing a group.
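
For instance (layout and contents purely hypothetical, following the
structure above):

    pg_syncinfo/global                 { "quorum" : 1, "groups" : [ "london_servers", "nc_servers" ] }
    pg_syncinfo/groups/london_servers  { "quorum" : 2, "servers" : [ "london1", "london2", "london3" ] }
    pg_syncinfo/groups/nc_servers      { "quorum" : 1, "servers" : [ "nc1", "nc2" ] }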

> The advantage of having these labels is that it becomes easy to
> represent statuses for them:
>
> sync_group      state           definition
> remotes         waiting         { "london_servers" : { "quorum" ...
> london_servers  synced          { "quorum" : 2, "servers" : ...
> nc_servers      waiting         { "quorum" : 1, "servers" : [ ...
> Without labels, we force the DBA to track groups by raw definitions,
> which would be difficult.  Also, there's the question of what we do on
> reload with any statuses of synch groups which are currently in-process,
> if we don't have a stable key with which to identify groups.

Well, yes.

> The other grammar issue has to do with the nesting nature of quorums and
> priorities.  A theoretical user could want:
>
> * a priority list of quorum groups
> * a quorum group of priority lists
> * a quorum group of quorum groups
> * a priority list of quorum groups of quorum groups
> * a quorum group of quorum groups of priority lists
> ... etc.
>
> I don't really see any possible end to the possible permutations, which
> is why it would be good to establish some real use cases from now in
> order to figure out what we really want to support.  Absent that, my
> inclination is that we should implement the simplest possible thing
> (i.e. no nesting) for 9.5.

I am not sure I agree that this will simplify the work. Currently
s_s_names already has 1 level, and we want to append groups to each
element of it as well, meaning that we'll need at least 2 levels of
nesting.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Josh Berkus
Date:
On 07/06/2015 06:40 PM, Michael Paquier wrote:
> On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> pro-JSON:
>>
>> * standard syntax which is recognizable to sysadmins and devops.
>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
>> additions/deletions from the synch rep config.
>> * can add group labels (see below)
> 
> If we go this way, I think that managing a JSON blob with a GUC
> parameter is crazy, this is way longer in character size than a simple
> formula because of the key names. Hence, this JSON blob should be in a
> separate place than postgresql.conf not within the catalog tables,
> manageable using an SQL interface, and reloaded in backends using
> SIGHUP.

I'm not following this at all.  What are you saying here?

>> I don't really see any possible end to the possible permutations, which
>> is why it would be good to establish some real use cases from now in
>> order to figure out what we really want to support.  Absent that, my
>> inclination is that we should implement the simplest possible thing
>> (i.e. no nesting) for 9.5.
> 
> I am not sure I agree that this will simplify the work. Currently
> s_s_names has already 1 level, and we want to append groups to each
> element of it as well, meaning that we'll need at least 2 level of
> nesting.

Well, we have to draw a line somewhere, unless we're going to support
infinite recursion.

And if we are going to support infinite recursion, any kind of compact
syntax for a GUC isn't even worth talking about ...


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Tue, Jul 7, 2015 at 12:51 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 07/06/2015 06:40 PM, Michael Paquier wrote:
>> On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh@agliodbs.com> wrote:
>>> pro-JSON:
>>>
>>> * standard syntax which is recognizable to sysadmins and devops.
>>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
>>> additions/deletions from the synch rep config.
>>> * can add group labels (see below)
>>
>> If we go this way, I think that managing a JSON blob with a GUC
>> parameter is crazy, this is way longer in character size than a simple
>> formula because of the key names. Hence, this JSON blob should be in a
>> separate place than postgresql.conf not within the catalog tables,
>> manageable using an SQL interface, and reloaded in backends using
>> SIGHUP.
>
> I'm not following this at all.  What are you saying here?

A JSON string is longer in terms of number of characters than a
formula because it contains key names, and those key names are usually
repeated several times, making it harder to read in a configuration
file. So what I am saying is that we do not save it as a GUC, but as
separate metadata that can be accessed with a set of SQL functions
to manipulate it.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Josh Berkus
Date:
On 07/06/2015 09:56 PM, Michael Paquier wrote:
> On Tue, Jul 7, 2015 at 12:51 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 07/06/2015 06:40 PM, Michael Paquier wrote:
>>> On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh@agliodbs.com> wrote:
>>>> pro-JSON:
>>>>
>>>> * standard syntax which is recognizable to sysadmins and devops.
>>>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
>>>> additions/deletions from the synch rep config.
>>>> * can add group labels (see below)
>>>
>>> If we go this way, I think that managing a JSON blob with a GUC
>>> parameter is crazy, this is way longer in character size than a simple
>>> formula because of the key names. Hence, this JSON blob should be in a
>>> separate place than postgresql.conf not within the catalog tables,
>>> manageable using an SQL interface, and reloaded in backends using
>>> SIGHUP.
>>
>> I'm not following this at all.  What are you saying here?
> 
> A JSON string is longer in terms of number of characters than a
> formula because it contains key names, and those key names are usually
> repeated several times, making it harder to read in a configuration
> file. So what I am saying that that we do not save it as a GUC, but as
> a separate metadata that can be accessed with a set of SQL functions
> to manipulate it.

Where, though?  Someone already pointed out the issues with storing it
in a system catalog, and adding an additional .conf file with a
different format is too horrible to contemplate.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Josh Berkus wrote:

> '{ "remotes" : { "london_servers" : { "quorum" : 2, "servers" : [ 
> "london1", "london2", "london3" ] }, "nc_servers" : { "quorum" : 1, 
> "servers" [ "nc1", "nc2" ] } }' 
> 
> This says: as the "remotes" group, synch with a quorum of 2 servers in 
> london and a quorum of 1 server in NC.

I wanted to clarify the format.
The remotes group does not specify any quorum; only its individual elements
mention the quorum.
"remotes" is said to sync in london_servers "and" NC.
Would the absence of a quorum number in a group mean "all" elements?
Or would the above be represented as follows, to imply "AND" between the 2
DCs?

'{ "remotes" : "quorum" : 2, "servers" :{ "london_servers" :     { "quorum" : 2, "servers" : [ "london1", "london2",
"london3"] },  "nc_servers" :     { "quorum" : 1, "servers" : [ "nc1", "nc2" ] } }
 
}'





--
Beena Emerson



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Amit wrote:
> What if we write the above as: 
> 
> remotes-1 (london_servers-2 [london1, london2, london3], nc_servers-1
> [nc1, nc2])

Yes, this we can consider.

Thanks,



--
Beena Emerson



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Tue, Jul 7, 2015 at 2:19 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 07/06/2015 09:56 PM, Michael Paquier wrote:
>> On Tue, Jul 7, 2015 at 12:51 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> On 07/06/2015 06:40 PM, Michael Paquier wrote:
>>>> On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh@agliodbs.com> wrote:
>>>>> pro-JSON:
>>>>>
>>>>> * standard syntax which is recognizable to sysadmins and devops.
>>>>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
>>>>> additions/deletions from the synch rep config.
>>>>> * can add group labels (see below)
>>>>
>>>> If we go this way, I think that managing a JSON blob with a GUC
>>>> parameter is crazy, this is way longer in character size than a simple
>>>> formula because of the key names. Hence, this JSON blob should be in a
>>>> separate place than postgresql.conf not within the catalog tables,
>>>> manageable using an SQL interface, and reloaded in backends using
>>>> SIGHUP.
>>>
>>> I'm not following this at all.  What are you saying here?
>>
>> A JSON string is longer in terms of number of characters than a
>> formula because it contains key names, and those key names are usually
>> repeated several times, making it harder to read in a configuration
>> file. So what I am saying that that we do not save it as a GUC, but as
>> a separate metadata that can be accessed with a set of SQL functions
>> to manipulate it.
>
> Where, though?  Someone already pointed out the issues with storing it
> in a system catalog, and adding an additional .conf file with a
> different format is too horrible to contemplate.

Something like pg_syncinfo/ coupled with a LW lock, we already do
something similar for replication slots with pg_replslot/.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Hello,

Tue, Jul 7, 2015 at 02:56 AM, Josh Berkus wrote:
> pro-JSON: 
> 
> * standard syntax which is recognizable to sysadmins and devops. 
> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make 
> additions/deletions from the synch rep config. 
> * can add group labels (see below) 

Adding group labels does have a lot of value, but as Amit has pointed out,
with a little modification they can be included in the GUC as well. It will
not make it any more complex.

On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote:

> Something like pg_syncinfo/ coupled with a LW lock, we already do 
> something similar for replication slots with pg_replslot/.

I was trying to figure out how the JSON metadata can be used.
It would have to be set using a given set of functions. Right?
I am sorry this question is very basic.

The functions could be something like:
1. pg_add_synch_set(set_name NAME, quorum INT, is_priority bool, set_members
VARIADIC)

This will be used to add a sync set. The set_members can be individual
elements or another set name. The parameter is_priority is used to decide
whether the set is a priority (true) or quorum (false) set. This function call
will create a folder pg_syncinfo/groups/$NAME and store the JSON blob?

The root group would be automatically set by finding the group which is not
included in other groups? Or can it be set by another function?

2. pg_modify_sync_set(set_name NAME, quorum INT, is_priority bool,
set_members VARIADIC)

This will update the pg_syncinfo/groups/$NAME to store the new values.

3. pg_drop_synch_set(set_name NAME)

This will remove the pg_syncinfo/groups/$NAME folder. Also, all the groups
which included this set would be updated?

4. pg_show_synch_set()

This will display the current sync setting in JSON format.
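
For illustration, a sketch of how the functions proposed above might be
called (every function here is hypothetical; this is the proposal, not an
existing API):

    -- Hypothetical: define two DC-level quorum sets, then a set over them.
    SELECT pg_add_synch_set('london_servers', 2, false, 'london1', 'london2', 'london3');
    SELECT pg_add_synch_set('nc_servers', 1, false, 'nc1', 'nc2');
    SELECT pg_add_synch_set('remotes', 2, false, 'london_servers', 'nc_servers');
    SELECT pg_show_synch_set();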

Am I missing something?

Is JSON being preferred because it would be ALTER SYSTEM friendly and in a
format already known to users?

In a real-life scenario, how many groups and how much nesting would be
expected at most?



--
Beena Emerson



Re: Support for N synchronous standby servers - take 2

From
Sawada Masahiko
Date:
On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello,
>
> Tue, Jul 7, 2015 at 02:56 AM, Josh Berkus wrote:
>> pro-JSON:
>>
>> * standard syntax which is recognizable to sysadmins and devops.
>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
>> additions/deletions from the synch rep config.
>> * can add group labels (see below)
>
> Adding group labels do have a lot of values but as Amit has pointed out,
> with little modification, they can be included in GUC as well. It will not
> make it any more complex.
>
> On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote:
>
>> Something like pg_syncinfo/ coupled with a LW lock, we already do
>> something similar for replication slots with pg_replslot/.
>
> I was trying to figure out how the JSON metadata can be used.
> It would have to be set using a given set of functions. Right?
> I am sorry this question is very basic.
>
> The functions could be something like:
> 1. pg_add_synch_set(set_name NAME, quorum INT, is_priority bool, set_members
> VARIADIC)
>
> This will be used to add a sync set. The set_members can be individual
> elements of another set name. The parameter is_priority is used to decide
> whether the set is priority (true) set or quorum (false). This function call
> will  create a folder pg_syncinfo/groups/$NAME and store the json blob?
>
> The root group would be automatically sset by finding the group which is not
> included in other groups? or can be set by another function?
>
> 2. pg_modify_sync_set(set_name NAME, quorum INT, is_priority bool,
> set_members VARIADIC)
>
> This will update the pg_syncinfo/groups/$NAME to store the new values.
>
> 3. pg_drop_synch_set(set_name NAME)
>
> This will update the pg_syncinfo/groups/$NAME folder. Also all the groups
> which included this would be updated?
>
> 4. pg_show_synch_set()
>
> this will display the current sync setting in json format.
>
> Am I missing something?
>
> Is JSON being preferred because it would be ALTER SYSTEM friendly and in a
> format already known to users?
>
> In a real-life scenario, at most how many groups and nesting would be
> expected?
>

I might be missing something, but will these functions generate WAL?
If they do, we will face the situation where we need to wait
forever, as Fujii-san pointed out.


Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Mon, Jul 13, 2015 at 9:22 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> I might missing something but, these functions will generate WAL?
> If they does, we will face the situation where we need to wait
> forever, Fujii-san pointed out.

No, those functions are here to manipulate the metadata defining the
quorum/priority set. We definitely do not want something that
generates WAL.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello,
>
> Tue, Jul 7, 2015 at 02:56 AM, Josh Berkus wrote:
>> pro-JSON:
>>
>> * standard syntax which is recognizable to sysadmins and devops.
>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
>> additions/deletions from the synch rep config.
>> * can add group labels (see below)
>
> Adding group labels do have a lot of values but as Amit has pointed out,
> with little modification, they can be included in GUC as well.

Or you can extend the custom GUC mechanism so that we can
specify the groups by using them, for example,

    quorum_commit.mygroup1 = 'london, nyc'
    quorum_commit.mygroup2 = 'tokyo, pune'
    synchronous_standby_names = '1(mygroup1), 1(mygroup2)'

> On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote:
>
>> Something like pg_syncinfo/ coupled with a LW lock, we already do
>> something similar for replication slots with pg_replslot/.
>
> I was trying to figure out how the JSON metadata can be used.
> It would have to be set using a given set of functions.

So we could use only such a set of functions to configure synch rep?
I don't like that idea, because it prevents us from configuring that
while the server is not running.

> Is JSON being preferred because it would be ALTER SYSTEM friendly and in a
> format already known to users?

At least currently, ALTER SYSTEM cannot accept JSON data
(e.g., the return value of a JSON function like json_build_object())
as the setting value. So I'm not sure how friendly ALTER SYSTEM and
the JSON format really are. If you want to argue that, you probably
need to improve ALTER SYSTEM so that JSON can be specified.
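
To make the limitation concrete, a sketch (the point is only the mechanism;
whether the server would understand a JSON value in this GUC is a separate
question):

    -- This fails: ALTER SYSTEM accepts only literal constants, not expressions.
    ALTER SYSTEM SET synchronous_standby_names =
        json_build_object('quorum', 2)::text;

    -- A pre-built string passed as a plain constant is accepted.
    ALTER SYSTEM SET synchronous_standby_names = '{ "quorum" : 2 }';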

> In a real-life scenario, at most how many groups and nesting would be
> expected?

I don't think that many groups and nestings are common.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Mon, Jul 13, 2015 at 10:34 PM, Fujii Masao wrote:
> On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson wrote:
>> On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote:
>>
>>> Something like pg_syncinfo/ coupled with a LW lock, we already do
>>> something similar for replication slots with pg_replslot/.
>>
>> I was trying to figure out how the JSON metadata can be used.
>> It would have to be set using a given set of functions.
>
> So we can use only such a set of functions to configure synch rep?
> I don't like that idea. Because it prevents us from configuring that
> while the server is not running.

If you store a json blob in a set of files of PGDATA you could update
them manually there as well. That's perhaps re-inventing the wheel
with what is available with GUCs though.

>> Is JSON being preferred because it would be ALTER SYSTEM friendly and in a
>> format already known to users?
>
> At least currently ALTER SYSTEM cannot accept the JSON data
> (e.g., the return value of JSON function like json_build_object())
> as the setting value. So I'm not sure how friendly ALTER SYSTEM
> and JSON format really. If you want to argue that, probably you
> need to improve ALTER SYSTEM so that JSON can be specified.
>
>> In a real-life scenario, at most how many groups and nesting would be
>> expected?
>
> I don't think that many groups and nestings are common.

Yeah, in most common configurations people are not going to have more
than 3 groups with only one level of nodes.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Jul 14, 2015 at 9:00 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Jul 13, 2015 at 10:34 PM, Fujii Masao wrote:
>> On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson wrote:
>>> On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote:
>>>
>>>> Something like pg_syncinfo/ coupled with a LW lock, we already do
>>>> something similar for replication slots with pg_replslot/.
>>>
>>> I was trying to figure out how the JSON metadata can be used.
>>> It would have to be set using a given set of functions.
>>
>> So we can use only such a set of functions to configure synch rep?
>> I don't like that idea. Because it prevents us from configuring that
>> while the server is not running.
>
> If you store a json blob in a set of files of PGDATA you could update
> them manually there as well. That's perhaps re-inventing the wheel
> with what is available with GUCs though.

Why don't we just use GUC? If the quorum setting is not so complicated
in real scenario, GUC seems enough for that.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
On Jul 14, 2015 7:15 AM, "Fujii Masao" <masao.fujii@gmail.com> wrote:
>
> On Tue, Jul 14, 2015 at 9:00 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Mon, Jul 13, 2015 at 10:34 PM, Fujii Masao wrote:
> >> On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson wrote:
> >>> On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote:
> >>>
> >>>> Something like pg_syncinfo/ coupled with a LW lock, we already do
> >>>> something similar for replication slots with pg_replslot/.
> >>>
> >>> I was trying to figure out how the JSON metadata can be used.
> >>> It would have to be set using a given set of functions.
> >>
> >> So we can use only such a set of functions to configure synch rep?
> >> I don't like that idea. Because it prevents us from configuring that
> >> while the server is not running.
> >
> > If you store a json blob in a set of files of PGDATA you could update
> > them manually there as well. That's perhaps re-inventing the wheel
> > with what is available with GUCs though.
>
> Why don't we just use GUC? If the quorum setting is not so complicated
> in real scenario, GUC seems enough for that.

I agree GUC would be enough.
We could also name groups in it.

I am thinking of the following format similar to JSON

<group_name>: <count> (<list>)
Use of square brackets for priority.

Ex:
s_s_names = 'remotes: 2 (london: 1 [lndn1, lndn2], nyc: 1 [nyc1, nyc2])'

Regards,

Beena Emerson

Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Fri, Jun 26, 2015 at 11:16 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs  wrote:
> > Let's start with a complex, fully described use case then work out how to
> > specify what we want.
>
> Well, one of the most simple cases where quorum commit and this
> feature would be useful for is that, with 2 data centers:
> - on center 1, master A and standby B
> - on center 2, standby C and standby D
> With the current synchronous_standby_names, what we can do now is
> ensuring that one node has acknowledged the commit of master. For
> example synchronous_standby_names = 'B,C,D'. But you know that :)
> What this feature would allow use to do is for example being able to
> ensure that a node on the data center 2 has acknowledged the commit of
> master, meaning that even if data center 1 completely lost for a
> reason or another we have at least one node on center 2 that has lost
> no data at transaction commit.
>

I think the way to address this could be via SQL Syntax as that
will make users life easier.

Create Replication Setup Master A
    Sync_Priority_Standby B
    Sync_Group_Any_Standby C,D
    Sync_Group_Fixed_Standby 2,E,F,G

where
Sync_Priority_Standby - means same as current setting in
synchronous_standby_names

Sync_Group_Any_Standby - means if any one in the group has
acknowledged commit master can proceed

Sync_Group_Fixed_Standby - means a fixed number
(the first parameter following this option) of standbys from this
group should acknowledge the commit before the master can proceed.

The above syntax is just to explain the idea, but I think we can invent
better syntax if required.  We can define these as options in the syntax,
like we do in some other commands, to avoid creating more keywords.
We need to ensure that all these option values are persisted.

> Now, regarding the way to express that, we need to use a concept of
> node group for each element of synchronous_standby_names. A group
> contains a set of elements, each element being a group or a single
> node. And for each group we need to know three things when a commit
> needs to be acknowledged:
> - Does my group need to acknowledge the commit?
> - If yes, how many elements in my group need to acknowledge it?
> - Does the order of my elements matter?
>

I think with the above kind of syntax we can address all these points,
and even if something is missing it is easily extendable.

> That's where the micro-language idea makes sense to use. 

The micro-language idea is good, but I think if we can provide some
syntax or SQL functions, then it can be convenient for users to
specify the replication topology.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 7 July 2015 at 07:03, Michael Paquier <michael.paquier@gmail.com> wrote:
On Tue, Jul 7, 2015 at 2:19 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 07/06/2015 09:56 PM, Michael Paquier wrote:
>> On Tue, Jul 7, 2015 at 12:51 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> On 07/06/2015 06:40 PM, Michael Paquier wrote:
>>>> On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh@agliodbs.com> wrote:
>>>>> pro-JSON:
>>>>>
>>>>> * standard syntax which is recognizable to sysadmins and devops.
>>>>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make
>>>>> additions/deletions from the synch rep config.
>>>>> * can add group labels (see below)
>>>>
>>>> If we go this way, I think that managing a JSON blob with a GUC
>>>> parameter is crazy, this is way longer in character size than a simple
>>>> formula because of the key names. Hence, this JSON blob should be in a
>>>> separate place than postgresql.conf not within the catalog tables,
>>>> manageable using an SQL interface, and reloaded in backends using
>>>> SIGHUP.
>>>
>>> I'm not following this at all.  What are you saying here?
>>
>> A JSON string is longer in terms of number of characters than a
>> formula because it contains key names, and those key names are usually
>> repeated several times, making it harder to read in a configuration
>> file. So what I am saying is that we do not save it as a GUC, but as
>> a separate metadata that can be accessed with a set of SQL functions
>> to manipulate it.
>
> Where, though?  Someone already pointed out the issues with storing it
> in a system catalog, and adding an additional .conf file with a
> different format is too horrible to contemplate.

Something like pg_syncinfo/ coupled with a LW lock, we already do
something similar for replication slots with pg_replslot/.

-1 to pg_syncinfo/

pg_replslot has persistent state. We are discussing permanent configuration data, for which I don't see the need to create an additional parallel infrastructure just to store a string, given the stated objection that the string is fairly long. AFAICS it's not even that long.

...

JSON seems the most sensible format for the string. Inventing a new one doesn't make sense. Most important for me is the ability to programmatically manipulate/edit the config string, which would be harder with a new custom format.
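
As a minimal sketch of the kind of programmatic editing meant here (the
blob shape is hypothetical; jsonb_set() and the || concatenation
operator are the 9.5 jsonb facilities):

SELECT jsonb_set(cfg, '{groups,london}',
                 (cfg #> '{groups,london}') || '"lndn3"'::jsonb)
FROM (SELECT '{"groups": {"london": ["lndn1", "lndn2"]}}'::jsonb AS cfg) s;

The resulting text could then be written back with ALTER SYSTEM SET.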

...

Group labels are essential.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 2 July 2015 at 19:50, Josh Berkus <josh@agliodbs.com> wrote:
 
So there's two parts to this:

1. I need to ensure that data is replicated to X places.

2. I need to *know* which places data was synchronously replicated to
when the master goes down.

My entire point is that (1) alone is useless unless you also have (2).
And do note that I'm talking about information on the replica, not on
the master, since in any failure situation we don't have the old master
around to check.

You might *think* you know, but given we are in this situation because of an unexpected failure, it seems strange to specifically avoid checking before you proceed.

Bacon not Aristotle.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 29 June 2015 at 18:40, Josh Berkus <josh@agliodbs.com> wrote:
  
I'm in favor of a more robust and sophisticated synch rep.  But not if
nobody not on this mailing list can configure it, and not if even we
don't know what it will do in an actual failure situation.

That's the key point. Editing the config after a failure is a Failure of Best Practice in an HA system.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Jul 15, 2015 at 3:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> pg_replslot has persistent state. We are discussing permanent configuration
> data for which I don't see the need to create an additional parallel
> infrastructure just to store a string given stated objection that the string
> is fairly long. AFAICS its not even that long.
>
> ...
>
> JSON seems the most sensible format for the string. Inventing a new one
> doesn't make sense. Most important for me is the ability to programmatically
> manipulate/edit the config string, which would be harder with a new custom
> format.
>
> ...
>
> Group labels are essential.

OK, so this is leading us to the following points:
- Use a JSON object to define the quorum/priority groups for the sync state.
- Store it as a GUC, and use the check hook to validate its format,
which is what we have now with s_s_names
- Rely on SIGHUP to maintain an in-memory image of the quorum/priority
sync state
- Have the possibility to define group labels in this JSON blob, and
be able to use those labels in a quorum or priority sync definition.
- For backward-compatibility, use for example s_s_names = 'json' to
switch to the new system.

Also, as a first step of the implementation, do we actually need a set
of functions to manipulate the JSON blob? I mean, we could perhaps
have them in contrib/ but they do not seem mandatory as long as we
document correctly how to define a label group and a quorum
or priority group, no?
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 15 July 2015 at 10:03, Michael Paquier <michael.paquier@gmail.com> wrote:
 
OK, so this is leading us to the following points:
- Use a JSON object to define the quorum/priority groups for the sync state.
- Store it as a GUC, and use the check hook to validate its format,
which is what we have now with s_s_names
- Rely on SIGHUP to maintain an in-memory image of the quorum/priority
sync state
- Have the possibility to define group labels in this JSON blob, and
be able to use those labels in a quorum or priority sync definition.

+1
 
- For backward-compatibility, use for example s_s_names = 'json' to
switch to the new system.

Seems easy enough to check whether it has a leading { and then treat it as an attempt to use JSON (which may fail), otherwise use the old syntax.
 
Also, as a first step of the implementation, do we actually need a set
of functions to manipulate the JSON blob? I mean, we could perhaps
have them in contrib/ but they do not seem mandatory as long as we
document correctly how to define a label group and a quorum
or priority group, no?

Agreed, no specific functions needed to manipulate this field. 

If we lack the means to manipulate JSON in SQL, that can be solved outside the scope of this patch, because it's just JSON.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Alvaro Herrera
Date:
Simon Riggs wrote:

> JSON seems the most sensible format for the string. Inventing a new one
> doesn't make sense. Most important for me is the ability to
> programmatically manipulate/edit the config string, which would be harder
> with a new custom format.

Do we need to keep the value consistent across all the servers in the
flock?  If not, is the behavior halfway sane upon failover?

If we need the DBA to keep the value in sync manually, that's going to
be a recipe for trouble.  Which is going to bite particularly hard
during those stressing moments when disaster strikes and things have to
be done in emergency mode.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 15 July 2015 at 12:25, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Simon Riggs wrote:

> JSON seems the most sensible format for the string. Inventing a new one
> doesn't make sense. Most important for me is the ability to
> programmatically manipulate/edit the config string, which would be harder
> with a new custom format.

Do we need to keep the value consistent across all the servers in the
flock?  If not, is the behavior halfway sane upon failover?

Mostly, yes. Which means it doesn't change much, so config data is OK.
 
If we need the DBA to keep the value in sync manually, that's going to
be a recipe for trouble.  Which is going to bite particularly hard
during those stressing moments when disaster strikes and things have to
be done in emergency mode.

Manual config itself is the recipe for trouble, not this particular setting. There are already many other settings that need to be the same on all nodes for example. Nothing here changes that. This is just an enhancement of the current technology.

For the future, a richer mechanism for defining nodes and their associated metadata is needed for logical replication and clustering. That is not what is being discussed here though, nor should we begin!

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Wed, Jul 15, 2015 at 5:03 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>> Group labels are essential.
>
> OK, so this is leading us to the following points:
> - Use a JSON object to define the quorum/priority groups for the sync state.
> - Store it as a GUC, and use the check hook to validate its format,
> which is what we have now with s_s_names
> - Rely on SIGHUP to maintain an in-memory image of the quorum/priority
> sync state
> - Have the possibility to define group labels in this JSON blob, and
> be able to use those labels in a quorum or priority sync definition.
> - For backward-compatibility, use for example s_s_names = 'json' to
> switch to the new system.

Personally, I think we're going to find that using JSON for this
rather than a custom syntax makes the configuration strings two or
three times as long for no discernable benefit.

But I just work here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 16 July 2015 at 18:27, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jul 15, 2015 at 5:03 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>> Group labels are essential.
>
> OK, so this is leading us to the following points:
> - Use a JSON object to define the quorum/priority groups for the sync state.
> - Store it as a GUC, and use the check hook to validate its format,
> which is what we have now with s_s_names
> - Rely on SIGHUP to maintain an in-memory image of the quorum/priority
> sync state
> - Have the possibility to define group labels in this JSON blob, and
> be able to use those labels in a quorum or priority sync definition.
> - For backward-compatibility, use for example s_s_names = 'json' to
> switch to the new system.

Personally, I think we're going to find that using JSON for this
rather than a custom syntax makes the configuration strings two or
three times as long for

They may well be 2-3 times as long. Why is that a negative?
 
no discernable benefit.

Benefits:
* More readable
* Easy to validate
* No additional code required in the server to support this syntax (so no bugs)
* Developers will immediately understand the format
* Easy to programmatically manipulate in a range of languages

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Thu, Jul 16, 2015 at 1:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Personally, I think we're going to find that using JSON for this
>> rather than a custom syntax makes the configuration strings two or
>> three times as long for
>
> They may well be 2-3 times as long. Why is that a negative?

In my opinion, brevity makes things easier to read and understand.  We
also don't support multi-line GUCs, so if your configuration takes 140
characters, you're going to have a very long line in your
postgresql.conf (and in your pg_settings output, etc.)

> * No additional code required in the server to support this syntax (so no
> bugs)

I think you'll find that this is far from true.  Presumably not any
arbitrary JSON object will be acceptable.  You'll have to parse it as
JSON, and then validate that it is of the expected form.  It may not
be MORE code than implementing a mini-language from scratch, but I
wouldn't expect to save much.

> * Developers will immediately understand the format

I doubt it.  I think any format that we pick will have to be carefully
documented.  People may know what JSON looks like in general, but they
will not immediately know what bells and whistles are available in
this context.

> * Easy to programmatically manipulate in a range of languages

I agree that JSON has that advantage, but I doubt that it is important
here.  I would expect that people might need to generate a new config
string and dump it into postgresql.conf, but that should be easy with
any reasonable format.  I think it will be rare to need to parse the
postgresql.conf string, manipulate it programmatically, and then put it
back.  As we've already said, most configurations are simple and
shouldn't change frequently.  If they're not or they do, that's a
problem in itself.

However, I'm not trying to ram my idea through; I'm just telling you my opinion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Thu, Jul 16, 2015 at 11:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jul 16, 2015 at 1:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> > * Developers will immediately understand the format
>
> I doubt it.  I think any format that we pick will have to be carefully
> documented.  People may know what JSON looks like in general, but they
> will not immediately know what bells and whistles are available in
> this context.
>

I also think any format where the user has to carefully remember how the
values must be provided is not user-friendly. Why is SQL-based syntax not
preferable in this case? With that we can even achieve consistency of this
parameter across all servers, which I think is not of utmost importance
for this feature, but I still think it will make users happy.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Robert Haas wrote:
> 
> On Thu, Jul 16, 2015 at 1:32 PM, Simon Riggs <simon@> wrote:
> >> Personally, I think we're going to find that using JSON for this
> >> rather than a custom syntax makes the configuration strings two or
> >> three times as long for
> >
> > They may well be 2-3 times as long. Why is that a negative?
> 
> In my opinion, brevity makes things easier to read and understand.  We
> also don't support multi-line GUCs, so if your configuration takes 140
> characters, you're going to have a very long line in your
> postgresql.conf (and in your pg_settings output, etc.)
> 
> > * No additional code required in the server to support this syntax (so
> no
> > bugs)
> 
> I think you'll find that this is far from true.  Presumably not any
> arbitrary JSON object will be acceptable.  You'll have to parse it as
> JSON, and then validate that it is of the expected form.  It may not
> be MORE code than implementing a mini-language from scratch, but I
> wouldn't expect to save much.
> 
> > * Developers will immediately understand the format
> 
> I doubt it.  I think any format that we pick will have to be carefully
> documented.  People may know what JSON looks like in general, but they
> will not immediately know what bells and whistles are available in
> this context.
> 
> * Easy to programmatically manipulate in a range of languages
> 
> I agree that JSON has that advantage, but I doubt that it is important
> here.  I would expect that people might need to generate a new config
> string and dump it into postgresql.conf, but that should be easy with
> any reasonable format.  I think it will be rare to need to parse the
> postgresql.conf string, manipulate it programmatically, and then put it
> back.  As we've already said, most configurations are simple and
> shouldn't change frequently.  If they're not or they do, that's a
> problem in itself.
> 

All points here are valid and I would prefer a new language over JSON. I
agree the new validation code would have to be properly tested to avoid
bugs, but it won't be too difficult.

Also, I think methods that generate a WAL record are avoided because any
attempt to change the syncrep settings would wait indefinitely when a
mandatory sync candidate (as per the current settings) goes down (explained
in the earlier post id: CAHGQGwE_-HCzw687B4SdMWqAkkPcu-uxmF3MKyDB9mu38cJ7Jg@mail.gmail.com)





-----
Beena Emerson




Re: Support for N synchronous standby servers - take 2

From
Jim Nasby
Date:
On 7/16/15 12:40 PM, Robert Haas wrote:
>> >They may well be 2-3 times as long. Why is that a negative?
> In my opinion, brevity makes things easier to read and understand.  We
> also don't support multi-line GUCs, so if your configuration takes 140
> characters, you're going to have a very long line in your
> postgresql.conf (and in your pg_settings output, etc.)

Brevity goes both ways, but I don't think that's the real problem here; 
it's the lack of multi-line support. The JSON that's been proposed makes 
you work really hard to track what level of nesting you're at, while 
every alternative format I've seen is terse enough to be very clear on a 
single line.

I'm guessing it'd be really ugly/hard to support at least this GUC being 
multi-line?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Support for N synchronous standby servers - take 2

From
Josh Berkus
Date:
On 07/17/2015 04:36 PM, Jim Nasby wrote:
> On 7/16/15 12:40 PM, Robert Haas wrote:
>>> >They may well be 2-3 times as long. Why is that a negative?
>> In my opinion, brevity makes things easier to read and understand.  We
>> also don't support multi-line GUCs, so if your configuration takes 140
>> characters, you're going to have a very long line in your
>> postgresql.conf (and in your pg_settings output, etc.)
> 
> Brevity goes both ways, but I don't think that's the real problem here;
> it's the lack of multi-line support. The JSON that's been proposed makes
> you work really hard to track what level of nesting you're at, while
> every alternative format I've seen is terse enough to be very clear on a
> single line.

I will point out that the proposed non-JSON syntax does not offer any
ability to name consensus/priority groups.  I believe that being able to
name groups is vital to managing any complex synch rep, but if we add
names it will make the non-JSON syntax less compact.

> 
> I'm guessing it'd be really ugly/hard to support at least this GUC being
> multi-line?

Yes.

Mind you, multi-line GUCs would be useful otherwise, but we don't want
to hinge this feature on making that work.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Support for N synchronous standby servers - take 2

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> On 07/17/2015 04:36 PM, Jim Nasby wrote:
>> I'm guessing it'd be really ugly/hard to support at least this GUC being
>> multi-line?

> Mind you, multi-line GUCs would be useful otherwise, but we don't want
> to hinge this feature on making that work.

I'm pretty sure that changing the GUC parser to allow quoted strings to
continue across lines would be trivial.  The problem with it is not that
it's hard, it's that omitting a closing quote mark would then result in
the entire file being syntactically broken, with the error message(s)
almost certainly pointing somewhere else than where the actual mistake is.
Do we really want such a global reduction in friendliness to make this
feature easier?
        regards, tom lane



Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 19 July 2015 at 21:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Josh Berkus <josh@agliodbs.com> writes:
> On 07/17/2015 04:36 PM, Jim Nasby wrote:
>> I'm guessing it'd be really ugly/hard to support at least this GUC being
>> multi-line?

> Mind you, multi-line GUCs would be useful otherwise, but we don't want
> to hinge this feature on making that work.

I'm pretty sure that changing the GUC parser to allow quoted strings to
continue across lines would be trivial.

Agreed
 
The problem with it is not that
it's hard, it's that omitting a closing quote mark would then result in
the entire file being syntactically broken, with the error message(s)
almost certainly pointing somewhere else than where the actual mistake is.

That depends upon how we specify line-continuation. If we do it with starting and ending quotes, then we would have the problem you suggest. If we required each new continuation line to start with a \ then it wouldn't (or similar). Or perhaps it gets its own file even, an idea raised before.
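
A hypothetical example of that continuation style, purely to illustrate
the idea (nothing like this is implemented):

synchronous_standby_names = '2(london: 2[lndn1, lndn2, lndn3],'
    \ ' nyc: 1[ny1, ny2])'

Since every wrapped line must begin with a backslash, a missing closing
quote would break only that one entry rather than the rest of the file.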

Do we really want such a global reduction in friendliness to make this
feature easier?

Clearly not, but we must first decide whether that is how we characterise the decision.

synchronous_standby_name= is already 25 characters, so that leaves 115 characters - are they always single byte chars?

It's not black and white for me that JSON necessarily requires >115 chars whereas other formats never will.

What we are discussing is expanding an existing parameter to include more information. If Josh gets some of the things he's been asking for, then the format will bloat further. It doesn't take much for me to believe it might expand further still, so my view from the discussion is that we'll likely need to expand beyond 115 chars one day whatever format we choose.

I'm personally ambivalent about the exact format we choose; I care much more about the feature than the syntax, always. My contribution so far was to summarise what I thought was the majority opinion, and to challenge the thought that JSON had no discernible benefit. If the majority view is different, I have no problem there.

Clusters of 20 or more standby nodes are reasonably common, so those limits do seem a little small. Synchronous commit behavior is far from being the only cluster metadata we need to record.  I'm thinking now that this illustrates that this is the wrong way altogether and we should just be storing cluster metadata in database tables, which is what was discussed and agreed at the BDR meeting at PGCon. 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Simon Riggs wrote: 

> synchronous_standby_name= is already 25 characters, so that leaves 115
> characters - are they always single byte chars?

I am sorry, I did not get why there is a 140 byte limit. Can you please
explain?




-----
Beena Emerson




Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 20 July 2015 at 08:18, Beena Emerson <memissemerson@gmail.com> wrote:
 Simon Riggs wrote:

> synchronous_standby_name= is already 25 characters, so that leaves 115
> characters - are they always single byte chars?

I am sorry, I did not get why there is a 140 byte limit. Can you please
explain?

Hmm, sorry, I thought Robert had said there was a 140 byte limit. I misread.

I don't think that affects my point. The choice between formats is not solely predicated on whether we have multi-line support.

I still think writing down some actual use cases would help bring the discussion to a conclusion. Inventing a general facility is hard without some clear goals about what we need to support.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Simon Riggs wrote:

> The choice between formats is not
> solely predicated on whether we have multi-line support.

> I still think writing down some actual use cases would help bring the
> discussion to a conclusion. Inventing a general facility is hard without
> some clear goals about what we need to support.

We need to at least support the following:
- Grouping: Specify a set of standbys along with the minimum number of
commits required from the group.
- Group Type: Groups can be either priority or quorum groups.
- Group names: to simplify status reporting
- Nesting: At least 2 levels of nesting

Using JSON, sync rep parameter to replicate in 2 different clusters could be
written as: 
  {"remotes":        {"quorum": 2,         "servers": [{"london":            {"prioirty": 2,             "servers":
["lndn1","lndn2", "lndn3"]           }}           ,             {"nyc":           {"priority": 1,            "servers":
["ny1","ny2"]           }}         ]       }   }
 

The same parameter in the new language (as suggested above) could be written
as:
 'remotes: 2(london: 1[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'

Also, I was thinking the name of the main group could be optional.
Internally, it can be given the name 'default group' or 'main group' for
status reporting.

The above could also be written as:
 '2(london: 2[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'

backward compatible:
In JSON, while validating we may have to check if it starts with '{' to go
for JSON parsing else proceed with the current method.

A,B,C => 1[A,B,C]. This can be added in the new parser code.

Thoughts?



-----
Beena Emerson




Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Mon, Jul 20, 2015 at 9:59 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Simon Riggs wrote:
>
>> The choice between formats is not
>> solely predicated on whether we have multi-line support.
>
>> I still think writing down some actual use cases would help bring the
>> discussion to a conclusion. Inventing a general facility is hard without
>> some clear goals about what we need to support.
>
> We need to at least support the following:
> - Grouping: Specify a set of standbys along with the minimum number of
> commits required from the group.
> - Group Type: Groups can be either priority or quorum groups.

As far as I understood, at the lowest level a group is just an alias
for a list of nodes; quorum or priority are properties that can be
applied to a group of nodes when this group is used in the expression
that defines what synchronous commit means.

> - Group names: to simplify status reporting
> - Nesting: At least 2 levels of nesting

If I am following correctly, at the first level there is the
definition of the top level objects, like groups and sync expression.

> Using JSON, sync rep parameter to replicate in 2 different clusters could be
> written as:
>
>    {"remotes":
>         {"quorum": 2,
>          "servers": [{"london":
>             {"priority": 2,
>              "servers": ["lndn1", "lndn2", "lndn3"]
>             }}
>             ,
>               {"nyc":
>             {"priority": 1,
>              "servers": ["ny1", "ny2"]
>             }}
>           ]
>         }
>     }
> The same parameter in the new language (as suggested above) could be written
> as:
>  'remotes: 2(london: 1[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'

OK, there is a typo. That's actually 2(london: 2[lndn1, lndn2, lndn3],
nyc: 1[ny1, ny2]) in your grammar. Honestly, if we want group aliases,
I think that JSON makes the most sense. One of the advantages of a
group is that you can use it in several places in the blob and set
different properties on it, hence we should be able to define a
group outside of the sync expression.

Hence I would think that something like that makes more sense:
{       "sync_standby_names":       {               "quorum":2,               "nodes":               [
    {"priority":1,"group":"cluster1"},                       {"quorum":2,"nodes":["node1","node2","node3"]}
 ]       },       "groups":       {               "cluster1":["node11","node12","node13"],
"cluster2":["node21","node22","node23"]      }
 
}

> Also, I was thinking the name of the main group could be optional.
> Internally, it can be given the name 'default group' or 'main group' for
> status reporting.
>
> The above could also be written as:
>  '2(london: 2[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
>
> backward compatible:
> In JSON, while validating we may have to check if it starts with '{' to go

Something worth noticing, application_name can begin with "{".

> for JSON parsing else proceed with the current method.

> A,B,C => 1[A,B,C]. This can be added in the new parser code.

This makes sense. We could do the same for JSON-based format as well
by reusing the in-memory structure used to deparse the blob when the
former grammar is used as well.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Sawada Masahiko
Date:
On Tue, Jul 21, 2015 at 3:50 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Jul 20, 2015 at 9:59 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>> Simon Riggs wrote:
>>
>>> The choice between formats is not
>>> solely predicated on whether we have multi-line support.
>>
>>> I still think writing down some actual use cases would help bring the
>>> discussion to a conclusion. Inventing a general facility is hard without
>>> some clear goals about what we need to support.
>>
>> We need to at least support the following:
>> - Grouping: Specify a set of standbys along with the minimum number of
>> commits required from the group.
>> - Group Type: Groups can be either priority or quorum groups.
>
> As far as I understood, at the lowest level a group is just an alias
> for a list of nodes; quorum or priority are properties that can be
> applied to a group of nodes when this group is used in the expression
> that defines what synchronous commit means.
>
>> - Group names: to simplify status reporting
>> - Nesting: At least 2 levels of nesting
>
> If I am following correctly, at the first level there is the
> definition of the top level objects, like groups and sync expression.
>

Grouping and using the same application_name on different servers are
similar. How does using the same application_name on different servers work?

>> Using JSON, sync rep parameter to replicate in 2 different clusters could be
>> written as:
>>
>>    {"remotes":
>>         {"quorum": 2,
>>          "servers": [{"london":
>>             {"priority": 2,
>>              "servers": ["lndn1", "lndn2", "lndn3"]
>>             }}
>>             ,
>>               {"nyc":
>>             {"priority": 1,
>>              "servers": ["ny1", "ny2"]
>>             }}
>>           ]
>>         }
>>     }
>> The same parameter in the new language (as suggested above) could be written
>> as:
>>  'remotes: 2(london: 1[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
>
> OK, there is a typo. That's actually 2(london: 2[lndn1, lndn2, lndn3],
> nyc: 1[ny1, ny2]) in your grammar. Honestly, if we want group aliases,
> I think that JSON makes the most sense. One of the advantages of a
> group is that you can use it in several places in the blob and set
> different properties on it, hence we should be able to define a
> group outside of the sync expression.
> Hence I would think that something like that makes more sense:
> {
>         "sync_standby_names":
>         {
>                 "quorum":2,
>                 "nodes":
>                 [
>                         {"priority":1,"group":"cluster1"},
>                         {"quorum":2,"nodes":["node1","node2","node3"]}
>                 ]
>         },
>         "groups":
>         {
>                 "cluster1":["node11","node12","node13"],
>                 "cluster2":["node21","node22","node23"]
>         }
> }
>
>> Also, I was thinking the name of the main group could be optional.
>> Internally, it can be given the name 'default group' or 'main group' for
>> status reporting.
>>
>> The above could also be written as:
>>  '2(london: 2[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
>>
>> backward compatible:
>> In JSON, while validating we may have to check if it starts with '{' to go
>
> Something worth noticing, application_name can begin with "{".
>
>> for JSON parsing else proceed with the current method.
>
>> A,B,C => 1[A,B,C]. This can be added in the new parser code.
>
> This makes sense. We could do the same for JSON-based format as well
> by reusing the in-memory structure used to deparse the blob when the
> former grammar is used as well.

If I validate the s_s_names JSON syntax, I will definitely use JSONB
rather than JSON, because JSONB has some useful functions today for
adding and deleting nodes in s_s_names.
But the downside of using JSONB for s_s_names is that it can switch
the key order in place (and remove duplicate keys).
For example, with the syntax Michael suggested,


* JSON (just casting JSON)

{
        "sync_standby_names":
        {
                "quorum":2,
                "nodes":
                [
                        {"priority":1,"group":"cluster1"},
                        {"quorum":2,"nodes":["node1","node2","node3"]}
                ]
        },
        "groups":
        {
                "cluster1":["node11","node12","node13"],
                "cluster2":["node21","node22","node23"]
        }
}

* JSONB (using jsonb_pretty)

{
    "groups": {
        "cluster1": [
            "node11",
            "node12",
            "node13"
        ],
        "cluster2": [
            "node21",
            "node22",
            "node23"
        ]
    },
    "sync_standby_names": {
        "nodes": [
            {
                "group": "cluster1",
                "priority": 1
            },
            {
                "nodes": [
                    "node1",
                    "node2",
                    "node3"
                ],
                "quorum": 2
            }
        ],
        "quorum": 2
    }
}

"group" and "sync_standby_names" has been switched place. I'm not sure
it's good for the users.
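
A minimal illustration of that normalization at cast time (key
reordering plus duplicate removal):

SELECT '{"b": 1, "a": 2, "a": 3}'::json;   -- unchanged: {"b": 1, "a": 2, "a": 3}
SELECT '{"b": 1, "a": 2, "a": 3}'::jsonb;  -- normalized: {"a": 3, "b": 1}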

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Jul 29, 2015 at 9:03 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Tue, Jul 21, 2015 at 3:50 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Mon, Jul 20, 2015 at 9:59 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>>> Simon Riggs wrote:
>>>
>>>> The choice between formats is not
>>>> solely predicated on whether we have multi-line support.
>>>
>>>> I still think writing down some actual use cases would help bring the
>>>> discussion to a conclusion. Inventing a general facility is hard without
>>>> some clear goals about what we need to support.
>>>
>>> We need to at least support the following:
>>> - Grouping: Specify a set of standbys along with the minimum number of
>>> commits required from the group.
>>> - Group Type: Groups can be either priority or quorum groups.
>>
>> As far as I understood, at the lowest level a group is just an alias
>> for a list of nodes; quorum or priority are properties that can be
>> applied to a group of nodes when this group is used in the expression
>> that defines what synchronous commit means.
>>
>>> - Group names: to simplify status reporting
>>> - Nesting: At least 2 levels of nesting
>>
>> If I am following correctly, at the first level there is the
>> definition of the top level objects, like groups and sync expression.
>>
>
> Grouping and using the same application_name on different servers are
> similar. How does using the same application_name on different servers work?

In the case of a priority group both nodes get the same priority.
Imagine for example that we need to wait for 2 nodes with the lowest
priority: node1 with priority 1, node2 with priority 2 and again node2
with priority 2; we would wait for the first one, and then one of the
second. In a quorum group, any of them could qualify for selection.

>>> Using JSON, sync rep parameter to replicate in 2 different clusters could be
>>> written as:
>>>
>>>    {"remotes":
>>>         {"quorum": 2,
>>>          "servers": [{"london":
>>>             {"priority": 2,
>>>              "servers": ["lndn1", "lndn2", "lndn3"]
>>>             }}
>>>             ,
>>>               {"nyc":
>>>             {"priority": 1,
>>>              "servers": ["ny1", "ny2"]
>>>             }}
>>>           ]
>>>         }
>>>     }
>>> The same parameter in the new language (as suggested above) could be written
>>> as:
>>>  'remotes: 2(london: 1[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
>>
>> OK, there is a typo. That's actually 2(london: 2[lndn1, lndn2, lndn3],
>> nyc: 1[ny1, ny2]) in your grammar. Honestly, if we want group aliases,
>> I think that JSON makes the most sense. One of the advantages of a
>> group is that you can use it in several places in the blob and set
>> different properties on it, hence we should be able to define a
>> group outside of the sync expression.
>> Hence I would think that something like that makes more sense:
>> {
>>         "sync_standby_names":
>>         {
>>                 "quorum":2,
>>                 "nodes":
>>                 [
>>                         {"priority":1,"group":"cluster1"},
>>                         {"quorum":2,"nodes":["node1","node2","node3"]}
>>                 ]
>>         },
>>         "groups":
>>         {
>>                 "cluster1":["node11","node12","node13"],
>>                 "cluster2":["node21","node22","node23"]
>>         }
>> }
>>
>>> Also, I was thinking the name of the main group could be optional.
>>> Internally, it can be given the name 'default group' or 'main group' for
>>> status reporting.
>>>
>>> The above could also be written as:
>>>  '2(london: 2[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
>>>
>>> backward compatible:
>>> In JSON, while validating we may have to check if it starts with '{' to go
>>
>> Something worth noticing, application_name can begin with "{".
>>
>>> for JSON parsing else proceed with the current method.
>>
>>> A,B,C => 1[A,B,C]. This can be added in the new parser code.
>>
>> This makes sense. We could do the same for JSON-based format as well
>> by reusing the in-memory structure used to deparse the blob when the
>> former grammar is used as well.
>
> If I validate the s_s_names JSON syntax, I will definitely use JSONB
> rather than JSON, because JSONB has some useful functions today for
> adding and deleting nodes in s_s_names.
> But the downside of using JSONB for s_s_names is that it can switch
> the key order in place (and remove duplicate keys).
> For example, with the syntax Michael suggested,
> [...]
> "group" and "sync_standby_names" has been switched place. I'm not sure
> it's good for the users.

I think that's perfectly fine.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Hello,

Just looking at how the 2 different methods can be used to set the s_s_names
value.

1. For a simple case where quorum is required for a single group the JSON
could be:
{        "sync_standby_names":        {                "quorum":2,                "nodes":                [
"node1","node2","node3"]        }}
 

or
{        "sync_standby_names":        {                "quorum":2,                "group": "cluster1"        },
"groups":       {                "cluster1":["node1","node2","node3"]        }}
 

Language:
2(node1, node2, node3)


2. For having quorum between different groups and node:

{
        "sync_standby_names":
        {
                "quorum":2,
                "nodes":
                   [
                       {"priority":1,"nodes":["node0"]},
                       {"quorum":2,"group": "cluster1"}
                   ]
        },
        "groups":
        {
                "cluster1":["node1","node2","node3"]
        }
}

or

{
        "sync_standby_names":
        {
                "quorum":2,
                "nodes":
                   [
                       {"priority":1,"group": "cluster2"},
                       {"quorum":2,"group": "cluster1"}
                   ]
        },
        "groups":
        {
                "cluster1":["node1","node2","node3"],
                "cluster2":["node0"]
        }
}

Language:
2 (node0, cluster1: 2(node1, node2, node3))

Since there will not be much nesting and grouping, I still prefer the new
language to JSON.
I understand one can easily modify/add groups in JSON using built-in
functions, but I think changes will not be made too often.
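
For what it's worth, with a blob shaped like the examples above (the
shape itself is still hypothetical), such an edit is a one-liner with
the 9.5 jsonb operators, e.g. dropping node2 from cluster1:

SELECT cfg #- '{groups,cluster1,1}'   -- delete the array element at index 1
FROM (SELECT '{"sync_standby_names": {"quorum": 2, "group": "cluster1"},
               "groups": {"cluster1": ["node1", "node2", "node3"]}}'::jsonb AS cfg) s;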



-----
Beena Emerson




Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Sun, Jul 19, 2015 at 4:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> On 07/17/2015 04:36 PM, Jim Nasby wrote:
>>> I'm guessing it'd be really ugly/hard to support at least this GUC being
>>> multi-line?
>
>> Mind you, multi-line GUCs would be useful otherwise, but we don't want
>> to hinge this feature on making that work.
>
> I'm pretty sure that changing the GUC parser to allow quoted strings to
> continue across lines would be trivial.  The problem with it is not that
> it's hard, it's that omitting a closing quote mark would then result in
> the entire file being syntactically broken, with the error message(s)
> almost certainly pointing somewhere else than where the actual mistake is.
> Do we really want such a global reduction in friendliness to make this
> feature easier?

Maybe shoehorning this into the GUC mechanism is the wrong thing, and
what we really need is a new config file for this.  The information
we're proposing to store seems complex enough to justify that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Thu, Jul 30, 2015 at 2:16 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello,
>
> Just looking at how the 2 different methods can be used to set the s_s_names
> value.
>
> 1. For a simple case where quorum is required for a single group the JSON
> could be:
>
>  {
>          "sync_standby_names":
>          {
>                  "quorum":2,
>                  "nodes":
>                  [ "node1","node2","node3" ]
>          }
>  }
>
> or
>
>  {
>          "sync_standby_names":
>          {
>                  "quorum":2,
>                  "group": "cluster1"
>          },
>          "groups":
>          {
>                  "cluster1":["node1","node2","node3"]
>          }
>  }
>
> Language:
> 2(node1, node2, node3)
>
>
> 2. For having quorum between different groups and node:
>  {
>          "sync_standby_names":
>          {
>                  "quorum":2,
>                  "nodes":
>                     [
>                         {"priority":1,"nodes":["node0"]},
>                         {"quorum":2,"group": "cluster1"}
>                     ]
>          },
>          "groups":
>          {
>                  "cluster1":["node1","node2","node3"]
>          }
>  }
>
> or
>  {
>          "sync_standby_names":
>          {
>                  "quorum":2,
>                  "nodes":
>                     [
>                         {"priority":1,"group": "cluster2"},
>                         {"quorum":2,"group": "cluster1"}
>                     ]
>          },
>          "groups":
>          {
>                  "cluster1":["node1","node2","node3"],
>                  "cluster2":["node0"]
>          }
>  }
>
> Language:
> 2 (node0, cluster1: 2(node1, node2, node3))
>
> Since there will not be much nesting and grouping, I still prefer the new
> language to JSON.
> I understand one can easily modify/add groups in JSON using built-in
> functions, but I think changes will not be made too often.
>

If we decide to use a dedicated language, a syntax checker for that
language is needed, via SQL or something similar.
Otherwise we will not be able to know whether that value parses
correctly until reloading or restarting the server.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Tue, Aug 4, 2015 at 2:57 PM, Masahiko Sawada wrote:
> On Thu, Jul 30, 2015 at 2:16 PM, Beena Emerson wrote:
>> Since there will not be much nesting and grouping, I still prefer the new
>> language to JSON.
>> I understand one can easily modify/add groups in JSON using built-in
>> functions, but I think changes will not be made too often.
>>
>
> If we decide to use a dedicated language, a syntax checker for that
> language is needed, via SQL or something similar.

Well, sure, both approaches have downsides.

> Otherwise we will not be able to know whether that value parses
> correctly until reloading or restarting the server.

And this is the case for any format. String format validation
for a GUC occurs when the server is reloaded or restarted; one advantage
of JSON is that the parser/validator is already there, so we don't need
to reinvent new machinery for that.
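
For instance, a malformed blob is already rejected by the existing
parser at cast time, with no new code at all:

SELECT '{"quorum": 2}'::json;  -- ok
SELECT '{"quorum": '::json;    -- ERROR:  invalid input syntax for type json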
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Michael Paquier wrote:
> And this is the case for any format. String format validation
> for a GUC occurs when the server is reloaded or restarted; one advantage
> of JSON is that the parser/validator is already there, so we don't need
> to reinvent new machinery for that.

IIUC, we would also have to add additional code to check that the
given JSON has the required keys and entries. For example: the "group" mentioned
in "s_s_names" should be defined in the "groups" section, etc.





-----
Beena Emerson




Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Tue, Aug 4, 2015 at 3:27 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Michael Paquier wrote:
>> And this is the case of any format as well. String format validation
>> for a GUC occurs when server is reloaded or restarted, one advantage
>> of JSON is that the parser validator is already here, so we don't need
>> to reinvent a new machinery for that.
>
> IIUC, we would also have to add additional code to check that the
> given JSON has the required keys and entries. For example: the "group" mentioned
> in "s_s_names" should be defined in the "groups" section, etc.

Yep, true as well.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Robert Haas wrote:
>Maybe shoehorning this into the GUC mechanism is the wrong thing, and
>what we really need is a new config file for this.  The information
>we're proposing to store seems complex enough to justify that.
>

I think the consensus is that JSON is better.
And using a new file with multi line support would be good.

Name of the file: how about pg_syncinfo.conf? 


Backward compatibility: synchronous_standby_names will be supported.
synchronous_standby_names='pg_syncinfo' indicates use of new file.


JSON format:
It would contain 2 main keys: "sync_info" and  "groups"
The "sync_info" would consist of "quorum"/"priority" with the count and
"nodes"/"group" with the group name or node list.
The optional "groups" key would list out all the "group" mentioned within
"sync_info" along with the node list.


Ex:
1.
{
        "sync_info":
        {
                "quorum":2,
                "nodes":
                [
                        "node1","node2", "node3"
                ]
        }
}

2.
{
        "sync_info":
        {
                "quorum":2,
                "nodes":
                [
                        {"priority":1,"group":"cluster1"},
                        {"quorum":2,"group": "cluster2"},
                        "node99"
                ]
        },
        "groups":
        {
                "cluster1":["node11","node12"],
                "cluster2":["node21","node22","node23"]
        }
}

Thoughts?



-----
Beena Emerson




Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Tue, Aug 4, 2015 at 8:37 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Robert Haas wrote:
>>Maybe shoehorning this into the GUC mechanism is the wrong thing, and
>>what we really need is a new config file for this.  The information
>>we're proposing to store seems complex enough to justify that.
>>
>
> I think the consensus is that JSON is better.

I guess so as well. Thanks for brainstorming the whole thread in a single post.

> And using a new file with multi line support would be good.

This file just contains a JSON blob, hence we just need to fetch its
content entirely and then let the server parse it using the existing
facilities.

> Name of the file: how about pg_syncinfo.conf?
> Backward compatibility: synchronous_standby_names will be supported.
> synchronous_standby_names='pg_syncinfo' indicates use of new file.

This strengthens the fact that parsing is done at SIGHUP, so that
sounds fine to me. We may still find out an application_name that uses
pg_syncinfo but well, that's unlikely to happen...

> JSON format:
> It would contain 2 main keys: "sync_info" and  "groups"
> The "sync_info" would consist of "quorum"/"priority" with the count and
> "nodes"/"group" with the group name or node list.
> The optional "groups" key would list out all the "group" mentioned within
> "sync_info" along with the node list.
>
> [...]
>
> Thoughts?

Yes, I think that's the idea. I would let a couple of days to let
people time to give their opinion and objections regarding this
approach though.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Bruce Momjian
Date:
On Wed, Jul  1, 2015 at 11:21:47AM -0700, Josh Berkus wrote:
> All:
> 
> Replying to multiple people below.
> 
> On 07/01/2015 07:15 AM, Fujii Masao wrote:
> > On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh@agliodbs.com> wrote:
> >> You're confusing two separate things.  The primary manageability problem
> >> has nothing to do with altering the parameter.  The main problem is: if
> >> there is more than one synch candidate, how do we determine *after the
> >> master dies* which candidate replica was in synch at the time of
> >> failure?  Currently there is no way to do that.  This proposal plans to,
> >> effectively, add more synch candidate configurations without addressing
> >> that core design failure *at all*.  That's why I say that this patch
> >> decreases overall reliability of the system instead of increasing it.
> > 
> > I agree this is a problem even today, but it's basically independent from
> > the proposed feature *itself*. So I think that it's better to discuss and
> > work on the problem separately. If so, we might be able to provide
> > good way to find new master even if the proposed feature finally fails
> > to be adopted.
> 
> I agree that they're separate features.  My argument is that the quorum
> synch feature isn't materially useful if we don't create some feature to
> identify which server(s) were in synch at the time the master died.

I am coming in here late, but I thought the last time we talked about
this that the only reasonable way to communicate that we have changed to
synchronize with a secondary server (different application_name) is to
allow a GUC-configured command string to be run when a change like this
happens.  The command string would write a status on another server or
send an email.

Based on the new s_s_name API, this would mean whenever we switch to a
different priority level, like 1 to 2, 2 to 3, or 2 to 1.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: Support for N synchronous standby servers - take 2

From
Jim Nasby
Date:
On 8/4/15 9:18 PM, Michael Paquier wrote:
>> >And using a new file with multi line support would be good.
> This file just contains a JSON blob, hence we just need to fetch its
> content entirely and then let the server parse it using the existing
> facilities.

It sounds like there are other places where multi-line GUCs would be
useful, so I think we should just support that instead of creating
something that only works for SR configuration.

I also don't see the problem with supporting multi-line GUCs that are 
wrapped in quotes. Yes, you miss a quote and things blow up, but so 
what? Anyone that's done any amount of programming has faced that 
problem. Heck, if we wanted to be fancy we could watch for the first 
line that could have been another GUC and stick that in a hint.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Hello,

Please find attached the WIP patch for the proposed feature. It is built based on the already discussed design.

Changes made:
- add new parameter "sync_file" to provide the location of the pg_syncinfo file. The default is 'ConfigDir/pg_syncinfo.conf', same as for pg_hba and pg_ident file.
- pg_syncinfo file will hold the sync rep information in the approved JSON format.
- synchronous_standby_names can be set to 'pg_syncinfo.conf'  to read the JSON value stored in the file.
- All the standbys mentioned in the s_s_names or the pg_syncinfo file currently get the priority as 1 and all others as  0 (async)
- Various functions in syncrep.c to read the json file and store the values in a struct to be used in checking the quorum status of syncrep standbys (SyncRepGetQuorumRecPtr function).

It does not support the current behavior for synchronous_standby_names = '*'. I am yet to thoroughly test the patch.

Thoughts?

--
Beena Emerson

Attachment

Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Fri, Sep 11, 2015 at 3:41 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> Please find attached the WIP patch for the proposed feature. It is built
> based on the already discussed design.
>
> Changes made:
> - add new parameter "sync_file" to provide the location of the pg_syncinfo
> file. The default is 'ConfigDir/pg_syncinfo.conf', same as for pg_hba and
> pg_ident file.

I am not sure that's really necessary. We could just hardcode its location.

> - pg_syncinfo file will hold the sync rep information in the approved JSON
> format.

OK. Have you considered as well the approach of adding support for
multi-line GUC parameters? This has been mentioned a couple of times
above, with something like this, I imagine:
param = 'value1,' \
        'value2,' \
        'value3'
and this reads as 'value1,value2,value3'. This would benefit other
parameters as well.

> - synchronous_standby_names can be set to 'pg_syncinfo.conf'  to read the
> JSON value stored in the file.

Check.

> - All the standbys mentioned in the s_s_names or the pg_syncinfo file
> currently get the priority as 1 and all others as  0 (async)
> - Various functions in syncrep.c to read the json file and store the values
> in a struct to be used in checking the quorum status of syncrep standbys
> (SyncRepGetQuorumRecPtr function).
> It does not support the current behavior for synchronous_standby_names = '*'.
> I am yet to thoroughly test the patch.

As this patch adds a whole new infrastructure, it is going to need
complex test setups with many configurations that will require
bash-ing a bunch of new things, and we are not protected from bugs in
those scripts or from manual manipulation mistakes during the tests.
What looks really necessary with this patch is to include a set of
tests proving that the patch actually does what it should with complex
scenarios, and that it does it correctly. So we had better perhaps move
on with this patch first:
https://commitfest.postgresql.org/6/197/

And it would be really nice to get the tests of this patch integrated
with it as well. We are not protected from bugs in this patch either,
but having a centralized infrastructure will add a level of confidence
that we are doing things the right way. Your patch offers as well a
good occasion to see if there would be some generic routines that
would be helpful in this recovery test suite.
Regards,
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Sameer Thakur-2
Date:
Hello,
I applied the patch to HEAD and tried to set up basic async replication, but
I got an error. I turned on logging; details below.

Unpatched Primary Log
LOG:  database system was shut down at 2015-09-12 13:41:40 IST
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

Unpatched Standby log
LOG:  entering standby mode
LOG:  redo starts at 0/2000028
LOG:  invalid record length at 0/20000D0
LOG:  started streaming WAL from primary at 0/2000000 on timeline 1
LOG:  consistent recovery state reached at 0/20000F8
LOG:  database system is ready to accept read only connections

Patched Primary log
LOG:  database system was shut down at 2015-09-12 13:50:17 IST
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
LOG:  server process (PID 17317) was terminated by signal 11: Segmentation
fault
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2015-09-12 13:50:18
IST
FATAL:  the database system is in recovery mode
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  invalid record length at 0/3000098
LOG:  redo is not required
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
LOG:  server process (PID 17343) was terminated by signal 11: Segmentation
fault
LOG:  terminating any other active server processes

Patched Standby log
LOG:  database system was interrupted; last known up at 2015-09-12 13:50:16
IST
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
LOG:  entering standby mode
LOG:  redo starts at 0/2000028
LOG:  invalid record length at 0/20000D0
LOG:  started streaming WAL from primary at 0/2000000 on timeline 1
FATAL:  could not receive data from WAL stream: server closed the connection
unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
FATAL:  could not connect to the primary server: FATAL:  the database system
is in recovery mode

Not sure if there is something I am missing which causes this.
regards
Sameer






Re: Support for N synchronous standby servers - take 2

From
Thomas Munro
Date:
On Fri, Sep 11, 2015 at 6:41 AM, Beena Emerson <memissemerson@gmail.com> wrote:
Hello,

Please find attached the WIP patch for the proposed feature. It is built based on the already discussed design.

Changes made:
- add new parameter "sync_file" to provide the location of the pg_syncinfo file. The default is 'ConfigDir/pg_syncinfo.conf', same as for pg_hba and pg_ident file.
- pg_syncinfo file will hold the sync rep information in the approved JSON format.
- synchronous_standby_names can be set to 'pg_syncinfo.conf'  to read the JSON value stored in the file.
- All the standbys mentioned in the s_s_names or the pg_syncinfo file currently get the priority as 1 and all others as  0 (async)
- Various functions in syncrep.c to read the json file and store the values in a struct to be used in checking the quorum status of syncrep standbys (SyncRepGetQuorumRecPtr function).

It does not support the current behavior for synchronous_standby_names = '*'. I am yet to thoroughly test the patch.

Thoughts?

This is a great feature, thanks for working on it!

Here is some initial feedback after a quick eyeballing of the patch and a couple of test runs.  I will have more soon after I figure out how to really test it and try out the configuration system...

It crashes when async standbys connect, as already reported by Sameer Thakur.  It doesn't crash with this change:

@@ -700,6 +700,9 @@ SyncRepGetStandbyPriority(void)
        if (am_cascading_walsender)
                return 0;
 
+       if (SyncRepStandbyInfo == NULL)
+               return 0;
+
        if (CheckNameList(SyncRepStandbyInfo, application_name, false))
                return 1;

I got the following error from clang-602.0.53 on my Mac:

walsender.c:1955:11: error: passing 'char volatile[8192]' to parameter of type 'void *' discards qualifiers [-Werror,-Wincompatible-pointer-types-discards-qualifiers]
                        memcpy(walsnd->name, application_name, strlen(application_name));
                               ^~~~~~~~~~~~

I think your memcpy and explicit null termination could be replaced with strcpy, or maybe something to limit buffer overrun damage in case of sizing bugs elsewhere.  But to get rid of that warning you'd still need to cast away volatile...  I note that you do that in SyncRepGetQuorumRecPtr when you read the string with strcmp.  But is that actually safe, with respect to load/store reordering around spinlock operations?  Do we actually need volatile-preserving cstring copy and compare functions for this type of thing?

In walsender_private.h:

+#define MAX_APPLICATION_NAME_LEN 8192

What is the basis for this size?  application_name is a GUC with GUC_IS_NAME set.  As far as I can see, it's limited to NAMEDATALEN (including null terminator), so why not use the exact same buffer size?

In load_syncinfo:

+                       len = strlen(standby_name);
+                       temp->name = malloc(len);
+                       memcpy(temp->name, standby_name, len);
+                       temp->name[len] = '\0';

This buffer is one byte too short, and doesn't handle malloc failure.  And generally, this code is equivalent to strdup, and could instead be pstrdup (which raises an error on allocation failure for free).  But I'm not sure which memory context is appropriate and when this should be freed.

Same problem in sync_info_scalar:

+                               state->cur_node->name = (char *) malloc(len);
+                               memcpy(state->cur_node->name, token, strlen(token));
+                               state->cur_node->name[len] = '\0';
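
A minimal sketch of the pstrdup-based fix for both spots (my suggestion, not
code from the patch; pstrdup allocates len + 1 bytes in the current memory
context and raises an error on allocation failure):

    temp->name = pstrdup(standby_name);
    ...
    state->cur_node->name = pstrdup(token);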

In SyncRepGetQuorumRecPtr, some extra curly braces:

+       if (node->next)
+       {
+               SyncRepGetQuorumRecPtr(node->next, lsnlist, node->priority_group);
+       }

... and:

+       if (*lsnlist == NIL)
+       {
+               *lsnlist = lappend(*lsnlist, lsn);
+       }

In sync_info_object_field_start:

+                               ereport(ERROR,
+                                               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                                errmsg("Unrecognised key \"%s\" in file \"%s\"",
+                                                               fname, SYNC_FILENAME)));

I think this should use US spelling (-ized) as you have it elsewhere.  Also the primary error message should not be capitalised according to the "Error Message Style Guide".

--

Re: Support for N synchronous standby servers - take 2

From
Sameer Thakur-2
Date:
Hello,
Continuing testing:

For the pg_syncinfo.conf below, an error is thrown.

{
    "sync_info":
    {
        "quorum": 3,
        "nodes":
        [
            {"priority":1,"group":"cluster1"},
            "A"
        ]
    },
    "groups":
    {
        "cluster1":["B","C"]
    }
}


LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
TRAP: FailedAssertion("!(n < list->length)", File: "list.c", Line: 392)
LOG:  server process (PID 17764) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2015-09-15 17:15:35
IST

In the scenario here the quorum specified is 3 but there are just 2 nodes;
what should the expected behaviour be?
I feel the JSON parsing should throw an appropriate error with an
explanation, as the sync rule does not make sense. The behaviour where the
master keeps waiting for the non-existent 3rd quorum node will not be
helpful anyway.
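
Something along these lines in the parsing code would catch it early (just a
sketch; the struct fields here are hypothetical stand-ins for whatever the
patch's JSON parser builds):

    if (group->quorum > list_length(group->nodes))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("quorum %d is larger than the number of nodes (%d) in group \"%s\"",
                        group->quorum, list_length(group->nodes),
                        group->name)));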

regards
Sameer






Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Hello,

Thank you Thomas and Sameer for checking the patch and giving your comments!

I will post an updated patch soon.

Regards,

Beena Emerson

Re: Support for N synchronous standby servers - take 2

From
Thomas Munro
Date:
On Tue, Sep 15, 2015 at 3:19 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I got the following error from clang-602.0.53 on my Mac:
>
> walsender.c:1955:11: error: passing 'char volatile[8192]' to parameter of
> type 'void *' discards qualifiers
> [-Werror,-Wincompatible-pointer-types-discards-qualifiers]
>                         memcpy(walsnd->name, application_name,
> strlen(application_name));
>                                ^~~~~~~~~~~~
>
> I think your memcpy and explicit null termination could be replaced with
> strcpy, or maybe something to limit buffer overrun damage in case of sizing
> bugs elsewhere.  But to get rid of that warning you'd still need to cast
> away volatile...  I note that you do that in SyncRepGetQuorumRecPtr when you
> read the string with strcmp.  But is that actually safe, with respect to
> load/store reordering around spinlock operations?  Do we actually need
> volatile-preserving cstring copy and compare functions for this type of
> thing?

Maybe volatile isn't even needed here at all.  I have asked that
question separately here:

http://www.postgresql.org/message-id/CAEepm=2f-N5MD+xYYyO=yBpC9SoOdCdrdiKia9_oLTSiu1uBtA@mail.gmail.com

In SyncRepGetQuorumRecPtr you have strcmp(node->name, (char *)
walsnd->name): that might be more problematic.  I'm not sure about
casting away volatile (it's probably fine at least in practice), but
it's accessing walsnd without the spinlock.  The existing
syncrep.c code already did that sort of thing (and I haven't had time
to grok the thinking behind that yet), but I think you may be upping
the ante here by doing non-atomic reads with strcmp (whereas the code
in master always read single word values).  Imagine if you hit a slot
that was being set up by InitWalSenderSlot concurrently, and memcpy
was in the process of writing the name.  strcmp would read garbage,
maybe even off the end of the buffer because there is no terminator
yet.  That may be incredibly unlikely, but it seems fishy.  Or I may
have misunderstood the synchronisation at work here completely :-)
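
One way to avoid the torn read would be something like this (a sketch only,
assuming the per-walsender spinlock is walsnd->mutex and that the name buffer
is NAMEDATALEN-sized as suggested above):

    char        standby_name[NAMEDATALEN];

    /* copy under the spinlock so we never observe a half-written name */
    SpinLockAcquire(&walsnd->mutex);
    strlcpy(standby_name, (const char *) walsnd->name, sizeof(standby_name));
    SpinLockRelease(&walsnd->mutex);

    if (strcmp(node->name, standby_name) == 0)
        ...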

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Support for N synchronous standby servers - take 2

From
"Amir Rohan"
Date:
>On 07/16/15, Robert Haas wrote:
>
>>> * Developers will immediately understand the format
>>
>> I doubt it.  I think any format that we pick will have to be carefully
>> documented.  People may know what JSON looks like in general, but they
>> will not immediately know what bells and whistles are available in
>> this context.
>>
>>> * Easy to programmatically manipulate in a range of languages
>>
>> <...> I think it will be rare to need to parse the postgresql.conf string,
>> manipulate it programmatically, and then put it back.
>
>On Sun, Jul 19, 2015 at 4:16 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>>> On 07/17/2015 04:36 PM, Jim Nasby wrote:
>>>> I'm guessing it'd be really ugly/hard to support at least this GUC being
>>>> multi-line?
>>
>>> Mind you, multi-line GUCs would be useful otherwise, but we don't want
>>> to hinge this feature on making that work.
>>
>> Do we really want such a global reduction in friendliness to make this
>> feature easier?
>
>Maybe shoehorning this into the GUC mechanism is the wrong thing, and
>what we really need is a new config file for this.  The information
>we're proposing to store seems complex enough to justify that.

It seems like:
1) There's a need to support structured data in configuration for future
needs as well; it isn't specific to this feature.
2) There should/must be a better way to validate configuration than
restarting the server in search of syntax errors.

Creating a whole new configuration file for just one feature *and* in a
different format seems suboptimal.  What happens when the next 20 features
need structured config data, where does that go? Will there be additional
JSON config files *and* perhaps new mini-language values in .conf as
development continues?  How many dedicated configuration files is too many?

Now, about JSON... (earlier upthread):

On 07/01/15, Peter Eisentraut wrote:
> On 6/26/15 2:53 PM, Josh Berkus wrote:
> > I would also suggest that if I lose this battle and
> > we decide to go with a single stringy GUC, that we at least use JSON
> > instead of defining our own, proprietary, syntax?
>
> Does JSON have a natural syntax for a set without order?

No. Nor timestamps. It doesn't even distinguish integer from float
(though parsers do it for you in dynamic languages). It's all because
of its unsightly javascript roots.

The current patch is now forced by JSON to conflate sets and lists, so
un/ordered semantics are no longer tied to type but to the specific
configuration keys.
So, if a feature ever needs a key where the difference between set and
list matters and needs to support both, you'll need separate keys (both
with lists, but meaning different things) or a separate "mode" key or
something. Not terrible, just iffy.

Others have found JSON unsatisfactory before. For example, the clojure
community has made (at least) two attempts at alternatives, complete
with the meh adoption rates you'd expect despite being more capable
formats:

http://blog.cognitect.com/blog/2014/7/22/transit
https://github.com/edn-format/edn

There's also YAML, TOML, etc., none as universal as JSON. But to
reiterate, JSON itself has lackluster type support (no sets, no
timestamps), is verbose, is easy to malform when editing (missed a curly
brace, shouldn't use a single quote), isn't extensible, and my personal
pet peeve is that it doesn't allow non-string or bare-string keys in
maps (a.k.a "death by double-quotes").

Python has the very natural {1,2,3} syntax for sets, but of course
that's not part of JSON.

If JSON wins out despite all this, one alternative not discussed is to
extend the .conf parser to accept JSON dicts as a fundamental type. e.g.:

###
data_directory = 'ConfigDir'
port = 5432
work_mem = 4MB
hot_standby = off
client_min_messages = notice
log_error_verbosity = default
autovacuum_analyze_scale_factor = 0.1
synch_standby_config = {
  "sync_info": {
    "nodes": [
      {
        "priority": 1,
        "group": "cluster1"
      },
      "A"
    ],
    "quorum": 3
  },
  "groups": {
    "cluster1": [
      "B",
      "C"
    ]
  }
}

This *will* break someone's perl, I would guess. Ironically, those
scripts wouldn't have broken if some structured format were in use for
the configuration data when they were written...
`postgres --describe-config` is also pretty much tied to a
line-oriented configuration.

Amir

p.s.

MIA configuration validation tool/switch should probably get a thread too.

Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Wed, Sep 23, 2015 at 12:11 AM, Amir Rohan <amir.rohan@mail.com> wrote:
> It seems like:
> 1) There's a need to support structured data in configuration for future
> needs as well, it isn't specific to this feature.
> 2) There should/must be a better way to validate configuration than
> restarting the server in search of syntax errors.
>
> Creating a whole new configuration file for just one feature *and* in a
> different
> format seems suboptimal.  What happens when the next 20 features need
> structured
> config data, where does that go? will there be additional JSON config files
> *and* perhaps
> new mini-language values in .conf as development continues?  How many
> dedicated
> configuration files is too many?

Well, I think that if we create our own mini-language, it may well be
possible to make the configuration for this compact enough to fit on
one line.  If we use JSON, I think there's zap chance of that.  But...
that's just what *I* think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Well, I think that if we create our own mini-language, it may well be
> possible to make the configuration for this compact enough to fit on
> one line.  If we use JSON, I think there's zap chance of that.  But...
> that's just what *I* think.

Well, that depends on what you think the typical-case complexity is
and on how long a line will fit in your editor window ;-).

I think that we can't make much progress on this argument without a pretty
concrete idea of what typical and worst-case configurations would look
like.  Would someone like to put forward examples?  Then we could try them
in any specific syntax that's suggested and see how verbose it gets.

FWIW, I tend to agree that if we think common cases can be held to,
say, a hundred or two hundred characters, that we're best off avoiding
the challenges of dealing with multi-line postgresql.conf entries.
And I'm really not much in favor of a separate file; if we go that way
then we're going to have to reinvent a huge amount of infrastructure
that already exists for GUCs.
        regards, tom lane



Re: Support for N synchronous standby servers - take 2

From
"Amir Rohan"
Date:
> Sent: Thursday, September 24, 2015 at 3:11 AM
> 
> From: "Tom Lane" <tgl@sss.pgh.pa.us>
> Robert Haas <robertmhaas@gmail.com> writes:
> > Well, I think that if we create our own mini-language, it may well be
> > possible to make the configuration for this compact enough to fit on
> > one line. If we use JSON, I think there's zap chance of that. But...
> > that's just what *I* think.
>> 

I've implemented a parser that reads your mini-language and dumps a JSON
equivalent. Once you start naming groups the line fills up quite quickly,
and on the other hand the JSON is verbose and fiddly.
But implementing a mechanism that can be used by other features in
the future seems the deciding factor here, rather than the brevity of a
bespoke mini-language.

> 
> <...> we're best off avoiding the challenges of dealing with multi-line 
> postgresql.conf entries.
> 
> And I'm really not much in favor of a separate file; if we go that way
> then we're going to have to reinvent a huge amount of infrastructure
> that already exists for GUCs.
> 
> regards, tom lane

Adding support for JSON objects (or some other kind of composite data type) 
to the .conf parser would negate the need for one, and would also solve the
problem being discussed for future cases.
I don't know whether that would break some tooling you care about, 
but if there's interest, I can probably do some of that work.



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Fri, Sep 11, 2015 at 10:15 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Sep 11, 2015 at 3:41 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> Please find attached the WIP patch for the proposed feature. It is built
>> based on the already discussed design.
>>
>> Changes made:
>> - add new parameter "sync_file" to provide the location of the pg_syncinfo
>> file. The default is 'ConfigDir/pg_syncinfo.conf', same as for pg_hba and
>> pg_ident file.
>
> I am not sure that's really necessary. We could just hardcode its location.
>
>> - pg_syncinfo file will hold the sync rep information in the approved JSON
>> format.
>
> OK. Have you considered as well the approach to add support for
> multi-line GUC parameters? This has been mentioned a couple of time
> above as well, with something like that I imagine:
> param = 'value1,' \
>     'value2,' \
>     'value3'
> and this reads as 'value1,value2,value3'. This would benefit as well
> for other parameters.
>

I agree with adding support for multi-line GUC parameters.
But I thought it would be:
param = 'param1,
param2,
param3'

This reads as 'param1,param2,param3'.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Amir Rohan wrote:
> But implementing a mechanism that can be used by other features in
> the future seems the deciding factor here, rather than the brevity of a
> bespoke mini-language.

One decision to be taken is which of JSON or the mini-language is better
for the SR setting.
The mini-language can fit into a single postgresql.conf line.

For JSON, currently a different file is used. But as said earlier, if
composite types are required in the future for other parameters, then
having multiple .conf files does not make sense. To avoid this we can:
- support multi-line GUCs, which would be helpful for other comma-separated
  conf values along with s_s_names (this can make the mini-language more
  readable as well)
- allow JSON support in postgresql.conf, so that other parameters can use
  JSON within postgresql.conf in the future.

What are the chances of future data requiring JSON? I think rare.

> > And I'm really not much in favor of a separate file; if we go that way
> > then we're going to have to reinvent a huge amount of infrastructure
> > that already exists for GUCs.
>
> Adding support for JSON objects (or some other kind of composite data
> type) 
> to the .conf parser would negate the need for one, and would also solve
> the
> problem being discussed for future cases.

With the current pg_syncinfo file, the only code added was to check for
the pg_syncinfo file in the specified path and read its entire content
into a variable used for further parsing, which could have been avoided
with a multi-line GUC.




-----
Beena Emerson




Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Hello,

The JSON method was used in the patch because it seemed to be the group
consensus.

Requirement:
  - Grouping: specify a list of node names with the required number of ACKs
    for the group. We could have a priority or a quorum group. Quorum treats
    all the standbys in the group as equals, and ACKs from any k of them can
    be considered. In priority behavior, ACKs must be received from the
    specified k lowest-priority servers for a successful transaction.
  - Group names, to enable easier status reporting per group. The topmost
    group may not be named; it will be assigned a default name. All the
    subgroups must be named.
  - Not more than 3 groups with 1 level of nesting expected.

Behavior in submitted patch:
  - The topmost group is named "Default Group". All the other standby names
    or groups have to be listed within it.
  - When more than 1 connected standby has the same name, the highest LSN
    among them is chosen. Example: 2 priority in (X,Y,Z). If there are 2
    nodes named X connected, even though both X have returned ACKs, the
    server will wait for an ACK from Y.
  - There are no "potential" standbys. In quorum behavior there are no fixed
    standbys which are to be in sync; all members are equal. ACKs from any
    specified n nodes of a set are considered success.

Further:
  - improvements to pg_stat_replication to give the node tree and status?
  - manipulate/edit conf settings using functions
  - regression tests

Mini-lang:
[] - to specify priority
() - to specify quorum
Format - <name> : <count> [<list>]
Not specifying count defaults to 1.
Ex: s_s_names = '2(cluster1: 1(A,B), cluster2: 2[X,Y,Z], U)'
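Decoding that example (my reading, given the syntax above): the outer
'2(...)' is a quorum over three members, so a commit needs ACKs from any 2
of {cluster1, cluster2, U}; cluster1 is satisfied once any 1 of A,B has
ACKed, while cluster2 is satisfied only after the 2 highest-priority
connected standbys among X,Y,Z have ACKed.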

JSON
It would contain 2 main keys: "sync_info" and  "groups"
The "sync_info" would consist of "quorum"/"priority" with the count and
"nodes"/"group" with the group name or node list.
The optional "groups" key would list out all the "group" mentioned within
"sync_info" along with the node list.
Ex: {
    "sync_info":
    {
         "quorum":2,
         "nodes":
         [
              {"quorum":1,"group":"cluster1"},
              {"priority":2,"group": "cluster2"},
              "U"
         ]
    },
    "groups":
    {
         "cluster1":["A","B"],
         "cluster2":["X","Y","z"]
    }
}

JSON and mini-language:
  - JSON is more verbose.
  - You can define a group and use it multiple times in sync settings, but
    since not many levels of nesting are expected I am not sure how useful
    this will be.
  - Though a JSON parser is built in, additional code is required to check
    for the required format of the JSON. For the mini-language, a new
    parser will have to be written.

Despite all this, I feel the mini-language is better, mainly for its brevity.
Also, it will not require additional GUC parser support (multi-line).




-----
Beena Emerson




Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:
Sawada Masahiko wrote:
>
> I agree with adding support for multi-line GUC parameters.
> But I though it is:
> param = 'param1,
> param2,
> param3'
>
> This reads as 'value1,value2,value3'.

Use of '\' ensures that omission of the closing quote does not break the
entire file.
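
For illustration, with the two proposed styles side by side (neither syntax
exists today; both are proposals from upthread):

    # with continuation markers, a lost closing quote breaks only this entry
    synchronous_standby_names = 'node1,' \
        'node2,' \
        'node3'

    # with a bare multi-line quoted value, a lost closing quote makes the
    # parser swallow the following lines as part of the string
    synchronous_standby_names = 'node1,
    node2,
    node3'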





-----
Beena Emerson




Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Thu, Oct 8, 2015 at 7:31 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>
>
> Mini-lang:
> [] - to specify priority
> () - to specify quorum
> Format - <name> : <count> [<list>]
> Not specifying count defaults to 1.
> Ex: s_s_names = '2(cluster1: 1(A,B), cluster2: 2[X,Y,Z], U)'
>
> JSON
> It would contain 2 main keys: "sync_info" and  "groups"
> The "sync_info" would consist of "quorum"/"priority" with the count and
> "nodes"/"group" with the group name or node list.
> The optional "groups" key would list out all the "group" mentioned within
> "sync_info" along with the node list.
>  Ex: {
>      "sync_info":
>      {
>           "quorum":2,
>           "nodes":
>           [
>                {"quorum":1,"group":"cluster1"},
>                {"priority":2,"group": "cluster2"},
>                "U"
>           ]
>      },
>      "groups":
>      {
>           "cluster1":["A","B"],
>           "cluster2":["X","Y","z"]
>      }
> }
>
> JSON  and mini-language:
>        - JSON is more verbose
>        - You can define a group and use it multiple times in sync settings
> but since no many levels or nesting is expected I am not sure how useful
> this will be.
>        - Though JSON parser is inbuilt, additional code is required to check
> for the required format of JSON. For mini-language, new parser will have to
> be written.
>

Sounds like both approaches have some pros and cons, and there are
some people who prefer the mini-language and others who prefer JSON.  I
think one thing that might help is to check how other databases support
this feature, or something somewhat similar to it (mainly with respect to
user interface), as that can help us in knowing what users are already
familiar with.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Fri, Oct 9, 2015 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Sounds like both the approaches have some pros and cons, also there are
> some people who prefer mini-language and others who prefer JSON.  I think
> one thing that might help, is to check how other databases support this
> feature or somewhat similar to this feature (mainly with respect to User
> Interface), as that can help us in knowing what users are already familiar
> with.

+1!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Sat, Oct 10, 2015 at 4:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 9, 2015 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Sounds like both the approaches have some pros and cons, also there are
>> some people who prefer mini-language and others who prefer JSON.  I think
>> one thing that might help, is to check how other databases support this
>> feature or somewhat similar to this feature (mainly with respect to User
>> Interface), as that can help us in knowing what users are already familiar
>> with.
>
> +1!
>

For example, MySQL 5.7 has a similar feature, but it doesn't support
quorum commit and is simpler than the feature PostgreSQL is attempting.
There is one configuration parameter in MySQL 5.7 which indicates the
number of sync replication nodes.
The primary server commits when it receives the specified number of ACKs
from standby servers, regardless of the standby servers' names.
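(If I recall the name correctly, that MySQL 5.7 parameter is
rpl_semi_sync_master_wait_for_slave_count, e.g.:

    SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 2;
)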

And IIRC, the Oracle database doesn't support quorum commit either.
Whether a standby server is sync or async is specified per standby
server in a configuration parameter on the primary node.

I think the JSON format approach and the dedicated language approach
serve different purposes.
The dedicated language approach would be useful for simple configurations,
such as a single level of nesting without groups.
This would allow us to configure replication more simply and easily.
In contrast, the JSON format approach would be useful for complex
configurations.

I think this feature for PostgreSQL should be simple in its first
implementation.
It would be good even if there are some restrictions, such as on the
nesting level or the group setting.
Another new approach that I came up with is:
* Add new parameter synchronous_replication_method (say s_r_method)
which can have two names: 'priority', 'quorum'
* If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3')
is handled using priority. It's same as '[n1,n2,n3]' in dedicated
language.
* If s_r_method = 'quorum', the value of s_s_names is handled using
quorum commit, It's same as '(n1,n2,n3)' in dedicated language.
* Setting of synchronous_standby_names is same as today. That is, the
storing the nesting value is not supported.
* If we want to support more complex syntax like what we are
discussing, we can add the new value to s_r_method, for example
'complex', 'json'.
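
For example, configuration under this proposal could look like this
(hypothetical settings; the parameter names are the ones proposed above):

    # same as '[n1,n2,n3]' in the dedicated language
    synchronous_replication_method = 'priority'
    synchronous_standby_names = 'n1,n2,n3'

    # same as '(n1,n2,n3)' in the dedicated language
    synchronous_replication_method = 'quorum'
    synchronous_standby_names = 'n1,n2,n3'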

Thoughts?

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Josh Berkus
Date:
On 10/13/2015 11:02 AM, Masahiko Sawada wrote:
> I thought that this feature for postgresql should be simple at first
> implementation.
> It would be good even if there are some restriction such as the
> nesting level, the group setting.
> The another new approach that I came up with is,
> * Add new parameter synchronous_replication_method (say s_r_method)
> which can have two names: 'priority', 'quorum'
> * If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3')
> is handled using priority. It's same as '[n1,n2,n3]' in dedicated
> laguage.
> * If s_r_method = 'quorum', the value of s_s_names is handled using
> quorum commit, It's same as '(n1,n2,n3)' in dedicated language.

Well, the first question is: can you implement both of these things for
9.6, realistically?  If you can implement them, then we can argue about
configuration format later.  It's even possible that the nature of your
implementation will enforce a particular syntax.

For example, if your implementation requires sync groups to be named,
then we have to include group names in the syntax.  If you can't
implement nesting in the near future, there's no reason to have a syntax
for it.

> * Setting of synchronous_standby_names is same as today. That is, the
> storing the nesting value is not supported.
> * If we want to support more complex syntax like what we are
> discussing, we can add the new value to s_r_method, for example
> 'complex', 'json'.

I think having two different syntaxes is a bad idea.  I'd rather have a
wholly proprietary configuration markup than deal with two alternate ones.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Wed, Oct 14, 2015 at 3:16 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 10/13/2015 11:02 AM, Masahiko Sawada wrote:
>> I thought that this feature for postgresql should be simple at first
>> implementation.
>> It would be good even if there are some restriction such as the
>> nesting level, the group setting.
>> The another new approach that I came up with is,
>> * Add new parameter synchronous_replication_method (say s_r_method)
>> which can have two names: 'priority', 'quorum'
>> * If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3')
>> is handled using priority. It's same as '[n1,n2,n3]' in dedicated
>> language.
>> * If s_r_method = 'quorum', the value of s_s_names is handled using
>> quorum commit, It's same as '(n1,n2,n3)' in dedicated language.
>
> Well, the first question is: can you implement both of these things for
> 9.6, realistically?
>  If you can implement them, then we can argue about
> configuration format later.  It's even possible that the nature of your
> implementation will enforce a particular syntax.
>
> For example, if your implementation requires sync groups to be named,
> then we have to include group names in the syntax.  If you can't
> implement nesting in the near future, there's no reason to have a syntax
> for it.

Yes, I can implement both without nesting.
The draft patch of replication using priority is already implemented
by Michael, so I need to implement simple quorum commit logic and
merge them.

>> * Setting of synchronous_standby_names is same as today. That is, the
>> storing the nesting value is not supported.
>> * If we want to support more complex syntax like what we are
>> discussing, we can add the new value to s_r_method, for example
>> 'complex', 'json'.
>
> I think having two different syntaxes is a bad idea.  I'd rather have a
> wholly proprietary configuration markup than deal with two alternate ones.
>

I agree, we should choose one or the other.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Oct 14, 2015 at 3:28 AM, Masahiko Sawada wrote:
> The draft patch of replication using priority is already implemented
> by Michael, so I need to implement simple quorum commit logic and
> merge them.

The last patch in date I know of is this one:
http://www.postgresql.org/message-id/CAB7nPqRFSLmHbYonra0=p-X8MJ-XTL7oxjP_QXDJGsjpvWRXPA@mail.gmail.com
It would surely need a rebase.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Oct 14, 2015 at 3:02 AM, Masahiko Sawada wrote:
> On Sat, Oct 10, 2015 at 4:35 AM, Robert Haas wrote:
>> On Fri, Oct 9, 2015 at 12:00 AM, Amit Kapila wrote:
>>> Sounds like both the approaches have some pros and cons, also there are
>>> some people who prefer mini-language and others who prefer JSON.  I think
>>> one thing that might help, is to check how other databases support this
>>> feature or somewhat similar to this feature (mainly with respect to User
>>> Interface), as that can help us in knowing what users are already familiar
>>> with.
>>
>> +1!

Thanks for having a look at that!

> For example, MySQL 5.7 has similar feature, but it doesn't support
> quorum commit, and is simpler than postgresql attempting feature.
> There is one configuration parameter in MySQL 5.7 which indicates the
> number of sync replication node.
> The primary server commit when the primary server receives the
> specified number of ACK from standby server regardless name of standby
> server.

Hm. This is not very helpful in the case we specifically mentioned
upthread at some point with 2 data centers: the first one has the master
and a sync standby, and the second one has a set of standbys. We need to
be sure that the standby in DC1 acknowledges all the time, and we
would only need to wait for one or more of them in DC2. I still
believe that this is the main use case for this feature: to ensure a
proper failover without data loss if one data center is blown away by a
meteorite.

> And IIRC, Oracle database also doesn't support the quorum commit as well.
> The settings standby server sync or async is specified per standby
> server in configuration parameter in primary node.

And I guess that they manage standby nodes using a system catalog
then, being able to change the state of a node from async to sync with
something at SQL level? Is that right?

> I thought that this feature for postgresql should be simple at first
> implementation.

And extensible.

> It would be good even if there are some restriction such as the
> nesting level, the group setting.
> The another new approach that I came up with is,
> * Add new parameter synchronous_replication_method (say s_r_method)
> which can have two names: 'priority', 'quorum'
> * If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3')
> is handled using priority. It's same as '[n1,n2,n3]' in dedicated
> language.
> * If s_r_method = 'quorum', the value of s_s_names is handled using
> quorum commit, It's same as '(n1,n2,n3)' in dedicated language.
> * Setting of synchronous_standby_names is same as today. That is, the
> storing the nesting value is not supported.
> * If we want to support more complex syntax like what we are
> discussing, we can add the new value to s_r_method, for example
> 'complex', 'json'.

If we go that path, I think that we still would need an extra
parameter to control the number of nodes that need to be taken from
the set defined in s_s_names whichever of quorum or priority is used.
Let's not forget that in the current configuration the first node
listed in s_s_names and *connected* to the master will be used to
acknowledge the commit.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:


On Wed, Oct 14, 2015 at 10:38 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Wed, Oct 14, 2015 at 3:02 AM, Masahiko Sawada wrote:

> It would be good even if there are some restriction such as the
> nesting level, the group setting.
> The another new approach that I came up with is,
> * Add new parameter synchronous_replication_method (say s_r_method)
> which can have two names: 'priority', 'quorum'
> * If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3')
> is handled using priority. It's same as '[n1,n2,n3]' in dedicated
> language.
> * If s_r_method = 'quorum', the value of s_s_names is handled using
> quorum commit, It's same as '(n1,n2,n3)' in dedicated language.
> * Setting of synchronous_standby_names is same as today. That is, the
> storing the nesting value is not supported.
> * If we want to support more complex syntax like what we are
> discussing, we can add the new value to s_r_method, for example
> 'complex', 'json'.

If we go that path, I think that we still would need an extra
parameter to control the number of nodes that need to be taken from
the set defined in s_s_names whichever of quorum or priority is used.
Let's not forget that in the current configuration the first node
listed in s_s_names and *connected* to the master will be used to
acknowledge the commit.

Would it be better to just use a simple language instead of 3 different parameters? 

s_s_names = 2[X,Y,Z]  # 2 priority
s_s_names = 1(A,B,C) # 1 quorum
s_s_names = R,S,T # default behavior: 1 priority?


Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
Replying to multiple members.

> Hm. This is not much helpful in the case we especially mentioned
> upthread at some point with 2 data centers, first one has the master
> and a sync standby, and second one has a set of standbys. We need to
> be sure that the standby in DC1 acknowledges all the time, and we
> would only need to wait for one or more of them in DC2. I still
> believe that this is the main use case for this feature to ensure a
> proper failover without data loss if one data center blows away with a
> meteorite.

Yes, I think so too.
In such a case, the idea I posted yesterday could handle it with the
following settings:
* s_r_method = 'quorum'
* s_s_names = 'tokyo, seattle'
* s_s_nums = 2
* application_name of the first standby, which is in DC1, is 'tokyo',
and application_name of other standbys, which are in DC2, is
'seattle'.

> And I guess that they manage standby nodes using a system catalog
> then, being able to change the state of a node from async to sync with
> something at SQL level? Is that right?

I think that's right.

>
> If we go that path, I think that we still would need an extra
> parameter to control the number of nodes that need to be taken from
> the set defined in s_s_names whichever of quorum or priority is used.
> Let's not forget that in the current configuration the first node
> listed in s_s_names and *connected* to the master will be used to
> acknowledge the commit.

Yeah, such a parameter is needed. I had forgotten to consider that.

>
>
> Would it be better to just use a simple language instead of 3 different
> parameters?
>
> s_s_names = 2[X,Y,Z]  # 2 priority
> s_s_names = 1(A,B,C) # 1 quorum
> s_s_names = R,S,T # default behavior: 1 priority?

I think this means that we would choose the dedicated language approach
instead of the JSON format approach.
If we want to configure multiple sync replication more complexly, we would
have no choice other than improving the dedicated language.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Oct 14, 2015 at 5:58 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>
>
> On Wed, Oct 14, 2015 at 10:38 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>
>> On Wed, Oct 14, 2015 at 3:02 AM, Masahiko Sawada wrote:
>>
>> > It would be good even if there are some restriction such as the
>> > nesting level, the group setting.
>> > The another new approach that I came up with is,
>> > * Add new parameter synchronous_replication_method (say s_r_method)
>> > which can have two names: 'priority', 'quorum'
>> > * If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3')
>> > is handled using priority. It's same as '[n1,n2,n3]' in dedicated
>> > language.
>> > * If s_r_method = 'quorum', the value of s_s_names is handled using
>> > quorum commit, It's same as '(n1,n2,n3)' in dedicated language.
>> > * Setting of synchronous_standby_names is same as today. That is, the
>> > storing the nesting value is not supported.
>> > * If we want to support more complex syntax like what we are
>> > discussing, we can add the new value to s_r_method, for example
>> > 'complex', 'json'.
>>
>> If we go that path, I think that we still would need an extra
>> parameter to control the number of nodes that need to be taken from
>> the set defined in s_s_names whichever of quorum or priority is used.
>> Let's not forget that in the current configuration the first node
>> listed in s_s_names and *connected* to the master will be used to
>> acknowledge the commit.
>
>
> Would it be better to just use a simple language instead of 3 different
> parameters?
>
> s_s_names = 2[X,Y,Z]  # 2 priority
> s_s_names = 1(A,B,C) # 1 quorum
> s_s_names = R,S,T # default behavior: 1 priority?

Yeah, the main use case for this feature would just be that for most users:
s_s_names = 2[dc1_standby,1(dc2_standby1, dc2_standby2)]
Meaning that we wait for dc1_standby, which is a standby on data
center 1, and one of the dc2_standby* set which are standbys in data
center 2.
So the following minimal characteristics would be needed:
- support for priority selectivity for N nodes
- support for quorum selectivity for N nodes
- support for nested sets of nodes, at least 2 levels deep.
The ability to define named groups of nodes would not even be needed.
If we have that, I would say that we already do better than OrXXXe and
MyXXL, to cite two of them. And if we can get that for 9.6 or even
9.7, that would be really great.
Regards,
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Wed, Oct 14, 2015 at 3:16 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 10/13/2015 11:02 AM, Masahiko Sawada wrote:
>> I thought that this feature for postgresql should be simple at first
>> implementation.
>> It would be good even if there are some restriction such as the
>> nesting level, the group setting.
>> The another new approach that I came up with is,
>> * Add new parameter synchronous_replication_method (say s_r_method)
>> which can have two names: 'priority', 'quorum'
>> * If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3')
>> is handled using priority. It's same as '[n1,n2,n3]' in dedicated
>> language.
>> * If s_r_method = 'quorum', the value of s_s_names is handled using
>> quorum commit, It's same as '(n1,n2,n3)' in dedicated language.
>
> Well, the first question is: can you implement both of these things for
> 9.6, realistically?  If you can implement them, then we can argue about
> configuration format later.  It's even possible that the nature of your
> implementation will enforce a particular syntax.
>

Hi,

Attached patch is a rough patch which supports multi sync replication
by another approach I sent before.

The new GUC parameters are:
* synchronous_standby_num, which specifies the number of standby
servers using sync rep. (default is 0)
* synchronous_replication_method, which specifies replication method;
priority or quorum. (default is priority)

The behaviour of 'priority' and 'quorum' is the same as what we've been
discussing, but I'll write an overview of them again here.

[Priority Method]
Each standby server has a different priority, and the active standby
servers with the top N priorities become sync standby servers.
If synchronous_standby_names = '*', all active standby servers would be
sync standby servers.
If you want to set up a standby like in 9.5 or before, you can set
synchronous_standby_num = 1.

[Quorum Method]
The standby servers all have the same priority 1, and all the active
standby servers will be sync standby servers.
The master server has to wait for ACKs from at least N sync standby
servers before COMMIT.
If synchronous_standby_names = '*', all active standby servers would be
sync standby servers.

[Use case]
This patch can handle the main use case Michael described:
There are 2 data centers, first one has the master and a sync standby,
and second one has a set of standbys.
We need to be sure that the standby in DC1 acknowledges all the time,
and we would only need to wait for one or more of them in DC2.

In order to handle this use case, you set up the standbys and GUC
parameters as follows.
* synchronous_standby_names = 'DC1, DC2'
* synchronous_standby_num = 2
* synchronous_replication_method = quorum
* The name of standby server in DC1 is 'DC1', and the names of two
standby servers in DC2 are 'DC2'.

[Extensible]
By setting the same application_name on different standbys, we can set up
sync replication with grouped standbys.
If we want to set up replication more complexly and flexibly, we could
add new syntax for s_s_names (e.g., JSON format or a dedicated language)
and add new kinds of values for synchronous_replication_method, e.g.
s_r_method = 'complex'.

And this patch doesn't need a new parser for the GUC parameter.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Beena Emerson
Date:

On Mon, Oct 19, 2015 at 8:47 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

Attached patch is a rough patch which supports multi sync replication
by another approach I sent before.

The new GUC parameters are:
* synchronous_standby_num, which specifies the number of standby
servers using sync rep. (default is 0)
* synchronous_replication_method, which specifies replication method;
priority or quorum. (default is priority)

The behaviour of 'priority' and 'quorum' are same as what we've been discussing.
But I write overview of these here again here.

[Priority Method]
The standby server has each different priority, and the active standby
servers having the top N priroity are become sync standby server.
If synchronous_standby_names = '*', the all active standby server
would be sync standby server.
If you want to set up standby like 9.5 or before, you can set
synchronous_standby_num = 1.


 
I used the following setting with 2 servers A and D connected:

synchronous_standby_names = 'A,B,C,D'
synchronous_standby_num = 2
synchronous_replication_method = 'priority'

Though s_r_m = 'quorum' worked fine, changing it to 'priority' caused a
segmentation fault.

Regards,

Beena Emerson

Have a Great Day!

Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Tue, Oct 20, 2015 at 8:10 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>
> On Mon, Oct 19, 2015 at 8:47 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>>
>> Hi,
>>
>> Attached patch is a rough patch which supports multi sync replication
>> by another approach I sent before.
>>
>> The new GUC parameters are:
>> * synchronous_standby_num, which specifies the number of standby
>> servers using sync rep. (default is 0)
>> * synchronous_replication_method, which specifies replication method;
>> priority or quorum. (default is priority)
>>
>> The behaviour of 'priority' and 'quorum' are same as what we've been
>> discussing.
>> But I write overview of these here again here.
>>
>> [Priority Method]
>> The standby server has each different priority, and the active standby
>> servers having the top N priority become sync standby servers.
>> If synchronous_standby_names = '*', the all active standby server
>> would be sync standby server.
>> If you want to set up standby like 9.5 or before, you can set
>> synchronous_standby_num = 1.
>>
>
>
> I used the following setting with 2 servers A and D connected:
>
> synchronous_standby_names = 'A,B,C,D'
> synchronous_standby_num = 2
> synchronous_replication_method = 'priority'
>
> Though s_r_m = 'quorum' worked fine, changing it to 'priority' caused
> segmentation fault.
>

Thank you for taking a look!
This patch is a tool for discussion, so I'm not going to fix this bug
until we reach consensus.

We are still discussing in order to find a solution that can get consensus.
I feel that it's difficult to choose between the two approaches within
this development cycle, and there would not be time to implement such a
big feature even if we chose one.
But this feature is obviously needed by many users.
So I'm considering simpler and more extensible solutions; the idea I
posted is one of them.
Another approach worth considering is just specifying the number of sync
standbys. It can also cover the main use cases in some situations.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Thu, Oct 22, 2015 at 12:47 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Oct 20, 2015 at 8:10 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>>
>> On Mon, Oct 19, 2015 at 8:47 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>> wrote:
>>>
>>>
>>> Hi,
>>>
>>> Attached patch is a rough patch which supports multi sync replication
>>> by another approach I sent before.
>>>
>>> The new GUC parameters are:
>>> * synchronous_standby_num, which specifies the number of standby
>>> servers using sync rep. (default is 0)
>>> * synchronous_replication_method, which specifies replication method;
>>> priority or quorum. (default is priority)
>>>
>>> The behaviour of 'priority' and 'quorum' are same as what we've been
>>> discussing.
>>> But I write overview of these here again here.
>>>
>>> [Priority Method]
>>> The standby server has each different priority, and the active standby
>>> servers having the top N priority become sync standby servers.
>>> If synchronous_standby_names = '*', the all active standby server
>>> would be sync standby server.
>>> If you want to set up standby like 9.5 or before, you can set
>>> synchronous_standby_num = 1.
>>>
>>
>>
>> I used the following setting with 2 servers A and D connected:
>>
>> synchronous_standby_names = 'A,B,C,D'
>> synchronous_standby_num = 2
>> synchronous_replication_method = 'priority'
>>
>> Though s_r_m = 'quorum' worked fine, changing it to 'priority' caused
>> segmentation fault.
>>
>
> Thank you for taking a look!
> This patch is a tool for discussion, so I'm not going to fix this bug
> until getting consensus.
>
> We are still under the discussion to find solution that can get consensus.
> I felt that it's difficult to select from the two approaches within
> this development cycle, and there would not be time to implement such
> big feature even if we selected.
> But this feature is obviously needed by many users.
> So I'm considering more simple and extensible something solution, the
> idea I posted is one of them.
> The another worth considering approach is that just specifying the
> number of sync standby. It also can cover the main use cases in
> some-cases.

Yes, it covers the main and simple use case, like "I want to have multiple
synchronous replicas!". Even if we miss quorum commit in the first
version, the feature is still very useful.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Thu, Oct 29, 2015 at 11:16 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Oct 22, 2015 at 12:47 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Tue, Oct 20, 2015 at 8:10 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>>>
>>> On Mon, Oct 19, 2015 at 8:47 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>>> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> Attached patch is a rough patch which supports multi sync replication
>>>> by another approach I sent before.
>>>>
>>>> The new GUC parameters are:
>>>> * synchronous_standby_num, which specifies the number of standby
>>>> servers using sync rep. (default is 0)
>>>> * synchronous_replication_method, which specifies replication method;
>>>> priority or quorum. (default is priority)
>>>>
>>>> The behaviour of 'priority' and 'quorum' are same as what we've been
>>>> discussing.
>>>> But I write overview of these here again here.
>>>>
>>>> [Priority Method]
>>>> The standby server has each different priority, and the active standby
>>>> servers having the top N priroity are become sync standby server.
>>>> If synchronous_standby_names = '*', the all active standby server
>>>> would be sync standby server.
>>>> If you want to set up standby like 9.5 or before, you can set
>>>> synchronous_standby_num = 1.
>>>>
>>>
>>>
>>> I used the following setting with 2 servers A and D connected:
>>>
>>> synchronous_standby_names = 'A,B,C,D'
>>> synchronous_standby_num = 2
>>> synchronous_replication_method = 'priority'
>>>
>>> Though s_r_m = 'quorum' worked fine, changing it to 'priority' caused
>>> segmentation fault.
>>>
>>
>> Thank you for taking a look!
>> This patch is a tool for discussion, so I'm not going to fix this bug
>> until getting consensus.
>>
>> We are still under the discussion to find solution that can get consensus.
>> I felt that it's difficult to select from the two approaches within
>> this development cycle, and there would not be time to implement such
>> big feature even if we selected.
>> But this feature is obviously needed by many users.
>> So I'm considering more simple and extensible something solution, the
>> idea I posted is one of them.
>> The another worth considering approach is that just specifying the
>> number of sync standby. It also can cover the main use cases in
>> some-cases.
>
> Yes, it covers main and simple use case like "I want to have multiple
> synchronous replicas!". Even if we miss quorum commit at the first
> version, the feature is still very useful.

It can cover not only the case you mentioned but also the main use case
Michael mentioned, by setting the same application_name.
And that first-version patch is almost implemented, so it just needs to
be reviewed.

I think it would be good to implement the simple feature in the
first version, and then adjust the design based on opinions and
feedback from more users and use cases.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Fri, 13 Nov 2015 09:07:01 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoC9Vi8wOGtXio3Z1NwoVfXBJPNFtt7+5jadVHKn17uHOg@mail.gmail.com>
> On Thu, Oct 29, 2015 at 11:16 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Thu, Oct 22, 2015 at 12:47 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
...
> >> This patch is a tool for discussion, so I'm not going to fix this bug
> >> until getting consensus.
> >>
> >> We are still under the discussion to find solution that can get consensus.
> >> I felt that it's difficult to select from the two approaches within
> >> this development cycle, and there would not be time to implement such
> >> big feature even if we selected.
> >> But this feature is obviously needed by many users.
> >> So I'm considering more simple and extensible something solution, the
> >> idea I posted is one of them.
> >> The another worth considering approach is that just specifying the
> >> number of sync standby. It also can cover the main use cases in
> >> some-cases.
> >
> > Yes, it covers main and simple use case like "I want to have multiple
> > synchronous replicas!". Even if we miss quorum commit at the first
> > version, the feature is still very useful.

+1

> It can cover not only the case you mentioned but also main use case
> Michael mentioned by setting same application_name.
> And that first version patch is almost implemented, so just needs to
> be reviewed.
> 
> I think that it would be good to implement the simple feature at the
> first version, and then coordinate the design based on opinion and
> feed backs from more user, use-case.

Yeah. I agree with it. And I have two proposals in this
direction.

- Notation

  Adding synchronous_replication_method as a variable to select other
  syntaxes, alongside synchronous_standby_names, probably raises no
  argument except over its name. But I feel synchronous_standby_num
  looks a bit too specific.

  I'd like to propose this, if it doesn't reprise the argument on
  notation for replication definitions :p

  The following two GUCs would be enough to bear future expansion
  of notation syntax and/or method.

  synchronous_standby_names :  as it is

  synchronous_replication_method:

    default is "1-priority", which means the same as the current
    meaning.  Possible additional values so far would be,

     "n-priority": the format of s_s_names is "n, <name>, <name>, <name>...",
                   where n is the number of required acknowledges.

     "n-quorum":   the format of s_s_names is the same as above, but
                   it is read in quorum context.

  These can be expanded, for example, as follows, but in future.

     "complex" : Michael's format.
     "json"    : JSON?
     "json-ext": specify JSON in an external file.

  Even after we have complex notations, I suppose that many use
  cases are covered by the first three notations.


- Internal design

  What should be done in SyncRepReleaseWaiters() is calculating a
  pair of LSNs that can be regarded as synced and deciding whether
  *this* walsender has advanced the LSN pair, then trying to
  release backends that wait for the LSNs *if* this walsender has
  advanced them.

  From that point of view, the proposed patch will make redundant
  attempts to release backends.

  In addition to that, the patch looks to be a mixture of the current
  implementation and the new feature. These are for the same objective
  so they cannot coexist with each other, I think. As a result, code
  for both quorum/priority judgement appears at multiple levels in the
  call tree. This would be an obstacle for future (possible) expansion.

  So, I think this feature should be implemented as follows,

  SyncRepInitConfig reads the configuration and stores the result
  structure somewhere such as WalSnd->syncrepset_definition
  instead of WalSnd->sync_standby_priority, which should be
  removed. Nothing would be stored if the current wal sender is
  not a member of the defined replication set. Storing a pointer
  to a matching function there would increase the flexibility, but
  such an implementation in contrast will make the code difficult
  to read.. (I often look for the entity of xlogreader->read_page() ;)

  Then SyncRepSyncedLsnAdvancedTo() instead of
  SyncRepGetSynchronousStandbys() returns an LSN pair that can be
  regarded as 'synced' according to the specified definition of the
  replication set and whether this walsender has advanced the LSNs.

  Finally, SyncRepReleaseWaiters() uses it to release backends if
  needed.

  The differences among quorum/priority or others are confined in
  SyncRepSyncedLsnAdvancedTo(). As a result,
  SyncRepReleaseWaiters would look as follows.

  | SyncRepReleaseWaiters(void)
  | {
  |   if (MyWalSnd->syncrepset_definition == NULL || ...)
  |      return;
  |   ...
  |   if (!SyncRepSyncedLsnAdvancedTo(&flush_pos, &write_pos))
  |   {
  |     /* I haven't advanced the synced LSNs */
  |     LWLockRelease(SyncRepLock);
  |     return;
  |   }
  |   /* Set the lsn first so that when we wake backends they will release...

  I haven't thought concretely about what SyncRepSyncedLsnAdvancedTo
  does, but perhaps yes we can :p in an effective manner..

  What do you think about this?
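
As a purely illustrative aside, here is a standalone C model of what
the quorum flavor of that calculation could look like (all names and
the shape of the data are assumptions for this sketch, not code from
any posted patch):

/*
 * Hypothetical, standalone model of computing the "synced" LSN pair
 * under an n-quorum rule: with a quorum of k, the pair is the k-th
 * largest write LSN and the k-th largest flush LSN, each taken
 * independently over the active standbys.  Assumes k <= n.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the real type */

static int
cmp_desc(const void *a, const void *b)
{
    XLogRecPtr x = *(const XLogRecPtr *) a;
    XLogRecPtr y = *(const XLogRecPtr *) b;

    return (x < y) - (x > y);   /* sort descending */
}

static void
quorum_synced_lsns(XLogRecPtr *write_lsns, XLogRecPtr *flush_lsns,
                   int n, int k,
                   XLogRecPtr *synced_write, XLogRecPtr *synced_flush)
{
    qsort(write_lsns, n, sizeof(XLogRecPtr), cmp_desc);
    qsort(flush_lsns, n, sizeof(XLogRecPtr), cmp_desc);
    *synced_write = write_lsns[k - 1];  /* k standbys wrote this far */
    *synced_flush = flush_lsns[k - 1];  /* k standbys flushed this far */
}

int
main(void)
{
    XLogRecPtr  w[] = {10, 8, 12};
    XLogRecPtr  f[] = {5, 6, 9};
    XLogRecPtr  sw, sf;

    quorum_synced_lsns(w, f, 3, 2, &sw, &sf);
    printf("synced write=%llu flush=%llu\n",
           (unsigned long long) sw, (unsigned long long) sf);
    return 0;   /* prints: synced write=10 flush=6 */
}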


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Fri, Nov 13, 2015 at 12:52 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Fri, 13 Nov 2015 09:07:01 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoC9Vi8wOGtXio3Z1NwoVfXBJPNFtt7+5jadVHKn17uHOg@mail.gmail.com>
>> On Thu, Oct 29, 2015 at 11:16 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> > On Thu, Oct 22, 2015 at 12:47 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> ...
>> >> This patch is a tool for discussion, so I'm not going to fix this bug
>> >> until getting consensus.
>> >>
>> >> We are still under the discussion to find solution that can get consensus.
>> >> I felt that it's difficult to select from the two approaches within
>> >> this development cycle, and there would not be time to implement such
>> >> big feature even if we selected.
>> >> But this feature is obviously needed by many users.
>> >> So I'm considering more simple and extensible something solution, the
>> >> idea I posted is one of them.
>> >> The another worth considering approach is that just specifying the
>> >> number of sync standby. It also can cover the main use cases in
>> >> some-cases.
>> >
>> > Yes, it covers main and simple use case like "I want to have multiple
>> > synchronous replicas!". Even if we miss quorum commit at the first
>> > version, the feature is still very useful.
>
> +1
>
>> It can cover not only the case you mentioned but also main use case
>> Michael mentioned by setting same application_name.
>> And that first version patch is almost implemented, so just needs to
>> be reviewed.
>>
>> I think that it would be good to implement the simple feature at the
>> first version, and then coordinate the design based on opinion and
>> feed backs from more user, use-case.
>
> Yeah. I agree with it. And I have two proposals in this
> direction.
>
> - Notation
>
>  synchronous_standby_names, and synchronous_replication_method as
>  a variable to provide other syntax is probably no argument
>  except its name. But I feel synchronous_standby_num looks bit
>  too specific.
>
>  I'd like to propose if this doesn't reprise the argument on
>  notation for replication definitions:p
>
>  The following two GUCs would be enough to bear future expansion
>  of notation syntax and/or method.
>
>  synchronous_standby_names :  as it is
>
>  synchronous_replication_method:
>
>    default is "1-priority", which means the same with the current
>    meaning.  possible additional values so far would be,
>
>     "n-priority": the format of s_s_names is "n, <name>, <name>, <name>...",
>                   where n is the number of required acknowledges.

One question: what is the difference between the leading "n" in
s_s_names and the leading "n" of "n-priority"?

>
>     "n-quorum":   the format of s_s_names is the same as above, but
>                   it is read in quorum context.
>
>  These can be expanded, for example, as follows, but in future.
>
>     "complex" : Michael's format.
>     "json"    : JSON?
>     "json-ext": specify JSON in external file.
>
> Even after we have complex notations, I suppose that many use
> cases are coverd by the first tree notations.

I'm not sure it's desirable to implement all kinds of methods in core.
I think it's better to extend replication to be more extensible, for
example by adding a hook function, and then implement other approaches
as contrib modules.

>
> - Internal design
>
>  What should be done in SyncRepReleaseWaiters() is calculating a
>  pair of LSNs that can be regarded as synced and decide whether
>  *this* walsender have advanced the LSN pair, then trying to
>  release backends that wait for the LSNs *if* this walsender has
>  advanced them.
>
>  From such point, the proposed patch will make redundant trials
>  to release backens.
>
>  Addition to that, the patch looks to be a mixture of the current
>  implement and the new feature. These are for the same objective
>  so they cannot coexist each other, I think. As the result, codes
>  for both quorum/priority judgement appear at multiple level in
>  call tree. This would be an obstacle for future (possible)
>  expansion.
>
>  So, I think this feature should be implemented as following,
>
>  SyncRepInitConfig reads the configuration and stores the result
>  structure into elsewhere such like WalSnd->syncrepset_definition
>  instead of WalSnd->sync_standby_priority, which should be
>  removed. Nothing would be stored if the current wal sender is
>  not a member of the defined replication set. Storing a pointer
>  to matching function there would increase the flexibility but
>  such implement in contrast will make the code difficult to be
>  read.. (I often look for the entity of xlogreader->read_page()
>  ;)
>
>  Then SyncRepSyncedLsnAdvancedTo() instead of
>  SyncRepGetSynchronousStandbys() returns an LSN pair that can be
>  regarded as 'synced' according to specified definition of
>  replication set and whether this walsender have advanced the
>  LSNs.
>
>  Finally, SyncRepReleaseWaiters() uses it to release backends if
>  needed.
>
>  The differences among quorum/priority or others are confined in
>  SyncRepSyncedLsnAdvancedTo(). As the result,
>  SyncRepReleaseWaiters would look as following.
>
>  | SyncRepReleaseWaiters(void)
>  | {
>  |   if (MyWalSnd->syncrepset_definition == NULL || ...)
>  |      return;
>  |   ...
>  |   if (!SyncRepSyncedLsnAdvancedTo(&flush_pos, &write_pos))
>  |   {
>  |     /* I haven't advanced the synced LSNs */
>  |     LWLockRelease(SyncRepLock);
>  |     rerturn;
>  |   }
>  |   /* Set the lsn first so that when we wake backends they will relase...
>
>  I'm not thought concretely about what SyncRepSyncedLsnAdvancedTo
>  does but perhaps yes we can:p in effective manner..
>
>  What do you think about this?

I agree with this design.
What SyncRepSyncedLsnAdvancedTo() does would be different for each
method, so we can implement "n-priority" style multiple sync
replication in the first version.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Tue, 17 Nov 2015 01:09:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoDhqGB=EtBfqnkHxR8T53d+8qMs4DPm5HVyq4bA2oR5eQ@mail.gmail.com>
> > - Notation
> >
> >  synchronous_standby_names, and synchronous_replication_method as
> >  a variable to provide other syntax is probably no argument
> >  except its name. But I feel synchronous_standby_num looks bit
> >  too specific.
> >
> >  I'd like to propose if this doesn't reprise the argument on
> >  notation for replication definitions:p
> >
> >  The following two GUCs would be enough to bear future expansion
> >  of notation syntax and/or method.
> >
> >  synchronous_standby_names :  as it is
> >
> >  synchronous_replication_method:
> >
> >    default is "1-priority", which means the same with the current
> >    meaning.  possible additional values so far would be,
> >
> >     "n-priority": the format of s_s_names is "n, <name>, <name>, <name>...",
> >                   where n is the number of required acknowledges.
> 
> One question is that what is different between the leading "n" in
> s_s_names and the leading "n" of "n-priority"?

Ah. Sorry for the ambiguous description. The 'n' in s_s_names
represents an arbitrary integer, while the one in "n-priority"
is literally an "n", meaning "a format with any number of
priority hosts" as a whole. For instance,

synchronous_replication_method = "n-priority"
synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter"

I added the "n-" of "n-priority" to distinguish it from "1-priority",
so if we don't provide "1-priority" for backward compatibility,
"priority" would be enough to represent the type.

By the way, s_r_method is not strictly necessary, but it would
be important for avoiding the complexity of autodetecting formats,
including currently undefined ones.
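
To make that reading concrete, here is a tiny standalone sketch of
splitting such an s_s_names value into the required ACK count and the
prioritized names (a hypothetical illustration only, not code from any
posted patch):

/* Read the proposed "n, name, name, ..." form of s_s_names. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    char    names[] = "2, mercury, venus, earth, mars, jupiter";
    char   *tok = strtok(names, ",");
    int     num_sync = atoi(tok);   /* leading field: required ACKs */
    int     priority = 1;

    printf("num_sync=%d\n", num_sync);
    while ((tok = strtok(NULL, ",")) != NULL)
    {
        while (*tok == ' ')
            tok++;                  /* trim the leading blank */
        printf("priority %d: %s\n", priority++, tok);
    }
    return 0;
}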


> >     "n-quorum":   the format of s_s_names is the same as above, but
> >                   it is read in quorum context.

The "n" of this is the same as above.

> >  These can be expanded, for example, as follows, but in future.
> >
> >     "complex" : Michael's format.
> >     "json"    : JSON?
> >     "json-ext": specify JSON in external file.
> >
> > Even after we have complex notations, I suppose that many use
> > cases are coverd by the first tree notations.
> 
> I'm not sure it's desirable to implement the all kind of methods into core.
> I think it's better to extend replication  in order to be more
> extensibility like adding hook function.
> And then other approach is implemented as a contrib module.

I agree with you. I proposed the following internal design having
that in mind.

> > - Internal design
> >
> >  What should be done in SyncRepReleaseWaiters() is calculating a
> >  pair of LSNs that can be regarded as synced and decide whether
> >  *this* walsender have advanced the LSN pair, then trying to
> >  release backends that wait for the LSNs *if* this walsender has
> >  advanced them.
> >
> >  From such point, the proposed patch will make redundant trials
> >  to release backens.
> >
> >  Addition to that, the patch looks to be a mixture of the current
> >  implement and the new feature. These are for the same objective
> >  so they cannot coexist each other, I think. As the result, codes
> >  for both quorum/priority judgement appear at multiple level in
> >  call tree. This would be an obstacle for future (possible)
> >  expansion.
> >
> >  So, I think this feature should be implemented as following,
> >
> >  SyncRepInitConfig reads the configuration and stores the result
> >  structure into elsewhere such like WalSnd->syncrepset_definition
> >  instead of WalSnd->sync_standby_priority, which should be
> >  removed. Nothing would be stored if the current wal sender is
> >  not a member of the defined replication set. Storing a pointer
> >  to matching function there would increase the flexibility but
> >  such implement in contrast will make the code difficult to be
> >  read.. (I often look for the entity of xlogreader->read_page()
> >  ;)
> >
> >  Then SyncRepSyncedLsnAdvancedTo() instead of
> >  SyncRepGetSynchronousStandbys() returns an LSN pair that can be
> >  regarded as 'synced' according to specified definition of
> >  replication set and whether this walsender have advanced the
> >  LSNs.
> >
> >  Finally, SyncRepReleaseWaiters() uses it to release backends if
> >  needed.
> >
> >  The differences among quorum/priority or others are confined in
> >  SyncRepSyncedLsnAdvancedTo(). As the result,
> >  SyncRepReleaseWaiters would look as following.
> >
> >  | SyncRepReleaseWaiters(void)
> >  | {
> >  |   if (MyWalSnd->syncrepset_definition == NULL || ...)
> >  |      return;
> >  |   ...
> >  |   if (!SyncRepSyncedLsnAdvancedTo(&flush_pos, &write_pos))
> >  |   {
> >  |     /* I haven't advanced the synced LSNs */
> >  |     LWLockRelease(SyncRepLock);
> >  |     rerturn;
> >  |   }
> >  |   /* Set the lsn first so that when we wake backends they will relase...
> >
> >  I'm not thought concretely about what SyncRepSyncedLsnAdvancedTo
> >  does but perhaps yes we can:p in effective manner..
> >
> >  What do you think about this?
> 
> I agree with this design.
> What SyncRepSyncedLsnAdvancedTo() does would be different for each
> method, so we can implement "n-priority" style multiple sync
> replication at first version.

Maybe the first *additional* one if we decide to keep backward
compatibility, as the discussion above.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Tue, Nov 17, 2015 at 9:57 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Tue, 17 Nov 2015 01:09:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoDhqGB=EtBfqnkHxR8T53d+8qMs4DPm5HVyq4bA2oR5eQ@mail.gmail.com>
>> > - Notation
>> >
>> >  synchronous_standby_names, and synchronous_replication_method as
>> >  a variable to provide other syntax is probably no argument
>> >  except its name. But I feel synchronous_standby_num looks bit
>> >  too specific.
>> >
>> >  I'd like to propose if this doesn't reprise the argument on
>> >  notation for replication definitions:p
>> >
>> >  The following two GUCs would be enough to bear future expansion
>> >  of notation syntax and/or method.
>> >
>> >  synchronous_standby_names :  as it is
>> >
>> >  synchronous_replication_method:
>> >
>> >    default is "1-priority", which means the same with the current
>> >    meaning.  possible additional values so far would be,
>> >
>> >     "n-priority": the format of s_s_names is "n, <name>, <name>, <name>...",
>> >                   where n is the number of required acknowledges.
>>
>> One question is that what is different between the leading "n" in
>> s_s_names and the leading "n" of "n-priority"?
>
> Ah. Sorry for the ambiguous description. 'n' in s_s_names
> representing an arbitrary integer number and that in "n-priority"
> is literally an "n", meaning "a format with any number of
> priority hosts" as a whole. As an instance,
>
> synchronous_replication_method = "n-priority"
> synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter"
>
> I added "n-" of "n-priority" to distinguish with "1-priority" so
> if we won't provide "1-priority" for backward compatibility,
> "priority" would be enough to represent the type.
>
> By the way, s_r_method is not essentially necessary but it would
> be important to avoid complexity of autodetection of formats
> including currently undefined ones.

Thank you for your explanation; I understood it.

It means that the format of s_s_names will be changed, which would not be good.
So, how about adding just the s_r_method parameter, with the number of
required ACKs represented at the start of s_r_method?
For example, the following setting is the same as above.

synchronous_replication_method = "2-priority"
synchronous_standby_names = "mercury, venus, earth, mars, jupiter"

With the quorum method, we can set:
synchronous_replication_method = "2-quorum"
synchronous_standby_names = "mercury, venus, earth, mars, jupiter"

Thoughts?
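
For what it's worth, a minimal standalone sketch of how a value like
"2-quorum" could be split into the ACK count and the method name
(illustrative only, with error handling kept to a minimum; these names
are invented for the example):

/* Split a value like "2-quorum" into ACK count and method name. */
#include <stdio.h>
#include <stdlib.h>

static int
parse_sync_rep_method(const char *value, int *num_sync,
                      char *method, size_t methlen)
{
    char   *end;
    long    n = strtol(value, &end, 10);

    if (end == value || *end != '-' || n < 1)
        return -1;              /* no leading count, or count < 1 */
    *num_sync = (int) n;
    snprintf(method, methlen, "%s", end + 1);
    return 0;
}

int
main(void)
{
    int     n;
    char    method[32];

    if (parse_sync_rep_method("2-quorum", &n, method, sizeof(method)) == 0)
        printf("num_sync=%d method=%s\n", n, method);
    return 0;   /* prints: num_sync=2 method=quorum */
}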

>
>
>> >     "n-quorum":   the format of s_s_names is the same as above, but
>> >                   it is read in quorum context.
>
> The "n" of this is the same as above.
>
>> >  These can be expanded, for example, as follows, but in future.
>> >
>> >     "complex" : Michael's format.
>> >     "json"    : JSON?
>> >     "json-ext": specify JSON in external file.
>> >
>> > Even after we have complex notations, I suppose that many use
>> > cases are coverd by the first tree notations.
>>
>> I'm not sure it's desirable to implement the all kind of methods into core.
>> I think it's better to extend replication  in order to be more
>> extensibility like adding hook function.
>> And then other approach is implemented as a contrib module.
>
> I agree with you. I proposed the following internal design having
> that in mind.
>
>> > - Internal design
>> >
>> >  What should be done in SyncRepReleaseWaiters() is calculating a
>> >  pair of LSNs that can be regarded as synced and decide whether
>> >  *this* walsender have advanced the LSN pair, then trying to
>> >  release backends that wait for the LSNs *if* this walsender has
>> >  advanced them.
>> >
>> >  From such point, the proposed patch will make redundant trials
>> >  to release backens.
>> >
>> >  Addition to that, the patch looks to be a mixture of the current
>> >  implement and the new feature. These are for the same objective
>> >  so they cannot coexist each other, I think. As the result, codes
>> >  for both quorum/priority judgement appear at multiple level in
>> >  call tree. This would be an obstacle for future (possible)
>> >  expansion.
>> >
>> >  So, I think this feature should be implemented as following,
>> >
>> >  SyncRepInitConfig reads the configuration and stores the result
>> >  structure into elsewhere such like WalSnd->syncrepset_definition
>> >  instead of WalSnd->sync_standby_priority, which should be
>> >  removed. Nothing would be stored if the current wal sender is
>> >  not a member of the defined replication set. Storing a pointer
>> >  to matching function there would increase the flexibility but
>> >  such implement in contrast will make the code difficult to be
>> >  read.. (I often look for the entity of xlogreader->read_page()
>> >  ;)
>> >
>> >  Then SyncRepSyncedLsnAdvancedTo() instead of
>> >  SyncRepGetSynchronousStandbys() returns an LSN pair that can be
>> >  regarded as 'synced' according to specified definition of
>> >  replication set and whether this walsender have advanced the
>> >  LSNs.
>> >
>> >  Finally, SyncRepReleaseWaiters() uses it to release backends if
>> >  needed.
>> >
>> >  The differences among quorum/priority or others are confined in
>> >  SyncRepSyncedLsnAdvancedTo(). As the result,
>> >  SyncRepReleaseWaiters would look as following.
>> >
>> >  | SyncRepReleaseWaiters(void)
>> >  | {
>> >  |   if (MyWalSnd->syncrepset_definition == NULL || ...)
>> >  |      return;
>> >  |   ...
>> >  |   if (!SyncRepSyncedLsnAdvancedTo(&flush_pos, &write_pos))
>> >  |   {
>> >  |     /* I haven't advanced the synced LSNs */
>> >  |     LWLockRelease(SyncRepLock);
>> >  |     rerturn;
>> >  |   }
>> >  |   /* Set the lsn first so that when we wake backends they will relase...
>> >
>> >  I'm not thought concretely about what SyncRepSyncedLsnAdvancedTo
>> >  does but perhaps yes we can:p in effective manner..
>> >
>> >  What do you think about this?
>>
>> I agree with this design.
>> What SyncRepSyncedLsnAdvancedTo() does would be different for each
>> method, so we can implement "n-priority" style multiple sync
>> replication at first version.
>
> Maybe the first *additional* one if we decide to keep backward
> compatibility, as the discussion above.
>

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoC=AN+DKYNwsJp6COZ-6qmHXxuENxVPisxgPXcuXmPEvw@mail.gmail.com>
> >> One question is that what is different between the leading "n" in
> >> s_s_names and the leading "n" of "n-priority"?
> >
> > Ah. Sorry for the ambiguous description. 'n' in s_s_names
> > representing an arbitrary integer number and that in "n-priority"
> > is literally an "n", meaning "a format with any number of
> > priority hosts" as a whole. As an instance,
> >
> > synchronous_replication_method = "n-priority"
> > synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter"
> >
> > I added "n-" of "n-priority" to distinguish with "1-priority" so
> > if we won't provide "1-priority" for backward compatibility,
> > "priority" would be enough to represent the type.
> >
> > By the way, s_r_method is not essentially necessary but it would
> > be important to avoid complexity of autodetection of formats
> > including currently undefined ones.
> 
> Than you for your explanation, I understood that.
> 
> It means that the format of s_s_names will be changed, which would be not good.

I believe that the format for defining a "replication set"(?)
is not fixed yet, and it would become a more complex format in order
to support nested definitions. That should be a very different format
from the current simple list of names. So this is a selection among
three or possibly more designs, made to be tolerant of future
changes, I suppose.

1. Additional definition formats will in future be stored somewhere other than s_s_names.

2. Additional formats will be stored in s_s_names, and the format will be automatically detected.

3. (ditto), but the format is designated by s_r_method.

4. Any other way?

I chose the third way. What do you think about future expansion
of the format?

> So, how about the adding just s_r_method parameter and the number of
> required ACK is represented in the leading of s_r_method?
> For example, the following setting is same as above.
> 
> synchronous_replication_method = "2-priority"
> synchronous_standby_names = "mercury, venus, earth, mars, jupiter"

I *feel* it is the same or worse as having the third parameter
s_s_num as your previous design.

> In quorum method, we can set;
> synchronous_replication_method = "2-quorum"
> synchronous_standby_names = "mercury, venus, earth, mars, jupiter"
> 
> Thought?


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Oops. 

At Tue, 17 Nov 2015 19:40:10 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20151117.194010.17198448.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello,
> 
> At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoC=AN+DKYNwsJp6COZ-6qmHXxuENxVPisxgPXcuXmPEvw@mail.gmail.com>
> > >> One question is that what is different between the leading "n" in
> > >> s_s_names and the leading "n" of "n-priority"?
> > >
> > > Ah. Sorry for the ambiguous description. 'n' in s_s_names
> > > representing an arbitrary integer number and that in "n-priority"
> > > is literally an "n", meaning "a format with any number of
> > > priority hosts" as a whole. As an instance,
> > >
> > > synchronous_replication_method = "n-priority"
> > > synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter"
> > >
> > > I added "n-" of "n-priority" to distinguish with "1-priority" so
> > > if we won't provide "1-priority" for backward compatibility,
> > > "priority" would be enough to represent the type.
> > >
> > > By the way, s_r_method is not essentially necessary but it would
> > > be important to avoid complexity of autodetection of formats
> > > including currently undefined ones.
> > 
> > Than you for your explanation, I understood that.
> > 
> > It means that the format of s_s_names will be changed, which would be not good.
> 
> I believe that the format of definition of "replication set"(?)
> is not fixed and it would be more complex format to support
> nested definition. This should be in very different format from
> the current simple list of names. This is a selection among three
> or possiblly more disigns in order to be tolerable for future
> changes, I suppose.
> 
> 1. Additional formats of definition in future will be stored in
>    elsewhere of s_s_names.
> 
> 2. Additional format will be stored in s_s_names, the format will
>    be automatically detected.
> 
> 3. (ditto), the format is designated by s_r_method.
> 
> 4. Any other way?
> 
> I choosed the third way. What do you think about future expansion
> of the format?
> 
> > So, how about the adding just s_r_method parameter and the number of
> > required ACK is represented in the leading of s_r_method?
> > For example, the following setting is same as above.
> > 
> > synchronous_replication_method = "2-priority"
> > synchronous_standby_names = "mercury, venus, earth, mars, jupiter"
> 
> I *feel* it is the same or worse as having the third parameter
> s_s_num as your previous design.

I feel it is the same or worse *than* having the third parameter
s_s_num as your previous design.

> > In quorum method, we can set;
> > synchronous_replication_method = "2-quorum"
> > synchronous_standby_names = "mercury, venus, earth, mars, jupiter"
> > 
> > Thought?
> 
> 
> regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Tue, Nov 17, 2015 at 7:52 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Oops.
>
> At Tue, 17 Nov 2015 19:40:10 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20151117.194010.17198448.horiguchi.kyotaro@lab.ntt.co.jp>
>> Hello,
>>
>> At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoC=AN+DKYNwsJp6COZ-6qmHXxuENxVPisxgPXcuXmPEvw@mail.gmail.com>
>> > >> One question is that what is different between the leading "n" in
>> > >> s_s_names and the leading "n" of "n-priority"?
>> > >
>> > > Ah. Sorry for the ambiguous description. 'n' in s_s_names
>> > > representing an arbitrary integer number and that in "n-priority"
>> > > is literally an "n", meaning "a format with any number of
>> > > priority hosts" as a whole. As an instance,
>> > >
>> > > synchronous_replication_method = "n-priority"
>> > > synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter"
>> > >
>> > > I added "n-" of "n-priority" to distinguish with "1-priority" so
>> > > if we won't provide "1-priority" for backward compatibility,
>> > > "priority" would be enough to represent the type.
>> > >
>> > > By the way, s_r_method is not essentially necessary but it would
>> > > be important to avoid complexity of autodetection of formats
>> > > including currently undefined ones.
>> >
>> > Than you for your explanation, I understood that.
>> >
>> > It means that the format of s_s_names will be changed, which would be not good.
>>
>> I believe that the format of definition of "replication set"(?)
>> is not fixed and it would be more complex format to support
>> nested definition. This should be in very different format from
>> the current simple list of names. This is a selection among three
>> or possiblly more disigns in order to be tolerable for future
>> changes, I suppose.
>>
>> 1. Additional formats of definition in future will be stored in
>>    elsewhere of s_s_names.
>>
>> 2. Additional format will be stored in s_s_names, the format will
>>    be automatically detected.
>>
>> 3. (ditto), the format is designated by s_r_method.
>>
>> 4. Any other way?
>>
>> I choosed the third way. What do you think about future expansion
>> of the format?
>>

I agree with way #3 and the s_s_names format you suggested.
I think it's extensible and tolerant of future changes.
I'm going to implement the patch based on this idea if other hackers
agree with this design.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Nov 17, 2015 at 7:52 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> Oops.
>>
>> At Tue, 17 Nov 2015 19:40:10 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20151117.194010.17198448.horiguchi.kyotaro@lab.ntt.co.jp>
>>> Hello,
>>>
>>> At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoC=AN+DKYNwsJp6COZ-6qmHXxuENxVPisxgPXcuXmPEvw@mail.gmail.com>
>>> > >> One question is that what is different between the leading "n" in
>>> > >> s_s_names and the leading "n" of "n-priority"?
>>> > >
>>> > > Ah. Sorry for the ambiguous description. 'n' in s_s_names
>>> > > representing an arbitrary integer number and that in "n-priority"
>>> > > is literally an "n", meaning "a format with any number of
>>> > > priority hosts" as a whole. As an instance,
>>> > >
>>> > > synchronous_replication_method = "n-priority"
>>> > > synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter"
>>> > >
>>> > > I added "n-" of "n-priority" to distinguish with "1-priority" so
>>> > > if we won't provide "1-priority" for backward compatibility,
>>> > > "priority" would be enough to represent the type.
>>> > >
>>> > > By the way, s_r_method is not essentially necessary but it would
>>> > > be important to avoid complexity of autodetection of formats
>>> > > including currently undefined ones.
>>> >
>>> > Than you for your explanation, I understood that.
>>> >
>>> > It means that the format of s_s_names will be changed, which would be not good.
>>>
>>> I believe that the format of definition of "replication set"(?)
>>> is not fixed and it would be more complex format to support
>>> nested definition. This should be in very different format from
>>> the current simple list of names. This is a selection among three
>>> or possiblly more disigns in order to be tolerable for future
>>> changes, I suppose.
>>>
>>> 1. Additional formats of definition in future will be stored in
>>>    elsewhere of s_s_names.
>>>
>>> 2. Additional format will be stored in s_s_names, the format will
>>>    be automatically detected.
>>>
>>> 3. (ditto), the format is designated by s_r_method.
>>>
>>> 4. Any other way?
>>>
>>> I choosed the third way. What do you think about future expansion
>>> of the format?
>>>
>
> I agree with #3 way and the s_s_name format you suggested.
> I think that It's extensible and is tolerable for future changes.
> I'm going to implement the patch based on this idea if other hackers
> agree with this design.
>

Please find the attached draft patch which supports multi sync replication.
This patch adds a GUC parameter synchronous_replication_method, which
represents the method of synchronous replication.

[Design of replication method]
synchronous_replication_method has two values, 'priority' and
'1-priority', for now.
We can expand its set of values (e.g. 'quorum', 'json', etc.) in the future.

* s_r_method = '1-priority'
This method is for backward compatibility, so the syntax of s_s_names
is the same as today.
The behavior is the same as well.

* s_r_method = 'priority'
This method is for multiple synchronous replication using the priority method.
The syntax of s_s_names is,
   <number of sync standbys>, <standby name> [, ...]

For example, s_r_method = 'priority' and s_s_names = '2, node1, node2,
node3' means that the master waits for acknowledgement from at least
the 2 servers with the lowest priority values.
If 4 standbys (node1 - node4) are available, the master server waits for
acknowledgement from 'node1' and 'node2'.
The status of each wal sender is;

=# select application_name, sync_state from pg_stat_replication order
by application_name;
application_name | sync_state
------------------+------------
node1            | sync
node2            | sync
node3            | potential
node4            | async
(4 rows)

After 'node2' crashes, the master will wait for acknowledgement from
'node1' and 'node3'.
The status of each wal sender is;

=# select application_name, sync_state from pg_stat_replication order
by application_name;
application_name | sync_state
------------------+------------
node1            | sync
node3            | sync
node4            | async
(3 rows)

[Changing replication method]
When we want to change the replication method, we have to change
s_r_method first, and then call pg_reload_conf().
After changing the replication method, we can change s_s_names.

[Expanding replication method]
If we want to add a new replication method, we need to implement two
functions for each replication method:
* int SyncRepGetSynchronousStandbysXXX(int *sync_standbys)
  This function obtains the list of standbys currently considered
synchronous, and returns its length.
* bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
  This function obtains the LSNs (write, flush) considered as synced.
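
To illustrate that extension point, here is a hypothetical standalone
sketch of dispatching the two functions through a per-method table (all
names are invented for this example, and the stubs stand in for the
real walsender scans; this is not the patch's code):

/* Hypothetical per-method dispatch table for the two functions. */
#include <stdio.h>
#include <string.h>

typedef unsigned long long XLogRecPtr;  /* stand-in for the real type */

typedef struct SyncRepMethod
{
    const char *name;
    int  (*get_sync_standbys) (int *sync_standbys);
    int  (*get_sync_lsns) (XLogRecPtr *write_pos, XLogRecPtr *flush_pos);
} SyncRepMethod;

/* stub bodies; the real ones would scan the walsender array */
static int stub_standbys(int *s) { s[0] = 0; return 1; }
static int stub_lsns(XLogRecPtr *w, XLogRecPtr *f) { *w = *f = 0; return 1; }

static const SyncRepMethod methods[] = {
    {"1-priority", stub_standbys, stub_lsns},
    {"priority",   stub_standbys, stub_lsns},
    /* a future "quorum" entry would slot in here */
};

static const SyncRepMethod *
lookup_method(const char *name)
{
    for (size_t i = 0; i < sizeof(methods) / sizeof(methods[0]); i++)
        if (strcmp(methods[i].name, name) == 0)
            return &methods[i];
    return NULL;
}

int
main(void)
{
    const SyncRepMethod *m = lookup_method("priority");
    XLogRecPtr  w, f;

    if (m && m->get_sync_lsns(&w, &f))
        printf("method=%s\n", m->name);
    return 0;
}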

Also, debug code remains in this patch; you can trace this behavior
by enabling the DEBUG_REPLICATION macro.

Please give me feedback.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Wed, Dec 9, 2015 at 8:59 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Tue, Nov 17, 2015 at 7:52 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> Oops.
>>>
>>> At Tue, 17 Nov 2015 19:40:10 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20151117.194010.17198448.horiguchi.kyotaro@lab.ntt.co.jp>
>>>> Hello,
>>>>
>>>> At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoC=AN+DKYNwsJp6COZ-6qmHXxuENxVPisxgPXcuXmPEvw@mail.gmail.com>
>>>> > >> One question is that what is different between the leading "n" in
>>>> > >> s_s_names and the leading "n" of "n-priority"?
>>>> > >
>>>> > > Ah. Sorry for the ambiguous description. 'n' in s_s_names
>>>> > > representing an arbitrary integer number and that in "n-priority"
>>>> > > is literally an "n", meaning "a format with any number of
>>>> > > priority hosts" as a whole. As an instance,
>>>> > >
>>>> > > synchronous_replication_method = "n-priority"
>>>> > > synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter"
>>>> > >
>>>> > > I added "n-" of "n-priority" to distinguish with "1-priority" so
>>>> > > if we won't provide "1-priority" for backward compatibility,
>>>> > > "priority" would be enough to represent the type.
>>>> > >
>>>> > > By the way, s_r_method is not essentially necessary but it would
>>>> > > be important to avoid complexity of autodetection of formats
>>>> > > including currently undefined ones.
>>>> >
>>>> > Than you for your explanation, I understood that.
>>>> >
>>>> > It means that the format of s_s_names will be changed, which would be not good.
>>>>
>>>> I believe that the format of definition of "replication set"(?)
>>>> is not fixed and it would be more complex format to support
>>>> nested definition. This should be in very different format from
>>>> the current simple list of names. This is a selection among three
>>>> or possiblly more disigns in order to be tolerable for future
>>>> changes, I suppose.
>>>>
>>>> 1. Additional formats of definition in future will be stored in
>>>>    elsewhere of s_s_names.
>>>>
>>>> 2. Additional format will be stored in s_s_names, the format will
>>>>    be automatically detected.
>>>>
>>>> 3. (ditto), the format is designated by s_r_method.
>>>>
>>>> 4. Any other way?
>>>>
>>>> I choosed the third way. What do you think about future expansion
>>>> of the format?
>>>>
>>
>> I agree with #3 way and the s_s_name format you suggested.
>> I think that It's extensible and is tolerable for future changes.
>> I'm going to implement the patch based on this idea if other hackers
>> agree with this design.
>>
>
> Please find the attached draft patch which supports multi sync replication.
> This patch adds a GUC parameter synchronous_replication_method, which
> represent the method of synchronous replication.
>
> [Design of replication method]
> synchronous_replication_method has two values; 'priority' and
> '1-priority' for now.
> We can expand the kind of its value (e.g, 'quorum', 'json' etc) in the future.
>
> * s_r_method = '1-priority'
> This method is for backward compatibility, so the syntax of s_s_names
> is same as today.
> The behavior is same as well.
>
> * s_r_method = 'priority'
> This method is for multiple synchronous replication using priority method.
> The syntax of s_s_names is,
>    <number of sync standbys>, <standby name> [, ...]
>
> For example, s_r_method = 'priority' and s_s_names = '2, node1, node2,
> node3' means that the master waits for  acknowledge from at least 2
> lowest priority servers.
> If 4 standbys(node1 - node4) are available, the master server waits
> acknowledge from 'node1' and 'node2.
> The each status of wal senders are;
>
> =# select application_name, sync_state from pg_stat_replication order
> by application_name;
> application_name | sync_state
> ------------------+------------
> node1            | sync
> node2            | sync
> node3            | potential
> node4            | async
> (4 rows)
>
> After 'node2' crashed, the master will wait for acknowledge from
> 'node1' and 'node3'.
> The each status of wal senders are;
>
> =# select application_name, sync_state from pg_stat_replication order
> by application_name;
> application_name | sync_state
> ------------------+------------
> node1            | sync
> node3            | sync
> node4            | async
> (3 rows)
>
> [Changing replication method]
> When we want to change the replication method, we have to change the
> s_r_method  at first, and then do pg_reload_conf().
> After changing replication method, we can change the s_s_names.
>
> [Expanding replication method]
> If we want to expand new replication method additionally, we need to
> implement two functions for each replication method:
> * int SyncRepGetSynchronousStandbysXXX(int *sync_standbys)
>   This function obtains the list of standbys considered as synchronous
> at that time, and return its length.
> * bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>   This function obtains LSNs(write, flush) considered as synced.
>
> Also, this patch debug code is remain yet, you can debug this behavior
> using by enable DEBUG_REPLICATION macro.
>
> Please give me feedbacks.
>

I've attached an updated patch.
Please give me feedback.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Thank you for the new patch.

At Wed, 9 Dec 2015 20:59:20 +0530, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoDcn1fToCcYRqpU6fMY1xnpDdAKDTcbhW1R9M1mPM0kZg@mail.gmail.com>
> On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I agree with #3 way and the s_s_name format you suggested.
> > I think that It's extensible and is tolerable for future changes.
> > I'm going to implement the patch based on this idea if other hackers
> > agree with this design.
> 
> Please find the attached draft patch which supports multi sync replication.
> This patch adds a GUC parameter synchronous_replication_method, which
> represent the method of synchronous replication.
> 
> [Design of replication method]
> synchronous_replication_method has two values; 'priority' and
> '1-priority' for now.
> We can expand the kind of its value (e.g, 'quorum', 'json' etc) in the future.
> 
> * s_r_method = '1-priority'
> This method is for backward compatibility, so the syntax of s_s_names
> is same as today.
> The behavior is same as well.
> 
> * s_r_method = 'priority'
> This method is for multiple synchronous replication using priority method.
> The syntax of s_s_names is,
>    <number of sync standbys>, <standby name> [, ...]

Is there anyone opposed to this?

> For example, s_r_method = 'priority' and s_s_names = '2, node1, node2,
> node3' means that the master waits for  acknowledge from at least 2
> lowest priority servers.
> If 4 standbys(node1 - node4) are available, the master server waits
> acknowledge from 'node1' and 'node2.
> The each status of wal senders are;
> 
> =# select application_name, sync_state from pg_stat_replication order
> by application_name;
> application_name | sync_state
> ------------------+------------
> node1            | sync
> node2            | sync
> node3            | potential
> node4            | async
> (4 rows)
> 
> After 'node2' crashed, the master will wait for acknowledge from
> 'node1' and 'node3'.
> The each status of wal senders are;
> 
> =# select application_name, sync_state from pg_stat_replication order
> by application_name;
> application_name | sync_state
> ------------------+------------
> node1            | sync
> node3            | sync
> node4            | async
> (3 rows)
> 
> [Changing replication method]
> When we want to change the replication method, we have to change the
> s_r_method  at first, and then do pg_reload_conf().
> After changing replication method, we can change the s_s_names.

Mmm. It should be possible to change them at once, because s_r_method
and s_s_names contradict each other during the intermediate
state.

> [Expanding replication method]
> If we want to expand new replication method additionally, we need to
> implement two functions for each replication method:
> * int SyncRepGetSynchronousStandbysXXX(int *sync_standbys)
>   This function obtains the list of standbys considered as synchronous
> at that time, and return its length.
> * bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>   This function obtains LSNs(write, flush) considered as synced.
> 
> Also, this patch debug code is remain yet, you can debug this behavior
> using by enable DEBUG_REPLICATION macro.
> 
> Please give me feedbacks.

I haven't looked into this fully (sorry) but I'm concerned about
several points.


- I feel that some function names look too long. For example
  SyncRepGetSynchronousStandbysOnePriority occupies more than
  half of a line. (However, the replication code already has many
  long function names..)


- The comment below of SyncRepGetSynchronousStandbyOnePriority,
  >       /* Find lowest priority standby */

  The code the comment is for is doing the correct thing. However,
  the comment is confusing. A lower priority *value* means a higher
  priority.


- SyncRepGetSynchronousStandbys checks all if()s even when the
  first one matches. Use switch or "else if" there if they are
  exclusive of each other.


- Do you intend the DEBUG_REPLICATION code in
  SyncRepGetSynchronousStandbys*() to be the final shape?  The same
  code blocks which can work for both methods should be in their
  common caller, but SyncRepGetSyncLsns*() are a headache. Although
  it might need more refactoring, I'm sorry but I don't see a
  desirable shape for now.

  By the way, palloc(20)/free() in such a short term looks ineffective.

- SyncRepGetSyncLsnsPriority

  For the comment "/* Find lowest XLogRecPtr of both write and
  flush from sync_nodes */", an LSN is compared as early or late,
  so the comment would be better as something like "Keep/Collect the
  earliest write and flush LSNs among prioritized standbys".

  And what is more important, this block handles the write and flush
  LSNs jumbled together, and it results in missing the earliest (= most
  delayed) LSN in certain cases. The following is an example.

   Standby 1:  write LSN = 10, flush LSN = 5
   Standby 2:  write LSN = 8 , flush LSN = 6

  For this case, we finally get tmp_write = 10 and tmp_flush = 5
  from the current code, where tmp_write has the wrong value since
  LSN = 10 has *not* been written yet on standby 2. (the names
  "tmp_*" don't seem appropriate here)
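
A standalone check of the fix implied here, taking the write and flush
minimums independently (illustrative code only, not from any patch):

/* Track the write and flush minimums separately. */
#include <stdio.h>

typedef unsigned long long XLogRecPtr;

int
main(void)
{
    XLogRecPtr  write_lsn[] = {10, 8};  /* standby 1, standby 2 */
    XLogRecPtr  flush_lsn[] = {5, 6};
    XLogRecPtr  min_write = write_lsn[0];
    XLogRecPtr  min_flush = flush_lsn[0];

    for (int i = 1; i < 2; i++)
    {
        if (write_lsn[i] < min_write)
            min_write = write_lsn[i];   /* write LSN on its own */
        if (flush_lsn[i] < min_flush)
            min_flush = flush_lsn[i];   /* flush LSN on its own */
    }
    printf("write=%llu flush=%llu\n", min_write, min_flush);
    return 0;   /* prints: write=8 flush=5 */
}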
 


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Mon, Dec 14, 2015 at 2:57 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Thank you for the new patch.
>
> At Wed, 9 Dec 2015 20:59:20 +0530, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoDcn1fToCcYRqpU6fMY1xnpDdAKDTcbhW1R9M1mPM0kZg@mail.gmail.com>
>> On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> > I agree with #3 way and the s_s_name format you suggested.
>> > I think that It's extensible and is tolerable for future changes.
>> > I'm going to implement the patch based on this idea if other hackers
>> > agree with this design.
>>
>> Please find the attached draft patch which supports multi sync replication.
>> This patch adds a GUC parameter synchronous_replication_method, which
>> represent the method of synchronous replication.
>>
>> [Design of replication method]
>> synchronous_replication_method has two values; 'priority' and
>> '1-priority' for now.
>> We can expand the kind of its value (e.g, 'quorum', 'json' etc) in the future.
>>
>> * s_r_method = '1-priority'
>> This method is for backward compatibility, so the syntax of s_s_names
>> is same as today.
>> The behavior is same as well.
>>
>> * s_r_method = 'priority'
>> This method is for multiple synchronous replication using priority method.
>> The syntax of s_s_names is,
>>    <number of sync standbys>, <standby name> [, ...]
>
> Is there anyone opposed to this?
>
>> For example, s_r_method = 'priority' and s_s_names = '2, node1, node2,
>> node3' means that the master waits for  acknowledge from at least 2
>> lowest priority servers.
>> If 4 standbys(node1 - node4) are available, the master server waits
>> acknowledge from 'node1' and 'node2.
>> The each status of wal senders are;
>>
>> =# select application_name, sync_state from pg_stat_replication order
>> by application_name;
>> application_name | sync_state
>> ------------------+------------
>> node1            | sync
>> node2            | sync
>> node3            | potential
>> node4            | async
>> (4 rows)
>>
>> After 'node2' crashed, the master will wait for acknowledge from
>> 'node1' and 'node3'.
>> The each status of wal senders are;
>>
>> =# select application_name, sync_state from pg_stat_replication order
>> by application_name;
>> application_name | sync_state
>> ------------------+------------
>> node1            | sync
>> node3            | sync
>> node4            | async
>> (3 rows)
>>
>> [Changing replication method]
>> When we want to change the replication method, we have to change the
>> s_r_method  at first, and then do pg_reload_conf().
>> After changing replication method, we can change the s_s_names.

Thank you for reviewing the patch!
Please find attached the latest patch.

> Mmm. I should be able to be changed at once, because s_r_method
> and s_s_names contradict each other during the intermediate
> state.

Sorry to confuse you. I meant the case where we want to change the
replication method using ALTER SYSTEM.


>> [Expanding replication method]
>> If we want to expand new replication method additionally, we need to
>> implement two functions for each replication method:
>> * int SyncRepGetSynchronousStandbysXXX(int *sync_standbys)
>>   This function obtains the list of standbys considered as synchronous
>> at that time, and return its length.
>> * bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>>   This function obtains LSNs(write, flush) considered as synced.
>>
>> Also, this patch debug code is remain yet, you can debug this behavior
>> using by enable DEBUG_REPLICATION macro.
>>
>> Please give me feedbacks.
>
> I haven't looked into this fully (sorry) but I'm concerned about
> several points.
>
>
> - I feel that some function names looks too long. For example
>   SyncRepGetSynchronousStandbysOnePriority occupies more than the
>   half of a line. (However, the replication code alrady has many
>   long function names..)

Yeah, it would be better to change 'Synchronous' to 'Sync' at least.

> - The comment below of SyncRepGetSynchronousStandbyOnePriority,
>   >       /* Find lowest priority standby */
>
>   The code where the comment is for is doing the correct
>   thing. Howerver, the comment is confusing. A lower priority
>   *value* means a higher priority.

Fixed.

> - SyncRepGetSynchronousStandbys checks all if()s even when the
>   first one matches. Use switch or "else if" there if you they
>   are exclusive each other.

Fixed.

> - Do you intende the DEBUG_REPLICATION code in
>   SyncRepGetSynchronousStandbys*() to be the final shape?  The
>   same code blocks which can work for both method should be in
>   their common caller but SyncRepGetSyncLsns*() are
>   headache. Although it might need more refactoring, I'm sorry
>   but I don't see a desirable shape for now.

I'm not going to keep the DEBUG_REPLICATION code in the final shape.
That code has been removed in this version of the patch.

>   By the way, palloc(20)/free() in such short term looks
>   ineffective.
>
> - SyncRepGetSyncLsnsPriority
>
>   For the comment "/* Find lowest XLogRecPtr of both write and
>   flush from sync_nodes */", LSNs are compared as earlier or later, so
>   the comment would be better as something like "Keep/Collect
>   the earliest write and flush LSNs among prioritized standbys".

Fixed.

>   And what is more important, this block handles the write and flush
>   LSNs jumbled together, and that results in missing the earliest (=
>   most delayed) LSN in certain cases. The following is an example.
>
>    Standby 1:  write LSN = 10, flush LSN = 5
>    Standby 2:  write LSN = 8 , flush LSN = 6
>
>   For this case, we finally get tmp_write = 10 and tmp_flush = 5
>   from the current code, where tmp_write has the wrong value since
>   LSN = 10 has *not* yet been written on standby 2. (The names
>   "tmp_*" don't seem appropriate here.)
>

You are right.
We have to handle the write and flush LSNs individually and take the
lowest of each. For example, in this case we have to get write = 8,
flush = 5. I've changed the logic accordingly.
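To illustrate, here is a minimal sketch of the corrected logic (locking
elided, and names such as num_sync and sync_standbys assumed from the
patch rather than taken verbatim from it):

    /*
     * Track the earliest write and flush LSNs independently, so each
     * reported position is one that every sync standby has reached.
     */
    XLogRecPtr  min_write = InvalidXLogRecPtr;
    XLogRecPtr  min_flush = InvalidXLogRecPtr;
    int         i;

    for (i = 0; i < num_sync; i++)
    {
        volatile WalSnd *walsnd = &WalSndCtl->walsnds[sync_standbys[i]];

        if (XLogRecPtrIsInvalid(min_write) || walsnd->write < min_write)
            min_write = walsnd->write;
        if (XLogRecPtrIsInvalid(min_flush) || walsnd->flush < min_flush)
            min_flush = walsnd->flush;
    }

With the two standbys above this yields write = 8 and flush = 5, as
intended.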

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Thomas Munro
Date:
On Fri, Dec 18, 2015 at 7:38 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[000-_multi_sync_replication_v3.patch]

Hi Masahiko,

I haven't tested this version of the patch but I have some comments on the code.

+/* Is this wal sender considerable one? */
+bool
+SyncRepActiveListedWalSender(int num)

Maybe "Is this wal sender managing a standby that is streaming and
listed as a synchronous standby?"

+/*
+ * Obtain three palloc'd arrays containing position of standbys currently
+ * considered as synchronous, and its length.
+ */
+int
+SyncRepGetSyncStandbys(int *sync_standbys)

This comment seems to be out of date.  I would say "Populate a
caller-supplied array which much have enough space for ...  Returns
...".

+/*
+ * Obtain standby currently considered as synchronous using
+ * '1-priority' method.
+ */
+int
+SyncRepGetSyncStandbysOnePriority(int *sync_standbys)
+ ... code ...

Why do we need a separate function and code path for this case?  If
you used SyncRepGetSyncStandbysPriority with a size of 1, should it
not produce the same result in the same time complexity?

+/*
+ * Obtain standby currently considered as synchronous using
+ * 'priority' method.
+ */
+int
+SyncRepGetSyncStandbysPriority(int *sync_standbys)

I would say something more descriptive, maybe like this: "Populates a
caller-supplied buffer with the walsnds indexes of the highest
priority active synchronous standbys, up to a limit of
'synchronous_standby_num'.  The order of the results is undefined.
Returns the number of results actually written."

If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
above, then this function could be renamed to SyncRepGetSyncStandbys.
I think it would be a tiny bit nicer if it also took a Size n argument
along with the output buffer pointer.

As for the body of that function (which I won't paste here), it
contains an algorithm to find the top K elements in an array of N
elements.  It does that with a linear search through the top K seen so
far for each value in the input array, so its worst case is O(KN)
comparisons.  Some of the sorting gurus on this list might have
something to say about that but my take is that it seems fine for the
tiny values of K and N that we're dealing with here, and it's nice
that it doesn't need any space other than the output buffer, unlike
some other top-K algorithms which would win for larger inputs.
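For reference, here is a self-contained sketch of that scan over plain
arrays (illustrative only, not the patch's code):

    /*
     * Keep the k smallest priority values seen so far (a lower value
     * means a higher priority); once the buffer is full, replace the
     * worst kept entry whenever a candidate beats it.  Worst case is
     * O(k * n) comparisons.
     */
    static int
    top_k_by_priority(const int *prio, int n, int *out, int k)
    {
        int     count = 0;
        int     i, j, worst;

        for (i = 0; i < n; i++)
        {
            if (count < k)
            {
                out[count++] = i;   /* buffer not full yet: just take it */
                continue;
            }

            /* find the worst (largest priority value) kept entry */
            worst = 0;
            for (j = 1; j < k; j++)
                if (prio[out[j]] > prio[out[worst]])
                    worst = j;

            if (prio[i] < prio[out[worst]])
                out[worst] = i;     /* candidate beats it: replace */
        }
        return count;
    }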

+ /* Found sync standby */

This comment would be clearer as "Found lowest priority standby, so replace it".

+ if (walsndloc->sync_standby_priority == priority &&
+ walsnd->sync_standby_priority < priority)
+ sync_standbys[j] = i;

In this case, couldn't you also update 'priority' directly, and break
out of the loop immediately?  Wouldn't "lowest_priority" be a better
variable name than "priority"?  It might be good to say "lowest"
rather than "highest" in the nearby comments, to be consistent with
other parts of the code including the function name (lower priority
number means higher priority!).

+/*
+ * Obtain currently synced LSN: write and flush,
+ * using '1-prioirty' method.

s/prioirty/priority/

+ */
+bool
+SyncRepGetSyncLsnsOnePriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)

Similar to the earlier case, why have a special case for 1-priority?
Wouldn't SyncRepGetSyncLsnsPriority produce the same result when
synchronous_standby_num == 1?

+/*
+ * Obtain currently synced LSN: write and flush,
+ * using 'prioirty' method.

s/prioirty/priority/

+SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
+{
+ int *sync_standbys = NULL;
+ int num_sync;
+ int i;
+ XLogRecPtr synced_write = InvalidXLogRecPtr;
+ XLogRecPtr synced_flush = InvalidXLogRecPtr;
+
+ sync_standbys = (int *) palloc(sizeof(int) * synchronous_standby_num);

Would a fixed size buffer on the stack (of compile time constant size)
be better than palloc/free in here and elsewhere?

+ /*
+ for (i = 0; i < num_sync; i++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
+ if (walsndloc == MyWalSnd)
+ {
+ found = true;
+ break;
+ }
+ }
+ */

Dead code.

+ if (synchronous_replication_method == SYNC_REP_METHOD_1_PRIORITY)
+ synchronous_standby_num = 1;
+ else
+ synchronous_standby_num = pg_atoi(lfirst(list_head(elemlist)),
sizeof(int), 0);

Should we detect if synchronous_standby_num > the number of listed
servers, which would be a nonsensical configuration?  Should we also
impose some other kind of constant limits, like must be >= 0 (I
haven't tried but I wonder if -1 leads to very large palloc) and must
be <= MAX_XXX (smallish sanity check number like 256, rather than the
INT_MAX limit imposed by pg_atoi), so that we could use that constant
to size stack buffers in the places where you currently palloc?
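As a sketch, those checks could look something like this (the constant
and the list handling are assumptions, not the patch's code):

    #define SYNC_REP_MAX_SYNC_STANDBY_NUM   256     /* illustrative cap */

    /* must be a sane count ... */
    if (synchronous_standby_num < 1 ||
        synchronous_standby_num > SYNC_REP_MAX_SYNC_STANDBY_NUM)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("number of synchronous standbys must be between 1 and %d",
                        SYNC_REP_MAX_SYNC_STANDBY_NUM)));

    /* ... and must not exceed the listed names (the first list element
     * being the count itself in this patch's format) */
    if (synchronous_standby_num > list_length(elemlist) - 1)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("number of synchronous standbys exceeds the length of the standby list")));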

Could 1-priority mode be inferred from the use of a non-number in the
leading position, and if so, does the mode concept even need to exist,
especially if SyncRepGetSyncLsnsOnePriority and
SyncRepGetSyncStandbysOnePriority aren't really needed either way?  Is
there any difference in behaviour between the following
configurations?  (Sorry if that particular question has already been
duked out in the long thread about GUCs.)

synchronous_replication_method = 1-priority
synchronous_standby_names = foo, bar

synchronous_replication_method = priority
synchronous_standby_names = 1, foo, bar

(Apologies for the missing leading whitespace in patch fragments
pasted above, it seems that my mail client has eaten it).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Support for N synchronous standby servers - take 2

From
Thomas Munro
Date:
On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Dec 18, 2015 at 7:38 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [000-_multi_sync_replication_v3.patch]
>
> Hi Masahiko,
>
> I haven't tested this version of the patch but I have some comments on the code.
>
> +/* Is this wal sender considerable one? */
> +bool
> +SyncRepActiveListedWalSender(int num)
>
> Maybe "Is this wal sender managing a standby that is streaming and
> listed as a synchronous standby?"
>
> +/*
> + * Obtain three palloc'd arrays containing position of standbys currently
> + * considered as synchronous, and its length.
> + */
> +int
> +SyncRepGetSyncStandbys(int *sync_standbys)
>
> This comment seems to be out of date.  I would say "Populate a
> caller-supplied array which much have enough space for ...  Returns
> ...".
>
> +/*
> + * Obtain standby currently considered as synchronous using
> + * '1-priority' method.
> + */
> +int
> +SyncRepGetSyncStandbysOnePriority(int *sync_standbys)
> + ... code ...
>
> Why do we need a separate function and code path for this case?  If
> you used SyncRepGetSyncStandbysPriority with a size of 1, should it
> not produce the same result in the same time complexity?
>
> +/*
> + * Obtain standby currently considered as synchronous using
> + * 'priority' method.
> + */
> +int
> +SyncRepGetSyncStandbysPriority(int *sync_standbys)
>
> I would say something more descriptive, maybe like this: "Populates a
> caller-supplied buffer with the walsnds indexes of the highest
> priority active synchronous standbys, up to the a limit of
> 'synchronous_standby_num'.  The order of the results is undefined.
> Returns the number of results  actually written."
>
> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
> above, then this function could be renamed to SyncRepGetSyncStandbys.
> I think it would be a tiny bit nicer if it also took a Size n argument
> along with the output buffer pointer.
>
> As for the body of that function (which I won't paste here), it
> contains an algorithm to find the top K elements in an array of N
> elements.  It does that with a linear search through the top K seen so
> far for each value in the input array, so its worst case is O(KN)
> comparisons.  Some of the sorting gurus on this list might have
> something to say about that but my take is that it seems fine for the
> tiny values of K and N that we're dealing with here, and it's nice
> that it doesn't need any space other than the output buffer, unlike
> some other top-K algorithms which would win for larger inputs.
>
> + /* Found sync standby */
>
> This comment would be clearer as "Found lowest priority standby, so replace it".
>
> + if (walsndloc->sync_standby_priority == priority &&
> + walsnd->sync_standby_priority < priority)
> + sync_standbys[j] = i;
>
> In this case, couldn't you also update 'priority' directly, and break
> out of the loop immediately?

Oops, I didn't think that through: you can't break from the loop, you
still need to find the new lowest priority, so I retract that bit.

> Wouldn't "lowest_priority" be a better
> variable name than "priority"?  It might be good to say "lowest"
> rather than "highest" in the nearby comments, to be consistent with
> other parts of the code including the function name (lower priority
> number means higher priority!).
>
> +/*
> + * Obtain currently synced LSN: write and flush,
> + * using '1-prioirty' method.
>
> s/prioirty/priority/
>
> + */
> +bool
> +SyncRepGetSyncLsnsOnePriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>
> Similar to the earlier case, why have a special case for 1-priority?
> Wouldn't SyncRepGetSyncLsnsPriority produce the same result when is
> synchronous_standby_num == 1?
>
> +/*
> + * Obtain currently synced LSN: write and flush,
> + * using 'prioirty' method.
>
> s/prioirty/priority/
>
> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
> +{
> + int *sync_standbys = NULL;
> + int num_sync;
> + int i;
> + XLogRecPtr synced_write = InvalidXLogRecPtr;
> + XLogRecPtr synced_flush = InvalidXLogRecPtr;
> +
> + sync_standbys = (int *) palloc(sizeof(int) * synchronous_standby_num);
>
> Would a fixed size buffer on the stack (of compile time constant size)
> be better than palloc/free in here and elsewhere?
>
> + /*
> + for (i = 0; i < num_sync; i++)
> + {
> + volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
> + if (walsndloc == MyWalSnd)
> + {
> + found = true;
> + break;
> + }
> + }
> + */
>
> Dead code.
>
> + if (synchronous_replication_method == SYNC_REP_METHOD_1_PRIORITY)
> + synchronous_standby_num = 1;
> + else
> + synchronous_standby_num = pg_atoi(lfirst(list_head(elemlist)),
> sizeof(int), 0);
>
> Should we detect if synchronous_standby_num > the number of listed
> servers, which would be a nonsensical configuration?  Should we also
> impose some other kind of constant limits, like must be >= 0 (I
> haven't tried but I wonder if -1 leads to very large palloc) and must
> be <= MAX_XXX (smallish sanity check number like 256, rather than the
> INT_MAX limit imposed by pg_atoi), so that we could use that constant
> to size stack buffers in the places where you currently palloc?
>
> Could 1-priority mode be inferred from the use of a non-number in the
> leading position, and if so, does the mode concept even need to exist,
> especially if SyncRepGetSyncLsnsOnePriority and
> SyncRepGetSyncStandbysOnePriority aren't really needed either way?  Is
> there any difference in behaviour between the following
> configurations?  (Sorry if that particular question has already been
> duked out in the long thread about GUCs.)
>
> synchronous_replication_method = 1-priority
> synchronous_standby_names = foo, bar
>
> synchronous_replication_method = priority
> synchronous_standby_names = 1, foo, bar
>
> (Apologies for the missing leading whitespace in patch fragments
> pasted above, it seems that my mail client has eaten it).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Dec 23, 2015 at 12:15 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Review stuff

I have moved this entry to next CF as review is quite recent.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Fri, Dec 18, 2015 at 7:38 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> [000-_multi_sync_replication_v3.patch]
>>
>> Hi Masahiko,
>>
>> I haven't tested this version of the patch but I have some comments on the code.
>>
>> +/* Is this wal sender considerable one? */
>> +bool
>> +SyncRepActiveListedWalSender(int num)
>>
>> Maybe "Is this wal sender managing a standby that is streaming and
>> listed as a synchronous standby?"

Fixed.

>> +/*
>> + * Obtain three palloc'd arrays containing position of standbys currently
>> + * considered as synchronous, and its length.
>> + */
>> +int
>> +SyncRepGetSyncStandbys(int *sync_standbys)
>>
>> This comment seems to be out of date.  I would say "Populate a
>> caller-supplied array which much have enough space for ...  Returns
>> ...".

Fixed.

>> +/*
>> + * Obtain standby currently considered as synchronous using
>> + * '1-priority' method.
>> + */
>> +int
>> +SyncRepGetSyncStandbysOnePriority(int *sync_standbys)
>> + ... code ...
>>
>> Why do we need a separate function and code path for this case?  If
>> you used SyncRepGetSyncStandbysPriority with a size of 1, should it
>> not produce the same result in the same time complexity?

I was thinking that we could add a new function like
SyncRepGetSyncStandbysXXXXX (where XXXXX is the replication method
name) if we want to extend the kinds of replication method.
So I included the replication method name in the function name.
But one function is enough for the two replication methods:
priority and 1-priority.

>> +/*
>> + * Obtain standby currently considered as synchronous using
>> + * 'priority' method.
>> + */
>> +int
>> +SyncRepGetSyncStandbysPriority(int *sync_standbys)
>>
>> I would say something more descriptive, maybe like this: "Populates a
>> caller-supplied buffer with the walsnds indexes of the highest
>> priority active synchronous standbys, up to the a limit of
>> 'synchronous_standby_num'.  The order of the results is undefined.
>> Returns the number of results  actually written."

Fixed.

>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
>> above, then this function could be renamed to SyncRepGetSyncStandbys.
>> I think it would be a tiny bit nicer if it also took a Size n argument
>> along with the output buffer pointer.

Sorry, I could not get your point. SyncRepGetSyncStandbysPriority()
function uses synchronous_standby_num which is global variable.
But you mean that the number of synchronous standbys is given as
function argument?

>> As for the body of that function (which I won't paste here), it
>> contains an algorithm to find the top K elements in an array of N
>> elements.  It does that with a linear search through the top K seen so
>> far for each value in the input array, so its worst case is O(KN)
>> comparisons.  Some of the sorting gurus on this list might have
>> something to say about that but my take is that it seems fine for the
>> tiny values of K and N that we're dealing with here, and it's nice
>> that it doesn't need any space other than the output buffer, unlike
>> some other top-K algorithms which would win for larger inputs.

Yeah, that's a point for improvement.
But I assumed that the number of synchronous standbys is not
large, so I used this algorithm as a first version.
And I think that its worst case is O(K(N-K)): the first K walsenders
only fill the buffer, and each of the remaining N-K scans at most K
kept entries. Am I missing something?

>> + /* Found sync standby */
>>
>> This comment would be clearer as "Found lowest priority standby, so replace it".

Fixed.

>> + if (walsndloc->sync_standby_priority == priority &&
>> + walsnd->sync_standby_priority < priority)
>> + sync_standbys[j] = i;
>>
>> In this case, couldn't you also update 'priority' directly, and break
>> out of the loop immediately?
>
> Oops, I didn't think that through: you can't break from the loop, you
> still need to find the new lowest priority, so I retract that bit.
>
>> Wouldn't "lowest_priority" be a better
>> variable name than "priority"?  It might be good to say "lowest"
>> rather than "highest" in the nearby comments, to be consistent with
>> other parts of the code including the function name (lower priority
>> number means higher priority!).
>>
>> +/*
>> + * Obtain currently synced LSN: write and flush,
>> + * using '1-prioirty' method.
>>
>> s/prioirty/priority/
>>
>> + */
>> +bool
>> +SyncRepGetSyncLsnsOnePriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>>
>> Similar to the earlier case, why have a special case for 1-priority?
>> Wouldn't SyncRepGetSyncLsnsPriority produce the same result when is
>> synchronous_standby_num == 1?
>>
>> +/*
>> + * Obtain currently synced LSN: write and flush,
>> + * using 'prioirty' method.
>>
>> s/prioirty/priority/
>>
>> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>> +{
>> + int *sync_standbys = NULL;
>> + int num_sync;
>> + int i;
>> + XLogRecPtr synced_write = InvalidXLogRecPtr;
>> + XLogRecPtr synced_flush = InvalidXLogRecPtr;
>> +
>> + sync_standbys = (int *) palloc(sizeof(int) * synchronous_standby_num);
>>
>> Would a fixed size buffer on the stack (of compile time constant size)
>> be better than palloc/free in here and elsewhere?
>>
>> + /*
>> + for (i = 0; i < num_sync; i++)
>> + {
>> + volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
>> + if (walsndloc == MyWalSnd)
>> + {
>> + found = true;
>> + break;
>> + }
>> + }
>> + */
>>
>> Dead code.
>>
>> + if (synchronous_replication_method == SYNC_REP_METHOD_1_PRIORITY)
>> + synchronous_standby_num = 1;
>> + else
>> + synchronous_standby_num = pg_atoi(lfirst(list_head(elemlist)),
>> sizeof(int), 0);

Fixed.

>> Should we detect if synchronous_standby_num > the number of listed
>> servers, which would be a nonsensical configuration?  Should we also
>> impose some other kind of constant limits, like must be >= 0 (I
>> haven't tried but I wonder if -1 leads to very large palloc) and must
>> be <= MAX_XXX (smallish sanity check number like 256, rather than the
>> INT_MAX limit imposed by pg_atoi), so that we could use that constant
>> to size stack buffers in the places where you currently palloc?

Yeah, I added a validation check for s_s_num.

>> Could 1-priority mode be inferred from the use of a non-number in the
>> leading position, and if so, does the mode concept even need to exist,
>> especially if SyncRepGetSyncLsnsOnePriority and
>> SyncRepGetSyncStandbysOnePriority aren't really needed either way?  Is
>> there any difference in behaviour between the following
>> configurations?  (Sorry if that particular question has already been
>> duked out in the long thread about GUCs.)
>>
>> synchronous_replication_method = 1-priority
>> synchronous_standby_names = foo, bar
>>
>> synchronous_replication_method = priority
>> synchronous_standby_names = 1, foo, bar

The behaviour under both configurations is the same.
I added the '1-priority' method for backward compatibility. The default
value of s_r_method is '1-priority', so users who are using sync
replication can continue to use it smoothly after upgrading.

>> (Apologies for the missing leading whitespace in patch fragments
>> pasted above, it seems that my mail client has eaten it).

No problem. Thank you for reviewing!

> I have moved this entry to next CF as review is quite recent.
Thanks!

Attached latest version patch.
Please review it.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Thomas Munro
Date:
On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
>>> above, then this function could be renamed to SyncRepGetSyncStandbys.
>>> I think it would be a tiny bit nicer if it also took a Size n argument
>>> along with the output buffer pointer.
>
> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority()
> function uses synchronous_standby_num which is global variable.
> But you mean that the number of synchronous standbys is given as
> function argument?

Yeah, I was thinking of it as the output buffer size which I would be
inclined to make more explicit (I am still coming to terms with the
use of global variables in Postgres) but it doesn't matter, please
disregard that suggestion.

>>> As for the body of that function (which I won't paste here), it
>>> contains an algorithm to find the top K elements in an array of N
>>> elements.  It does that with a linear search through the top K seen so
>>> far for each value in the input array, so its worst case is O(KN)
>>> comparisons.  Some of the sorting gurus on this list might have
>>> something to say about that but my take is that it seems fine for the
>>> tiny values of K and N that we're dealing with here, and it's nice
>>> that it doesn't need any space other than the output buffer, unlike
>>> some other top-K algorithms which would win for larger inputs.
>
> Yeah, it's improvement point.
> But I'm assumed that the number of synchronous replication is not
> large, so I use this algorithm as first version.
> And I think that its worst case is O(K(N-K)). Am I missing something?

You're right, I was dropping that detail, in the tradition of the
hand-wavy school of big-O notation.  (I suppose you could skip the
inner loop when the priority is lower than the current lowest
priority, giving a O(N) best case when the walsenders are perfectly
ordered by coincidence.  Probably a bad idea or just not worth
worrying about.)

> Attached latest version patch.

+/*
+ * Obtain currently synced LSN location: write and flush, using priority
- * In 9.1 we support only a single synchronous standby, chosen from a
- * priority list of synchronous_standby_names. Before it can become the
+ * In 9.6 we support multiple synchronous standby, chosen from a priority

s/standby/standbys/

+ * list of synchronous_standby_names. Before it can become the

s/Before it can become the/Before any standby can become a/
 * synchronous standby it must have caught up with the primary; that may
 * take some time. Once caught up, the current highest priority standby

s/standby/standbys/
 * will release waiters from the queue.

+bool
+SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
+{
+ int sync_standbys[synchronous_standby_num];

I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM].
(Variable sized arrays are a feature of C99 and PostgreSQL is written
in C89.)
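
In other words, a C89-safe declaration would be (using the constant name
suggested above):

    /* C99 VLA, not allowed:  int sync_standbys[synchronous_standby_num]; */
    int     sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM];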

+/*
+ * Populate a caller-supplied array which much have enough space for
+ * synchronous_standby_num. Returns position of standbys currently
+ * considered as synchronous, and its length.
+ */
+int
+SyncRepGetSyncStandbys(int *sync_standbys)

s/much/must/ (my bad, in previous email).

+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("The number of synchronous standbys must be smaller than the
number of listed : %d",
+ synchronous_standby_num)));

How about "the number of synchronous standbys exceeds the length of
the standby list: %d"?  Error messages usually start with lower case,
':' is not usually preceded by a space.

+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("The number of synchronous standbys must be between 1 and %d : %d",

s/The/the/, s/ : /: /

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Fri, Dec 25, 2015 at 7:21 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro
>>> <thomas.munro@enterprisedb.com> wrote:
>>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
>>>> above, then this function could be renamed to SyncRepGetSyncStandbys.
>>>> I think it would be a tiny bit nicer if it also took a Size n argument
>>>> along with the output buffer pointer.
>>
>> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority()
>> function uses synchronous_standby_num which is global variable.
>> But you mean that the number of synchronous standbys is given as
>> function argument?
>
> Yeah, I was thinking of it as the output buffer size which I would be
> inclined to make more explicit (I am still coming to terms with the
> use of global variables in Postgres) but it doesn't matter, please
> disregard that suggestion.
>
>>>> As for the body of that function (which I won't paste here), it
>>>> contains an algorithm to find the top K elements in an array of N
>>>> elements.  It does that with a linear search through the top K seen so
>>>> far for each value in the input array, so its worst case is O(KN)
>>>> comparisons.  Some of the sorting gurus on this list might have
>>>> something to say about that but my take is that it seems fine for the
>>>> tiny values of K and N that we're dealing with here, and it's nice
>>>> that it doesn't need any space other than the output buffer, unlike
>>>> some other top-K algorithms which would win for larger inputs.
>>
>> Yeah, it's improvement point.
>> But I'm assumed that the number of synchronous replication is not
>> large, so I use this algorithm as first version.
>> And I think that its worst case is O(K(N-K)). Am I missing something?
>
> You're right, I was dropping that detail, in the tradition of the
> hand-wavy school of big-O notation.  (I suppose you could skip the
> inner loop when the priority is lower than the current lowest
> priority, giving a O(N) best case when the walsenders are perfectly
> ordered by coincidence.  Probably a bad idea or just not worth
> worrying about.)

Thank you for reviewing the patch.
Yeah, I added logic that skips the inner loop.
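In terms of the top-K scan discussed upthread, the skip amounts to one
extra test before the inner loop (worst_prio is an illustrative cached
copy of the largest priority value currently kept, refreshed on every
replacement; it is not the patch's actual variable):

    /*
     * Fast path: a candidate that cannot beat the worst kept entry can
     * never replace anything, so skip the inner scan.  With walsenders
     * already ordered by priority this gives the O(N) best case.
     */
    if (count == k && prio[i] >= worst_prio)
        continue;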

>
>> Attached latest version patch.
>
> +/*
> + * Obtain currently synced LSN location: write and flush, using priority
> - * In 9.1 we support only a single synchronous standby, chosen from a
> - * priority list of synchronous_standby_names. Before it can become the
> + * In 9.6 we support multiple synchronous standby, chosen from a priority
>
> s/standby/standbys/
>
> + * list of synchronous_standby_names. Before it can become the
>
> s/Before it can become the/Before any standby can become a/
>
>   * synchronous standby it must have caught up with the primary; that may
>   * take some time. Once caught up, the current highest priority standby
>
> s/standby/standbys/
>
>   * will release waiters from the queue.
>
> +bool
> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
> +{
> + int sync_standbys[synchronous_standby_num];
>
> I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM].
> (Variable sized arrays are a feature of C99 and PostgreSQL is written
> in C89.)
>
> +/*
> + * Populate a caller-supplied array which much have enough space for
> + * synchronous_standby_num. Returns position of standbys currently
> + * considered as synchronous, and its length.
> + */
> +int
> +SyncRepGetSyncStandbys(int *sync_standbys)
>
> s/much/must/ (my bad, in previous email).
>
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("The number of synchronous standbys must be smaller than the
> number of listed : %d",
> + synchronous_standby_num)));
>
> How about "the number of synchronous standbys exceeds the length of
> the standby list: %d"?  Error messages usually start with lower case,
> ':' is not usually preceded by a space.
>
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("The number of synchronous standbys must be between 1 and %d : %d",
>
> s/The/the/, s/ : /: /

Fixed the issues you mentioned.

Attached latest v5 patch.
Please review it.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Sun, Jan 3, 2016 at 10:26 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Dec 25, 2015 at 7:21 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro
>>> <thomas.munro@enterprisedb.com> wrote:
>>>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro
>>>> <thomas.munro@enterprisedb.com> wrote:
>>>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
>>>>> above, then this function could be renamed to SyncRepGetSyncStandbys.
>>>>> I think it would be a tiny bit nicer if it also took a Size n argument
>>>>> along with the output buffer pointer.
>>>
>>> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority()
>>> function uses synchronous_standby_num which is global variable.
>>> But you mean that the number of synchronous standbys is given as
>>> function argument?
>>
>> Yeah, I was thinking of it as the output buffer size which I would be
>> inclined to make more explicit (I am still coming to terms with the
>> use of global variables in Postgres) but it doesn't matter, please
>> disregard that suggestion.
>>
>>>>> As for the body of that function (which I won't paste here), it
>>>>> contains an algorithm to find the top K elements in an array of N
>>>>> elements.  It does that with a linear search through the top K seen so
>>>>> far for each value in the input array, so its worst case is O(KN)
>>>>> comparisons.  Some of the sorting gurus on this list might have
>>>>> something to say about that but my take is that it seems fine for the
>>>>> tiny values of K and N that we're dealing with here, and it's nice
>>>>> that it doesn't need any space other than the output buffer, unlike
>>>>> some other top-K algorithms which would win for larger inputs.
>>>
>>> Yeah, it's improvement point.
>>> But I'm assumed that the number of synchronous replication is not
>>> large, so I use this algorithm as first version.
>>> And I think that its worst case is O(K(N-K)). Am I missing something?
>>
>> You're right, I was dropping that detail, in the tradition of the
>> hand-wavy school of big-O notation.  (I suppose you could skip the
>> inner loop when the priority is lower than the current lowest
>> priority, giving a O(N) best case when the walsenders are perfectly
>> ordered by coincidence.  Probably a bad idea or just not worth
>> worrying about.)
>
> Thank you for reviewing the patch.
> Yeah, I added the logic that skip the inner loop.
>
>>
>>> Attached latest version patch.
>>
>> +/*
>> + * Obtain currently synced LSN location: write and flush, using priority
>> - * In 9.1 we support only a single synchronous standby, chosen from a
>> - * priority list of synchronous_standby_names. Before it can become the
>> + * In 9.6 we support multiple synchronous standby, chosen from a priority
>>
>> s/standby/standbys/
>>
>> + * list of synchronous_standby_names. Before it can become the
>>
>> s/Before it can become the/Before any standby can become a/
>>
>>   * synchronous standby it must have caught up with the primary; that may
>>   * take some time. Once caught up, the current highest priority standby
>>
>> s/standby/standbys/
>>
>>   * will release waiters from the queue.
>>
>> +bool
>> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>> +{
>> + int sync_standbys[synchronous_standby_num];
>>
>> I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM].
>> (Variable sized arrays are a feature of C99 and PostgreSQL is written
>> in C89.)
>>
>> +/*
>> + * Populate a caller-supplied array which much have enough space for
>> + * synchronous_standby_num. Returns position of standbys currently
>> + * considered as synchronous, and its length.
>> + */
>> +int
>> +SyncRepGetSyncStandbys(int *sync_standbys)
>>
>> s/much/must/ (my bad, in previous email).
>>
>> + ereport(ERROR,
>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> + errmsg("The number of synchronous standbys must be smaller than the
>> number of listed : %d",
>> + synchronous_standby_num)));
>>
>> How about "the number of synchronous standbys exceeds the length of
>> the standby list: %d"?  Error messages usually start with lower case,
>> ':' is not usually preceded by a space.
>>
>> + ereport(ERROR,
>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> + errmsg("The number of synchronous standbys must be between 1 and %d : %d",
>>
>> s/The/the/, s/ : /: /
>
> Fixed you mentioned.
>
> Attached latest v5 patch.
> Please review it.

Something that I find rather scary with this patch: could it be
possible to get actual regression tests now that there is more
machinery with PostgresNode.pm? As the syncrep code paths get more and
more complex, so do debugging and maintenance.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Mon, 4 Jan 2016 15:29:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqTp5RoHxcp8YxejGMjRjjtLaXCa8=-BEr7ZnBNbPzPdWA@mail.gmail.com>
> > Attached latest v5 patch.
> > Please review it.
> 
> Something that I find rather scary with this patch: could it be
> possible to get actual regression tests now that there is more
> machinery with PostgresNode.pm? As syncrep code paths get more and
> more complex, so are debugging and maintenance.

A test of the whole replication system will very likely be
too complex and hard to stabilize, and would be
disproportionately large compared to the other tests.

This patch mainly changes the logic that chooses the next syncrep
standbys and calculates the 'synced' LSNs. Performing separate
module tests for that logic, and then testing the behavior that
follows from its result with, perhaps, PostgresNode.pm, would
remarkably reduce the labor of testing.

Could we have some tapping point for testing that logic
individually in an appropriate way?

In order to do so, the logic should be able to be fed an arbitrary,
complete set of parameters; in other words, we would define a kind of
API for using the logic from the core side, even though it is not
an extension. Then we would *somehow* kick the API with some set
of parameters in the regression tests.
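
Purely as a sketch of what such a tapping point might look like (all of
these names are hypothetical):

    /*
     * A pure function that the regression tests could drive directly,
     * fed an arbitrary parameter set instead of reading shared memory.
     */
    typedef struct StandbyState
    {
        int         priority;   /* position in synchronous_standby_names */
        bool        is_active;  /* streaming and listed as synchronous */
        XLogRecPtr  write;
        XLogRecPtr  flush;
    } StandbyState;

    extern int SyncRepChooseSyncSet(const StandbyState *standbys, int n,
                                    int *chosen, int max_chosen);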


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Fri, Jan 8, 2016 at 1:53 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Mon, 4 Jan 2016 15:29:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqTp5RoHxcp8YxejGMjRjjtLaXCa8=-BEr7ZnBNbPzPdWA@mail.gmail.com>
>> > Attached latest v5 patch.
>> > Please review it.
>>
>> Something that I find rather scary with this patch: could it be
>> possible to get actual regression tests now that there is more
>> machinery with PostgresNode.pm? As syncrep code paths get more and
>> more complex, so are debugging and maintenance.
>
> The test on the whole replication system will very likely to be
> too complex and hard to stabilize, and would be
> disproportionately large to other tests.

I don't buy that much. Mind you, there is in this commit fest a patch
introducing a basic regression test suite for recovery using the new
infrastructure that has been committed last month. You may want to
look at it.

> This patch mainly changes the logic to choose the next syncrep
> standbys and calculate the 'synched' LSNs, so performing separate
> module tests for the logics, then perform the test for the
> behavior according to the result of that by, perhaps,
> PostgresNode.pm would remarkably reduce the labor for
> testing.
> Could we have some tapping point for individual testing of the
> logics in appropriate way?

Isn't pg_stat_replication enough for this purpose? What you basically
need to do is set up a master, a set of slaves and then look at the
WAL sender status. Am I getting that wrong?

> In order to do so, the logics should be able to be fed arbitrary
> complete set of parameters, in other words, defining a kind of
> API to use the logics from the core side, even though it is not
> an extension. Then we will *somehow* kick the API with some set
> of parameters in regest.

Well, you will need to craft, in the syncrep test suite associated with
this patch, a set of routines that allows setting up appropriately
s_s_names and the other parameters that this patch introduces. It does
not sound like a barrier impossible to cross.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Alvaro Herrera
Date:
Michael Paquier wrote:
> On Fri, Jan 8, 2016 at 1:53 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Hello,
> >
> > At Mon, 4 Jan 2016 15:29:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqTp5RoHxcp8YxejGMjRjjtLaXCa8=-BEr7ZnBNbPzPdWA@mail.gmail.com>
> >>
> >> Something that I find rather scary with this patch: could it be
> >> possible to get actual regression tests now that there is more
> >> machinery with PostgresNode.pm? As syncrep code paths get more and
> >> more complex, so are debugging and maintenance.
> >
> > The test on the whole replication system will very likely to be
> > too complex and hard to stabilize, and would be
> > disproportionately large to other tests.
> 
> I don't buy that much. Mind you, there is in this commit fest a patch
> introducing a basic regression test suite for recovery using the new
> infrastructure that has been committed last month. You may want to
> look at it.

Kyotaro, please have a look at this patch:
https://commitfest.postgresql.org/8/438/
which is the recovery test framework Michael is talking about.  Is it
possible to use that framework to write tests for this feature?  If so,
then my preferred course of action would be to commit that patch and
then introduce in this patch some additional tests for the N-sync-standby
feature.  Can you please have a look at the test framework patch and
provide your feedback on how usable it is for this?

Thanks,

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Wed, Jan 13, 2016 at 1:54 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>> On Fri, Jan 8, 2016 at 1:53 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > Hello,
>> >
>> > At Mon, 4 Jan 2016 15:29:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqTp5RoHxcp8YxejGMjRjjtLaXCa8=-BEr7ZnBNbPzPdWA@mail.gmail.com>
>> >>
>> >> Something that I find rather scary with this patch: could it be
>> >> possible to get actual regression tests now that there is more
>> >> machinery with PostgresNode.pm? As syncrep code paths get more and
>> >> more complex, so are debugging and maintenance.
>> >
>> > The test on the whole replication system will very likely to be
>> > too complex and hard to stabilize, and would be
>> > disproportionately large to other tests.
>>
>> I don't buy that much. Mind you, there is in this commit fest a patch
>> introducing a basic regression test suite for recovery using the new
>> infrastructure that has been committed last month. You may want to
>> look at it.
>
> Kyotaro, please have a look at this patch:
> https://commitfest.postgresql.org/8/438/
> which is the recovery test framework Michael is talking about.  Is it
> possible to use that framework to write tests for this feature?  If so,
> then my preferred course of action would be to commit that patch and
> then introduce in this patch some additional tests for the N-sync-standby
> feature.  Can you please have a look at the test framework patch and
> provide your feedback on how usable it is for this?
>

I had a look at that patch.
I'm planning to have at least the following tests for multiple
synchronous replication.

* Confirm value of pg_stat_replication.sync_state (sync, async or potential)
* Confirm that the data is synchronously replicated to multiple
standbys in some cases.
  * case 1 : The standby which is not listed in s_s_names, is down
  * case 2 : The standby which is listed in s_s_names but is a potential
standby, is down
  * case 3 : The standby which is considered as a sync standby, is down.
* Standby promotion

In order to confirm that the commit is never completed in case #3
unless a new sync standby comes up, I think we need a framework that
cancels an executing query.
That is, what I'm planning is:

1. Set up master server (s_s_names = '2, standby1, standby2')
2. Set up two standby servers
3. Standby1 is down
4. Create some content on master (but the transaction is not committed)
5. Cancel the #4 query. (Also confirm that the flush location of only
standby2 makes progress)

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Sun, Jan 17, 2016 at 11:09 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Jan 13, 2016 at 1:54 AM, Alvaro Herrera wrote:
> * Confirm value of pg_stat_replication.sync_state (sync, async or potential)
> * Confirm that the data is synchronously replicated to multiple
> standbys in same cases.
>   * case 1 : The standby which is not listed in s_s_name, is down
>   * case 2 : The standby which is listed in s_s_names but potential
> standby, is down
>   * case 3 : The standby which is considered as sync standby, is down.
> * Standby promotion
>
> In order to confirm that the commit isn't done in case #3 forever
> unless new sync standby is up, I think we need the framework that
> cancels executing query.
> That is, what I'm planning is,
> 1. Set up master server (s_s_name = '2, standby1, standby2)
> 2. Set up two standby servers
> 3. Standby1 is down
> 4. Create some contents on master (But transaction is not committed)
> 5. Cancel the #4 query. (Also confirm that the flush location of only
> standby2 makes progress)

This will need some thinking and is not as easy as it sounds. There is
no way to hold on to a connection after executing a query in the current
TAP infrastructure. You are just mentioning case 3, but actually cases
1 and 2 fall into the same need: if there is a failure we want
to be able not to get stuck in the test forever, and to have a way to
cancel a query execution at will. TAP uses psql -c to execute any sql
queries, but we would need something that is far lower-level, and that
would be basically using the perl driver for Postgres or an equivalent
here.
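
For illustration, the lower-level control needed is roughly what libpq
already exposes (a sketch of the mechanism only, not a proposal to write
the tests in C; the table name and query are made up):

    #include <libpq-fe.h>

    /*
     * Start a query without blocking on its completion, then cancel it
     * at will -- the kind of control "psql -c" cannot give the TAP tests.
     */
    static void
    start_and_cancel(PGconn *conn)
    {
        PGcancel   *cancel;
        char        errbuf[256];

        PQsendQuery(conn, "INSERT INTO t VALUES (1)");  /* returns at once */

        /* ... observe that the commit will never be acknowledged ... */

        cancel = PQgetCancel(conn);
        if (cancel != NULL)
        {
            PQcancel(cancel, errbuf, sizeof(errbuf));
            PQfreeCancel(cancel);
        }
    }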

Honestly, for those tests I just thought that we could get to something
reliable by just looking at how each sync replication setup is reflected
in pg_stat_replication, as the flow is really getting complicated;
giving the user a clear representation at the SQL level of what is
actually occurring in the server, depending on the configuration used,
is what matters here.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Mon, Jan 18, 2016 at 1:20 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sun, Jan 17, 2016 at 11:09 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Wed, Jan 13, 2016 at 1:54 AM, Alvaro Herrera wrote:
>> * Confirm value of pg_stat_replication.sync_state (sync, async or potential)
>> * Confirm that the data is synchronously replicated to multiple
>> standbys in same cases.
>>   * case 1 : The standby which is not listed in s_s_name, is down
>>   * case 2 : The standby which is listed in s_s_names but potential
>> standby, is down
>>   * case 3 : The standby which is considered as sync standby, is down.
>> * Standby promotion
>>
>> In order to confirm that the commit isn't done in case #3 forever
>> unless new sync standby is up, I think we need the framework that
>> cancels executing query.
>> That is, what I'm planning is,
>> 1. Set up master server (s_s_name = '2, standby1, standby2)
>> 2. Set up two standby servers
>> 3. Standby1 is down
>> 4. Create some contents on master (But transaction is not committed)
>> 5. Cancel the #4 query. (Also confirm that the flush location of only
>> standby2 makes progress)
>
> This will need some thinking and is not as easy as it sounds. There is
> no way to hold on a connection after executing a query in the current
> TAP infrastructure. You are just mentioning case 3, but actually cases
> 1 and 2 are falling into the same need: if there is a failure we want
> to be able to not be stuck in the test forever and have a way to
> cancel a query execution at will. TAP uses psql -c to execute any sql
> queries, but we would need something that is far lower-level, and that
> would be basically using the perl driver for Postgres or an equivalent
> here.
>
> Honestly for those tests I just thought that we could get to something
> reliable by just looking at how each sync replication setup reflects
> in pg_stat_replication as the flow is really getting complicated,
> giving to the user a clear representation at SQL level of what is
> actually occurring in the server depending on the configuration used
> being important here.

I see.
We could check the transition of sync_state in pg_stat_replication.
I think it means that it tests for each replication method (switching
state) rather than synchronization of replication.

What I'm planning to have is:
* Confirm value of pg_stat_replication.sync_state (sync, async or potential)
* Standby promotion
* Standby catching up with the master
And each replication method gets the above tests.

Are these enough?

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Thom Brown
Date:
On 3 January 2016 at 13:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Dec 25, 2015 at 7:21 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro
>>> <thomas.munro@enterprisedb.com> wrote:
>>>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro
>>>> <thomas.munro@enterprisedb.com> wrote:
>>>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
>>>>> above, then this function could be renamed to SyncRepGetSyncStandbys.
>>>>> I think it would be a tiny bit nicer if it also took a Size n argument
>>>>> along with the output buffer pointer.
>>>
>>> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority()
>>> function uses synchronous_standby_num which is global variable.
>>> But you mean that the number of synchronous standbys is given as
>>> function argument?
>>
>> Yeah, I was thinking of it as the output buffer size which I would be
>> inclined to make more explicit (I am still coming to terms with the
>> use of global variables in Postgres) but it doesn't matter, please
>> disregard that suggestion.
>>
>>>>> As for the body of that function (which I won't paste here), it
>>>>> contains an algorithm to find the top K elements in an array of N
>>>>> elements.  It does that with a linear search through the top K seen so
>>>>> far for each value in the input array, so its worst case is O(KN)
>>>>> comparisons.  Some of the sorting gurus on this list might have
>>>>> something to say about that but my take is that it seems fine for the
>>>>> tiny values of K and N that we're dealing with here, and it's nice
>>>>> that it doesn't need any space other than the output buffer, unlike
>>>>> some other top-K algorithms which would win for larger inputs.
>>>
>>> Yeah, it's improvement point.
>>> But I'm assumed that the number of synchronous replication is not
>>> large, so I use this algorithm as first version.
>>> And I think that its worst case is O(K(N-K)). Am I missing something?
>>
>> You're right, I was dropping that detail, in the tradition of the
>> hand-wavy school of big-O notation.  (I suppose you could skip the
>> inner loop when the priority is lower than the current lowest
>> priority, giving a O(N) best case when the walsenders are perfectly
>> ordered by coincidence.  Probably a bad idea or just not worth
>> worrying about.)
>
> Thank you for reviewing the patch.
> Yeah, I added the logic that skip the inner loop.
>
>>
>>> Attached latest version patch.
>>
>> +/*
>> + * Obtain currently synced LSN location: write and flush, using priority
>> - * In 9.1 we support only a single synchronous standby, chosen from a
>> - * priority list of synchronous_standby_names. Before it can become the
>> + * In 9.6 we support multiple synchronous standby, chosen from a priority
>>
>> s/standby/standbys/
>>
>> + * list of synchronous_standby_names. Before it can become the
>>
>> s/Before it can become the/Before any standby can become a/
>>
>>   * synchronous standby it must have caught up with the primary; that may
>>   * take some time. Once caught up, the current highest priority standby
>>
>> s/standby/standbys/
>>
>>   * will release waiters from the queue.
>>
>> +bool
>> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>> +{
>> + int sync_standbys[synchronous_standby_num];
>>
>> I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM].
>> (Variable sized arrays are a feature of C99 and PostgreSQL is written
>> in C89.)
>>
>> +/*
>> + * Populate a caller-supplied array which much have enough space for
>> + * synchronous_standby_num. Returns position of standbys currently
>> + * considered as synchronous, and its length.
>> + */
>> +int
>> +SyncRepGetSyncStandbys(int *sync_standbys)
>>
>> s/much/must/ (my bad, in previous email).
>>
>> + ereport(ERROR,
>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> + errmsg("The number of synchronous standbys must be smaller than the
>> number of listed : %d",
>> + synchronous_standby_num)));
>>
>> How about "the number of synchronous standbys exceeds the length of
>> the standby list: %d"?  Error messages usually start with lower case,
>> ':' is not usually preceded by a space.
>>
>> + ereport(ERROR,
>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> + errmsg("The number of synchronous standbys must be between 1 and %d : %d",
>>
>> s/The/the/, s/ : /: /
>
> Fixed you mentioned.
>
> Attached latest v5 patch.
> Please review it.

synchronous_standby_num doesn't appear to be a valid GUC name:

LOG:  unrecognized configuration parameter "synchronous_standby_num"
in file "/home/thom/Development/test/primary/postgresql.conf" line 244

All I did was uncomment it and set it to a value.

Thom



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Tue, Jan 19, 2016 at 1:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Jan 18, 2016 at 1:20 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Sun, Jan 17, 2016 at 11:09 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Wed, Jan 13, 2016 at 1:54 AM, Alvaro Herrera wrote:
>>> * Confirm value of pg_stat_replication.sync_state (sync, async or potential)
>>> * Confirm that the data is synchronously replicated to multiple
>>> standbys in same cases.
>>>   * case 1 : The standby which is not listed in s_s_name, is down
>>>   * case 2 : The standby which is listed in s_s_names but potential
>>> standby, is down
>>>   * case 3 : The standby which is considered as sync standby, is down.
>>> * Standby promotion
>>>
>>> In order to confirm that the commit isn't done in case #3 forever
>>> unless new sync standby is up, I think we need the framework that
>>> cancels executing query.
>>> That is, what I'm planning is,
>>> 1. Set up master server (s_s_name = '2, standby1, standby2)
>>> 2. Set up two standby servers
>>> 3. Standby1 is down
>>> 4. Create some contents on master (But transaction is not committed)
>>> 5. Cancel the #4 query. (Also confirm that the flush location of only
>>> standby2 makes progress)
>>
>> This will need some thinking and is not as easy as it sounds. There is
>> no way to hold on a connection after executing a query in the current
>> TAP infrastructure. You are just mentioning case 3, but actually cases
>> 1 and 2 are falling into the same need: if there is a failure we want
>> to be able to not be stuck in the test forever and have a way to
>> cancel a query execution at will. TAP uses psql -c to execute any sql
>> queries, but we would need something that is far lower-level, and that
>> would be basically using the perl driver for Postgres or an equivalent
>> here.
>>
>> Honestly for those tests I just thought that we could get to something
>> reliable by just looking at how each sync replication setup reflects
>> in pg_stat_replication as the flow is really getting complicated,
>> giving to the user a clear representation at SQL level of what is
>> actually occurring in the server depending on the configuration used
>> being important here.
>
> I see.
> We could check the transition of sync_state in pg_stat_replication.
> I think it means that it tests for each replication method (switching
> state) rather than synchronization of replication.
>
> What I'm planning to have are,
> * Confirm value of pg_stat_replication.sync_state (sync, async or potential)
> * Standby promotion
> * Standby catching up master
> And each replication method has above tests.
>
> Are these enough?

Does promoting the standby and checking that it caught up really have
value in the context of this patch? What we just want to know is, on a
master, which nodes need to be waited for when s_s_names or any other
method is used, no?
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Tue, Jan 19, 2016 at 1:52 AM, Thom Brown <thom@linux.com> wrote:
> On 3 January 2016 at 13:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Fri, Dec 25, 2015 at 7:21 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro
>>>> <thomas.munro@enterprisedb.com> wrote:
>>>>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro
>>>>> <thomas.munro@enterprisedb.com> wrote:
>>>>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
>>>>>> above, then this function could be renamed to SyncRepGetSyncStandbys.
>>>>>> I think it would be a tiny bit nicer if it also took a Size n argument
>>>>>> along with the output buffer pointer.
>>>>
>>>> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority()
>>>> function uses synchronous_standby_num which is global variable.
>>>> But you mean that the number of synchronous standbys is given as
>>>> function argument?
>>>
>>> Yeah, I was thinking of it as the output buffer size which I would be
>>> inclined to make more explicit (I am still coming to terms with the
>>> use of global variables in Postgres) but it doesn't matter, please
>>> disregard that suggestion.
>>>
>>>>>> As for the body of that function (which I won't paste here), it
>>>>>> contains an algorithm to find the top K elements in an array of N
>>>>>> elements.  It does that with a linear search through the top K seen so
>>>>>> far for each value in the input array, so its worst case is O(KN)
>>>>>> comparisons.  Some of the sorting gurus on this list might have
>>>>>> something to say about that but my take is that it seems fine for the
>>>>>> tiny values of K and N that we're dealing with here, and it's nice
>>>>>> that it doesn't need any space other than the output buffer, unlike
>>>>>> some other top-K algorithms which would win for larger inputs.
>>>>
>>>> Yeah, it's improvement point.
>>>> But I'm assumed that the number of synchronous replication is not
>>>> large, so I use this algorithm as first version.
>>>> And I think that its worst case is O(K(N-K)). Am I missing something?
>>>
>>> You're right, I was dropping that detail, in the tradition of the
>>> hand-wavy school of big-O notation.  (I suppose you could skip the
>>> inner loop when the priority is lower than the current lowest
>>> priority, giving a O(N) best case when the walsenders are perfectly
>>> ordered by coincidence.  Probably a bad idea or just not worth
>>> worrying about.)
>>
>> Thank you for reviewing the patch.
>> Yeah, I added the logic that skip the inner loop.
>>
>>>
>>>> Attached latest version patch.
>>>
>>> +/*
>>> + * Obtain currently synced LSN location: write and flush, using priority
>>> - * In 9.1 we support only a single synchronous standby, chosen from a
>>> - * priority list of synchronous_standby_names. Before it can become the
>>> + * In 9.6 we support multiple synchronous standby, chosen from a priority
>>>
>>> s/standby/standbys/
>>>
>>> + * list of synchronous_standby_names. Before it can become the
>>>
>>> s/Before it can become the/Before any standby can become a/
>>>
>>>   * synchronous standby it must have caught up with the primary; that may
>>>   * take some time. Once caught up, the current highest priority standby
>>>
>>> s/standby/standbys/
>>>
>>>   * will release waiters from the queue.
>>>
>>> +bool
>>> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>>> +{
>>> + int sync_standbys[synchronous_standby_num];
>>>
>>> I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM].
>>> (Variable sized arrays are a feature of C99 and PostgreSQL is written
>>> in C89.)
>>>
>>> +/*
>>> + * Populate a caller-supplied array which much have enough space for
>>> + * synchronous_standby_num. Returns position of standbys currently
>>> + * considered as synchronous, and its length.
>>> + */
>>> +int
>>> +SyncRepGetSyncStandbys(int *sync_standbys)
>>>
>>> s/much/must/ (my bad, in previous email).
>>>
>>> + ereport(ERROR,
>>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>>> + errmsg("The number of synchronous standbys must be smaller than the
>>> number of listed : %d",
>>> + synchronous_standby_num)));
>>>
>>> How about "the number of synchronous standbys exceeds the length of
>>> the standby list: %d"?  Error messages usually start with lower case,
>>> ':' is not usually preceded by a space.
>>>
>>> + ereport(ERROR,
>>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>>> + errmsg("The number of synchronous standbys must be between 1 and %d : %d",
>>>
>>> s/The/the/, s/ : /: /
>>
>> Fixed you mentioned.
>>
>> Attached latest v5 patch.
>> Please review it.
>
> synchronous_standby_num doesn't appear to be a valid GUC name:
>
> LOG:  unrecognized configuration parameter "synchronous_standby_num"
> in file "/home/thom/Development/test/primary/postgresql.conf" line 244
>
> All I did was uncomment it and set it to a value.
>

Thank you for having a look at it.

Yeah, synchronous_standby_num should not exist in postgresql.conf.
Please test multiple sync replication with the latest patch.
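
As a side note on the top-K scan reviewed upthread: a minimal,
self-contained C sketch of that selection logic, including the
inner-loop skip, might look like the following (the struct and all
names here are stand-ins for illustration, not the patch's actual
code):

#include <string.h>

typedef struct
{
    int     pid;        /* 0 means the slot is unused */
    int     priority;   /* lower value = higher priority, 0 = async */
} WalSndStandIn;

/*
 * Fill sync_standbys[] (which must have room for k entries) with the
 * indexes of the k highest-priority active standbys, kept ordered by
 * priority.  Returns how many were found.
 */
static int
pick_sync_standbys(const WalSndStandIn *walsnd, int nsenders,
                   int *sync_standbys, int k)
{
    int     nsync = 0;
    int     i;
    int     j;

    for (i = 0; i < nsenders; i++)
    {
        int     prio = walsnd[i].priority;

        if (walsnd[i].pid == 0 || prio == 0)
            continue;       /* unused slot or asynchronous standby */

        /* The inner-loop skip: this one cannot displace anybody. */
        if (nsync == k &&
            prio >= walsnd[sync_standbys[nsync - 1]].priority)
            continue;

        /* Linear search through the current top K for the insert point. */
        for (j = 0; j < nsync; j++)
        {
            if (prio < walsnd[sync_standbys[j]].priority)
                break;
        }

        /* Shift lower-priority entries down; the last one may fall off. */
        if (nsync < k)
            nsync++;
        memmove(&sync_standbys[j + 1], &sync_standbys[j],
                (nsync - j - 1) * sizeof(int));
        sync_standbys[j] = i;
    }

    return nsync;
}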

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Tue, Jan 19, 2016 at 2:55 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Jan 19, 2016 at 1:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Mon, Jan 18, 2016 at 1:20 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Sun, Jan 17, 2016 at 11:09 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> On Wed, Jan 13, 2016 at 1:54 AM, Alvaro Herrera wrote:
>>>> * Confirm value of pg_stat_replication.sync_state (sync, async or potential)
>>>> * Confirm that the data is synchronously replicated to multiple
>>>> standbys in same cases.
>>>>   * case 1 : The standby which is not listed in s_s_name, is down
>>>>   * case 2 : The standby which is listed in s_s_names but potential
>>>> standby, is down
>>>>   * case 3 : The standby which is considered as sync standby, is down.
>>>> * Standby promotion
>>>>
>>>> In order to confirm that the commit isn't done in case #3 forever
>>>> unless new sync standby is up, I think we need the framework that
>>>> cancels executing query.
>>>> That is, what I'm planning is,
>>>> 1. Set up master server (s_s_name = '2, standby1, standby2)
>>>> 2. Set up two standby servers
>>>> 3. Standby1 is down
>>>> 4. Create some contents on master (But transaction is not committed)
>>>> 5. Cancel the #4 query. (Also confirm that the flush location of only
>>>> standby2 makes progress)
>>>
>>> This will need some thinking and is not as easy as it sounds. There is
>>> no way to hold on a connection after executing a query in the current
>>> TAP infrastructure. You are just mentioning case 3, but actually cases
>>> 1 and 2 are falling into the same need: if there is a failure we want
>>> to be able to not be stuck in the test forever and have a way to
>>> cancel a query execution at will. TAP uses psql -c to execute any sql
>>> queries, but we would need something that is far lower-level, and that
>>> would be basically using the perl driver for Postgres or an equivalent
>>> here.
>>>
>>> Honestly for those tests I just thought that we could get to something
>>> reliable by just looking at how each sync replication setup reflects
>>> in pg_stat_replication as the flow is really getting complicated,
>>> giving to the user a clear representation at SQL level of what is
>>> actually occurring in the server depending on the configuration used
>>> being important here.
>>
>> I see.
>> We could check the transition of sync_state in pg_stat_replication.
>> I think it means that it tests for each replication method (switching
>> state) rather than synchronization of replication.
>>
>> What I'm planning to have are,
>> * Confirm value of pg_stat_replication.sync_state (sync, async or potential)
>> * Standby promotion
>> * Standby catching up master
>> And each replication method has above tests.
>>
>> Are these enough?
>
> Does promoting the standby and checking that it caught really have
> value in this context of this patch? What we just want to know is on a
> master, which nodes need to be waited for when s_s_names or any other
> method is used, no?

Yeah, these two tests are outside the context of this patch.
If the test framework had a facility that allowed us to execute a
query (via psql) as a separate process, we could use the
pg_cancel_backend() function on the waiting process while the master
server is waiting for standbys. In order to check whether the master
server would wait for the standby or not, the test framework needs to
have such a facility, I think.
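
(Once such a facility exists, the stuck commit could be cancelled from
a second session with something along these lines; the table name here
is hypothetical:

    SELECT pg_cancel_backend(pid)
    FROM pg_stat_activity
    WHERE query LIKE 'INSERT INTO test_table%';
)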

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Jan 20, 2016 at 2:35 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Jan 19, 2016 at 1:52 AM, Thom Brown <thom@linux.com> wrote:
>> On 3 January 2016 at 13:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Fri, Dec 25, 2015 at 7:21 AM, Thomas Munro
>>> <thomas.munro@enterprisedb.com> wrote:
>>>> On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro
>>>>> <thomas.munro@enterprisedb.com> wrote:
>>>>>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro
>>>>>> <thomas.munro@enterprisedb.com> wrote:
>>>>>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
>>>>>>> above, then this function could be renamed to SyncRepGetSyncStandbys.
>>>>>>> I think it would be a tiny bit nicer if it also took a Size n argument
>>>>>>> along with the output buffer pointer.
>>>>>
>>>>> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority()
>>>>> function uses synchronous_standby_num which is global variable.
>>>>> But you mean that the number of synchronous standbys is given as
>>>>> function argument?
>>>>
>>>> Yeah, I was thinking of it as the output buffer size which I would be
>>>> inclined to make more explicit (I am still coming to terms with the
>>>> use of global variables in Postgres) but it doesn't matter, please
>>>> disregard that suggestion.
>>>>
>>>>>>> As for the body of that function (which I won't paste here), it
>>>>>>> contains an algorithm to find the top K elements in an array of N
>>>>>>> elements.  It does that with a linear search through the top K seen so
>>>>>>> far for each value in the input array, so its worst case is O(KN)
>>>>>>> comparisons.  Some of the sorting gurus on this list might have
>>>>>>> something to say about that but my take is that it seems fine for the
>>>>>>> tiny values of K and N that we're dealing with here, and it's nice
>>>>>>> that it doesn't need any space other than the output buffer, unlike
>>>>>>> some other top-K algorithms which would win for larger inputs.
>>>>>
>>>>> Yeah, it's improvement point.
>>>>> But I'm assumed that the number of synchronous replication is not
>>>>> large, so I use this algorithm as first version.
>>>>> And I think that its worst case is O(K(N-K)). Am I missing something?
>>>>
>>>> You're right, I was dropping that detail, in the tradition of the
>>>> hand-wavy school of big-O notation.  (I suppose you could skip the
>>>> inner loop when the priority is lower than the current lowest
>>>> priority, giving a O(N) best case when the walsenders are perfectly
>>>> ordered by coincidence.  Probably a bad idea or just not worth
>>>> worrying about.)
>>>
>>> Thank you for reviewing the patch.
>>> Yeah, I added the logic that skip the inner loop.
>>>
>>>>
>>>>> Attached latest version patch.
>>>>
>>>> +/*
>>>> + * Obtain currently synced LSN location: write and flush, using priority
>>>> - * In 9.1 we support only a single synchronous standby, chosen from a
>>>> - * priority list of synchronous_standby_names. Before it can become the
>>>> + * In 9.6 we support multiple synchronous standby, chosen from a priority
>>>>
>>>> s/standby/standbys/
>>>>
>>>> + * list of synchronous_standby_names. Before it can become the
>>>>
>>>> s/Before it can become the/Before any standby can become a/
>>>>
>>>>   * synchronous standby it must have caught up with the primary; that may
>>>>   * take some time. Once caught up, the current highest priority standby
>>>>
>>>> s/standby/standbys/
>>>>
>>>>   * will release waiters from the queue.
>>>>
>>>> +bool
>>>> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>>>> +{
>>>> + int sync_standbys[synchronous_standby_num];
>>>>
>>>> I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM].
>>>> (Variable sized arrays are a feature of C99 and PostgreSQL is written
>>>> in C89.)
>>>>
>>>> +/*
>>>> + * Populate a caller-supplied array which much have enough space for
>>>> + * synchronous_standby_num. Returns position of standbys currently
>>>> + * considered as synchronous, and its length.
>>>> + */
>>>> +int
>>>> +SyncRepGetSyncStandbys(int *sync_standbys)
>>>>
>>>> s/much/must/ (my bad, in previous email).
>>>>
>>>> + ereport(ERROR,
>>>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>>>> + errmsg("The number of synchronous standbys must be smaller than the
>>>> number of listed : %d",
>>>> + synchronous_standby_num)));
>>>>
>>>> How about "the number of synchronous standbys exceeds the length of
>>>> the standby list: %d"?  Error messages usually start with lower case,
>>>> ':' is not usually preceded by a space.
>>>>
>>>> + ereport(ERROR,
>>>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>>>> + errmsg("The number of synchronous standbys must be between 1 and %d : %d",
>>>>
>>>> s/The/the/, s/ : /: /
>>>
>>> Fixed you mentioned.
>>>
>>> Attached latest v5 patch.
>>> Please review it.
>>
>> synchronous_standby_num doesn't appear to be a valid GUC name:
>>
>> LOG:  unrecognized configuration parameter "synchronous_standby_num"
>> in file "/home/thom/Development/test/primary/postgresql.conf" line 244
>>
>> All I did was uncomment it and set it to a value.
>>
>
> Thank you for having a look it.
>
> Yeah, synchronous_standby_num should not exists in postgresql.conf.
> Please test for multiple sync replication with latest patch.

In the synchronous_replication_method = 'priority' case, when I set
synchronous_standby_names to an invalid value like 'hoge,foo' and
reloaded the configuration file, the server crashed with
the following error. This crash should not happen.

    FATAL:  invalid input syntax for integer: "hoge"
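
Presumably the leading count token needs to be validated rather than
handed straight to the integer parser. A rough sketch of such a
defensive check, with the limit constant borrowed from the review
upthread but the function name and placeholder value made up for
illustration:

#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define SYNC_REP_MAX_SYNC_STANDBY_NUM 256   /* placeholder value */

/* Return false for a bad count token instead of erroring out. */
static bool
parse_sync_standby_num(const char *token, int *num)
{
    char   *endp;
    long    val;

    errno = 0;
    val = strtol(token, &endp, 10);
    if (errno != 0 || endp == token || *endp != '\0')
        return false;       /* e.g. "hoge" is rejected, not FATAL */
    if (val < 1 || val > SYNC_REP_MAX_SYNC_STANDBY_NUM)
        return false;
    *num = (int) val;
    return true;
}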

+    /*
+     * After read all synchronous replication configuration parameter, we apply
+     * settings according to replication method.
+     */
+    ProcessSynchronousReplicationConfig();

Why does the above function need to be called in ProcessConfigFile(), i.e.,
by every postgres process? I was thinking that only the walsender should
call it, to check which walsenders are synchronous according to the setting.

When synchronous_replication_method = '1-priority' and
synchronous_standby_names = '*', I started one synchronous standby.
Then, when I ran "SELECT * FROM pg_stat_replication", I got the
following WARNING message.
   WARNING:  detected write past chunk end in ExprContext 0x2acb3c0

I don't think that it's good design to specify the number of sync replicas
to wait for in synchronous_standby_names. It's confusing for users.
It's better to add a separate parameter (synchronous_standby_num) for
specifying that number, although that increases the number of GUC
parameters.

Are we really planning to implement synchronous_replication_method=quorum
in the first version? If not, I'd like to remove the s_r_method parameter
because it's meaningless for now. We can add it later when we implement
"quorum".

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Thu, Jan 28, 2016 at 8:05 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Jan 20, 2016 at 2:35 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Tue, Jan 19, 2016 at 1:52 AM, Thom Brown <thom@linux.com> wrote:
>>> On 3 January 2016 at 13:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> On Fri, Dec 25, 2015 at 7:21 AM, Thomas Munro
>>>> <thomas.munro@enterprisedb.com> wrote:
>>>>> On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro
>>>>>> <thomas.munro@enterprisedb.com> wrote:
>>>>>>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro
>>>>>>> <thomas.munro@enterprisedb.com> wrote:
>>>>>>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
>>>>>>>> above, then this function could be renamed to SyncRepGetSyncStandbys.
>>>>>>>> I think it would be a tiny bit nicer if it also took a Size n argument
>>>>>>>> along with the output buffer pointer.
>>>>>>
>>>>>> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority()
>>>>>> function uses synchronous_standby_num which is global variable.
>>>>>> But you mean that the number of synchronous standbys is given as
>>>>>> function argument?
>>>>>
>>>>> Yeah, I was thinking of it as the output buffer size which I would be
>>>>> inclined to make more explicit (I am still coming to terms with the
>>>>> use of global variables in Postgres) but it doesn't matter, please
>>>>> disregard that suggestion.
>>>>>
>>>>>>>> As for the body of that function (which I won't paste here), it
>>>>>>>> contains an algorithm to find the top K elements in an array of N
>>>>>>>> elements.  It does that with a linear search through the top K seen so
>>>>>>>> far for each value in the input array, so its worst case is O(KN)
>>>>>>>> comparisons.  Some of the sorting gurus on this list might have
>>>>>>>> something to say about that but my take is that it seems fine for the
>>>>>>>> tiny values of K and N that we're dealing with here, and it's nice
>>>>>>>> that it doesn't need any space other than the output buffer, unlike
>>>>>>>> some other top-K algorithms which would win for larger inputs.
>>>>>>
>>>>>> Yeah, it's improvement point.
>>>>>> But I'm assumed that the number of synchronous replication is not
>>>>>> large, so I use this algorithm as first version.
>>>>>> And I think that its worst case is O(K(N-K)). Am I missing something?
>>>>>
>>>>> You're right, I was dropping that detail, in the tradition of the
>>>>> hand-wavy school of big-O notation.  (I suppose you could skip the
>>>>> inner loop when the priority is lower than the current lowest
>>>>> priority, giving a O(N) best case when the walsenders are perfectly
>>>>> ordered by coincidence.  Probably a bad idea or just not worth
>>>>> worrying about.)
>>>>
>>>> Thank you for reviewing the patch.
>>>> Yeah, I added the logic that skip the inner loop.
>>>>
>>>>>
>>>>>> Attached latest version patch.
>>>>>
>>>>> +/*
>>>>> + * Obtain currently synced LSN location: write and flush, using priority
>>>>> - * In 9.1 we support only a single synchronous standby, chosen from a
>>>>> - * priority list of synchronous_standby_names. Before it can become the
>>>>> + * In 9.6 we support multiple synchronous standby, chosen from a priority
>>>>>
>>>>> s/standby/standbys/
>>>>>
>>>>> + * list of synchronous_standby_names. Before it can become the
>>>>>
>>>>> s/Before it can become the/Before any standby can become a/
>>>>>
>>>>>   * synchronous standby it must have caught up with the primary; that may
>>>>>   * take some time. Once caught up, the current highest priority standby
>>>>>
>>>>> s/standby/standbys/
>>>>>
>>>>>   * will release waiters from the queue.
>>>>>
>>>>> +bool
>>>>> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>>>>> +{
>>>>> + int sync_standbys[synchronous_standby_num];
>>>>>
>>>>> I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM].
>>>>> (Variable sized arrays are a feature of C99 and PostgreSQL is written
>>>>> in C89.)
>>>>>
>>>>> +/*
>>>>> + * Populate a caller-supplied array which much have enough space for
>>>>> + * synchronous_standby_num. Returns position of standbys currently
>>>>> + * considered as synchronous, and its length.
>>>>> + */
>>>>> +int
>>>>> +SyncRepGetSyncStandbys(int *sync_standbys)
>>>>>
>>>>> s/much/must/ (my bad, in previous email).
>>>>>
>>>>> + ereport(ERROR,
>>>>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>>>>> + errmsg("The number of synchronous standbys must be smaller than the
>>>>> number of listed : %d",
>>>>> + synchronous_standby_num)));
>>>>>
>>>>> How about "the number of synchronous standbys exceeds the length of
>>>>> the standby list: %d"?  Error messages usually start with lower case,
>>>>> ':' is not usually preceded by a space.
>>>>>
>>>>> + ereport(ERROR,
>>>>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>>>>> + errmsg("The number of synchronous standbys must be between 1 and %d : %d",
>>>>>
>>>>> s/The/the/, s/ : /: /
>>>>
>>>> Fixed you mentioned.
>>>>
>>>> Attached latest v5 patch.
>>>> Please review it.
>>>
>>> synchronous_standby_num doesn't appear to be a valid GUC name:
>>>
>>> LOG:  unrecognized configuration parameter "synchronous_standby_num"
>>> in file "/home/thom/Development/test/primary/postgresql.conf" line 244
>>>
>>> All I did was uncomment it and set it to a value.
>>>
>>
>> Thank you for having a look it.
>>
>> Yeah, synchronous_standby_num should not exists in postgresql.conf.
>> Please test for multiple sync replication with latest patch.
>
> In synchronous_replication_method = 'priority' case, when I set
> synchronous_standby_names to invalid value like 'hoge,foo' and
> reloaded the configuration file, the server crashed with
> the following error. This crash should not happen.
>
>     FATAL:  invalid input syntax for integer: "hoge"
>
> +    /*
> +     * After read all synchronous replication configuration parameter, we apply
> +     * settings according to replication method.
> +     */
> +    ProcessSynchronousReplicationConfig();
>
> Why does the above function need to be called in ProcessConfigFile(), i.e.,
> by every postgres processes? I was thinking that only walsender should
> call that to check which walsender is synchronous according to the setting.
>
> When synchronous_replication_method = '1-priority' and
> synchronous_standby_names = '*', I started one synchronous standby.
> Then, when I ran "SELECT * FROM pg_stat_replication", I got the
> following WARNING message.
>
>     WARNING:  detected write past chunk end in ExprContext 0x2acb3c0
>
> I don't think that it's good design to specify the number of sync replicas
> to wait for, in synchronous_standby_names. It's confusing for the users.
> It's better to add separate parameter (synchronous_standby_num) for
> specifying that number. Which increases the number of GUC parameters,
> though.
>
> Are we really planning to implement synchronous_replication_method=quorum
> at the first version? If not, I'd like to remove s_r_method parameter
> because it's meaningless. We can add it later when we implement "quorum".

Thank you for your comment.

Based on the discussions so far, I'm planning to have several replication
methods such as 'quorum' and 'complex' in the future, and each
replication method specifies the syntax of s_s_names.
It means that s_s_names could contain the number of sync standbys, as
the current patch does.
If we had an additional GUC like synchronous_standby_num, then it would
look odd, I think.

Even if we don't have the 'quorum' method in the first version, the syntax
of s_s_names is completely different between 'priority' and '1-priority'.
So we will need a new GUC parameter like s_r_method in order to
specify the syntax of s_s_names, I think.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote:
> By the discussions so far, I'm planning to have several replication
> methods such as 'quorum', 'complex' in the feature, and the each
> replication method specifies the syntax of s_s_names.
> It means that s_s_names could have the number of sync standbys like
> what current patch does.

What if the application_name of a standby node has the format of an integer?
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote:
>> By the discussions so far, I'm planning to have several replication
>> methods such as 'quorum', 'complex' in the feature, and the each
>> replication method specifies the syntax of s_s_names.
>> It means that s_s_names could have the number of sync standbys like
>> what current patch does.
>
> What if the application_name of a standby node has the format of an integer?

Even if the standby has an integer as application_name, we can set
s_s_names like '2,1,2,3'.
The leading '2' is always handled as the number of sync standbys when
s_r_method = 'priority'.
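
For example, with three standbys whose application_name values are
literally '1', '2' and '3' (hypothetical names, for illustration):

    s_s_names = '2, 1, 2, 3'
    # first token "2"  : the number of sync standbys to wait for
    # remaining tokens : the priority-ordered standby names "1", "2", "3"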

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Sun, Jan 31, 2016 at 5:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote:
>>> By the discussions so far, I'm planning to have several replication
>>> methods such as 'quorum', 'complex' in the feature, and the each
>>> replication method specifies the syntax of s_s_names.
>>> It means that s_s_names could have the number of sync standbys like
>>> what current patch does.
>>
>> What if the application_name of a standby node has the format of an integer?
>
> Even if the standby has an integer as application_name, we can set
> s_s_names like '2,1,2,3'.
> The leading '2' is always handled as the number of sync standbys when
> s_r_method = 'priority'.

Hm. I agree with Fujii-san here: having the number of sync standbys
defined in a parameter that should hold a list of names is a bit
confusing. I'd rather have a separate GUC, which brings us back to one
of the first patches that I came up with, and a couple of people,
including Josh, were not happy with that because it did not support
real quorum. Perhaps the final answer would really be to get a set of
hooks, and a contrib module making use of them.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Sun, Jan 31, 2016 at 5:18 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sun, Jan 31, 2016 at 5:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote:
>>>> By the discussions so far, I'm planning to have several replication
>>>> methods such as 'quorum', 'complex' in the feature, and the each
>>>> replication method specifies the syntax of s_s_names.
>>>> It means that s_s_names could have the number of sync standbys like
>>>> what current patch does.
>>>
>>> What if the application_name of a standby node has the format of an integer?
>>
>> Even if the standby has an integer as application_name, we can set
>> s_s_names like '2,1,2,3'.
>> The leading '2' is always handled as the number of sync standbys when
>> s_r_method = 'priority'.
>
> Hm. I agree with Fujii-san here, having the number of sync standbys
> defined in a parameter that should have a list of names is a bit
> confusing. I'd rather have a separate GUC, which brings us back to one
> of the first patches that I came up with, and a couple of people,
> including Josh were not happy with that because this did not support
> real quorum. Perhaps the final answer would be really to get a set of
> hooks, and a contrib module making use of that.

Yeah, I agree with having a set of hooks, with postgres core providing
a simple multi-sync replication mechanism like the one you suggested in
the first version.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Sun, Jan 31, 2016 at 5:28 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sun, Jan 31, 2016 at 5:18 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Sun, Jan 31, 2016 at 5:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>> On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote:
>>>>> By the discussions so far, I'm planning to have several replication
>>>>> methods such as 'quorum', 'complex' in the feature, and the each
>>>>> replication method specifies the syntax of s_s_names.
>>>>> It means that s_s_names could have the number of sync standbys like
>>>>> what current patch does.
>>>>
>>>> What if the application_name of a standby node has the format of an integer?
>>>
>>> Even if the standby has an integer as application_name, we can set
>>> s_s_names like '2,1,2,3'.
>>> The leading '2' is always handled as the number of sync standbys when
>>> s_r_method = 'priority'.
>>
>> Hm. I agree with Fujii-san here, having the number of sync standbys
>> defined in a parameter that should have a list of names is a bit
>> confusing. I'd rather have a separate GUC, which brings us back to one
>> of the first patches that I came up with, and a couple of people,
>> including Josh were not happy with that because this did not support
>> real quorum. Perhaps the final answer would be really to get a set of
>> hooks, and a contrib module making use of that.
>
> Yeah, I agree with having set of hooks, and postgres core has simple
> multi sync replication mechanism like you suggested at first version.

If there are hooks, I don't think that we should really bother about
having in core anything more complicated than what we have now. The
trick will be to come up with a hook design modular enough to support
the kinds of configurations mentioned on this thread. Roughly: perhaps
a refactoring of the syncrep code so that it is possible to wait for
multiple targets, some of them being optional; one modular way in
pg_stat_get_wal_senders to represent the status of a node to the user;
and another hook to decide which nodes need to be waited for. Some of
the nodes being waited for may be chosen based on conditions, for
quorum support. Doing that in a flexible enough way is a hard problem.
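
To sketch what such a hook could look like (the type and variable names
here are purely hypothetical, not from any patch on this thread):

    /* Decide which walsender slots a committing backend must wait for;
     * returns the number of slots written into wait_slots. */
    typedef int (*SyncRepSelectStandbys_hook_type) (int *wait_slots,
                                                    int max_slots);

    extern SyncRepSelectStandbys_hook_type SyncRepSelectStandbys_hook;

A quorum-aware contrib module would set the hook and fill wait_slots
according to its own configuration language, while the core syncrep
code would only know how to wait on the returned set.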
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Sun, Jan 31, 2016 at 8:58 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sun, Jan 31, 2016 at 5:28 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Sun, Jan 31, 2016 at 5:18 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Sun, Jan 31, 2016 at 5:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier
>>>> <michael.paquier@gmail.com> wrote:
>>>>> On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote:
>>>>>> By the discussions so far, I'm planning to have several replication
>>>>>> methods such as 'quorum', 'complex' in the feature, and the each
>>>>>> replication method specifies the syntax of s_s_names.
>>>>>> It means that s_s_names could have the number of sync standbys like
>>>>>> what current patch does.
>>>>>
>>>>> What if the application_name of a standby node has the format of an integer?
>>>>
>>>> Even if the standby has an integer as application_name, we can set
>>>> s_s_names like '2,1,2,3'.
>>>> The leading '2' is always handled as the number of sync standbys when
>>>> s_r_method = 'priority'.
>>>
>>> Hm. I agree with Fujii-san here, having the number of sync standbys
>>> defined in a parameter that should have a list of names is a bit
>>> confusing. I'd rather have a separate GUC, which brings us back to one
>>> of the first patches that I came up with, and a couple of people,
>>> including Josh were not happy with that because this did not support
>>> real quorum. Perhaps the final answer would be really to get a set of
>>> hooks, and a contrib module making use of that.
>>
>> Yeah, I agree with having set of hooks, and postgres core has simple
>> multi sync replication mechanism like you suggested at first version.
>
> If there are hooks, I don't think that we should really bother about
> having in core anything more complicated than what we have now. The
> trick will be to come up with a hook design modular enough to support
> the kind of configurations mentioned on this thread. Roughly perhaps a
> refactoring of the syncrep code so as it is possible to wait for
> multiple targets some of them being optional,, one modular way in
> pg_stat_get_wal_senders to represent the status of a node to user, and
> another hook to return to decide which are the nodes to wait for. Some
> of the nodes being waited for may be based on conditions for quorum
> support. That's a hard problem to do that in a flexible enough way.

Hm, I think non-nested quorum and priority methods are not complicated,
and we should support at least one of those simple methods, or both, in
core postgres.
More complicated methods, such as a JSON-style configuration or a
dedicated language, could be supported by an external module.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Mon, Feb 1, 2016 at 5:36 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sun, Jan 31, 2016 at 8:58 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Sun, Jan 31, 2016 at 5:28 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Sun, Jan 31, 2016 at 5:18 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>> On Sun, Jan 31, 2016 at 5:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>> On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier
>>>>> <michael.paquier@gmail.com> wrote:
>>>>>> On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote:
>>>>>>> By the discussions so far, I'm planning to have several replication
>>>>>>> methods such as 'quorum', 'complex' in the feature, and the each
>>>>>>> replication method specifies the syntax of s_s_names.
>>>>>>> It means that s_s_names could have the number of sync standbys like
>>>>>>> what current patch does.
>>>>>>
>>>>>> What if the application_name of a standby node has the format of an integer?
>>>>>
>>>>> Even if the standby has an integer as application_name, we can set
>>>>> s_s_names like '2,1,2,3'.
>>>>> The leading '2' is always handled as the number of sync standbys when
>>>>> s_r_method = 'priority'.
>>>>
>>>> Hm. I agree with Fujii-san here, having the number of sync standbys
>>>> defined in a parameter that should have a list of names is a bit
>>>> confusing. I'd rather have a separate GUC, which brings us back to one
>>>> of the first patches that I came up with, and a couple of people,
>>>> including Josh were not happy with that because this did not support
>>>> real quorum. Perhaps the final answer would be really to get a set of
>>>> hooks, and a contrib module making use of that.
>>>
>>> Yeah, I agree with having set of hooks, and postgres core has simple
>>> multi sync replication mechanism like you suggested at first version.
>>
>> If there are hooks, I don't think that we should really bother about
>> having in core anything more complicated than what we have now. The
>> trick will be to come up with a hook design modular enough to support
>> the kind of configurations mentioned on this thread. Roughly perhaps a
>> refactoring of the syncrep code so as it is possible to wait for
>> multiple targets some of them being optional,, one modular way in
>> pg_stat_get_wal_senders to represent the status of a node to user, and
>> another hook to return to decide which are the nodes to wait for. Some
>> of the nodes being waited for may be based on conditions for quorum
>> support. That's a hard problem to do that in a flexible enough way.
>
> Hm, I think not-nested quorum and priority are not complicated, and we
> should support at least both or either simple method in core of
> postgres.
> More complicated method like using json-style, or dedicated language
> would be supported by external module.

So what about the following plan?

[first version]
Add only synchronous_standby_num which specifies the number of standbys
that the master must wait for before marking sync replication as completed.
This version supports simple use cases like "I want to have two synchronous
standbys".

[second version]
Add synchronous_replication_method: 'priority' and 'quorum'. This version
additionally supports the simple quorum commit case like "I want to ensure
that WAL is replicated synchronously to at least two of the five standbys
listed in s_s_names".

Or

Add something like quorum_replication_num and quorum_standby_names, i.e.,
the master must wait for at least q_r_num standbys from the ones listed in
q_s_names before marking sync replication as completed. Also the master
must wait for sync replication according to s_s_num and s_s_names.
That is, this approach separates 'priority' and 'quorum' into their own
parameters. This increases the number of GUC parameters, but ISTM less
confusing, and it supports a slightly more complicated case like "there is
one local standby and three remote standbys, and I want to ensure that WAL
is replicated synchronously to the local standby and at least two remote
ones", e.g.,

  s_s_num = 1, s_s_names = 'local'
  q_s_num = 2, q_s_names = 'remote1, remote2, remote3'

[third version]
Add the hooks for more complicated sync replication cases.

I'm thinking that the realistic target for 9.6 might be the first one.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:


On Mon, Feb 1, 2016 at 11:28 PM, Fujii Masao wrote:
> [first version]
> Add only synchronous_standby_num which specifies the number of standbys
> that the master must wait for before marking sync replication as completed.
> This version supports simple use cases like "I want to have two synchronous
> standbys".
>
> [second version]
> Add synchronous_replication_method: 'prioriry' and 'quorum'. This version
> additionally supports simple quorum commit case like "I want to ensure
> that WAL is replicated synchronously to at least two standbys from five
> ones listed in s_s_names".
>
> Or
>
> Add something like quorum_replication_num and quorum_standby_names, i.e.,
> the master must wait for at least q_r_num standbys from ones listed in
> q_s_names before marking sync replication as completed. Also the master
> must wait for sync replication according to s_s_num and s_s_num.
> That is, this approach separates 'priority' and 'quorum' to each parameters.
> This increases the number of GUC parameters, but ISTM less confusing, and
> it supports a bit complicated case like "there is one local standby and three
> remote standbys, then I want to ensure that WAL is replicated synchronously
> to the local standby and at least two remote one", e.g.,
>
>   s_s_num = 1, s_s_names = 'local'
>   q_s_num = 2, q_s_names = 'remote1, remote2, remote3'
>
> [third version]
> Add the hooks for more complicated sync replication cases.
>
> I'm thinking that the realistic target for 9.6 might be the first one.
 
If we want to get something out for this release, clearly yes, and being able to specify two sync targets is already a win when the two sync standbys are not at exactly the same location. FWIW, I don't mind doing coding and/or review work: that's basically my first patch, which needs a bit more love and polishing, *and* test cases, but I am used enough to perl and PostgresNode these days to produce something based on sanity checks of pg_stat_replication, plus my other set of patches that add more basic routines.

Now, I would not mind if we actually jumped straight to the 3rd case if we are fine with doing nothing for this release, but that requires a lot of design and background work, so it's not plausible for 9.6. Of course, if there are voices against the scenario proposed by Fujii-san, feel free to speak up.
--
Michael

Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Mon, Feb 1, 2016 at 11:28 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Feb 1, 2016 at 5:36 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Sun, Jan 31, 2016 at 8:58 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Sun, Jan 31, 2016 at 5:28 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> On Sun, Jan 31, 2016 at 5:18 PM, Michael Paquier
>>>> <michael.paquier@gmail.com> wrote:
>>>>> On Sun, Jan 31, 2016 at 5:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>> On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier
>>>>>> <michael.paquier@gmail.com> wrote:
>>>>>>> On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote:
>>>>>>>> By the discussions so far, I'm planning to have several replication
>>>>>>>> methods such as 'quorum', 'complex' in the feature, and the each
>>>>>>>> replication method specifies the syntax of s_s_names.
>>>>>>>> It means that s_s_names could have the number of sync standbys like
>>>>>>>> what current patch does.
>>>>>>>
>>>>>>> What if the application_name of a standby node has the format of an integer?
>>>>>>
>>>>>> Even if the standby has an integer as application_name, we can set
>>>>>> s_s_names like '2,1,2,3'.
>>>>>> The leading '2' is always handled as the number of sync standbys when
>>>>>> s_r_method = 'priority'.
>>>>>
>>>>> Hm. I agree with Fujii-san here, having the number of sync standbys
>>>>> defined in a parameter that should have a list of names is a bit
>>>>> confusing. I'd rather have a separate GUC, which brings us back to one
>>>>> of the first patches that I came up with, and a couple of people,
>>>>> including Josh were not happy with that because this did not support
>>>>> real quorum. Perhaps the final answer would be really to get a set of
>>>>> hooks, and a contrib module making use of that.
>>>>
>>>> Yeah, I agree with having set of hooks, and postgres core has simple
>>>> multi sync replication mechanism like you suggested at first version.
>>>
>>> If there are hooks, I don't think that we should really bother about
>>> having in core anything more complicated than what we have now. The
>>> trick will be to come up with a hook design modular enough to support
>>> the kind of configurations mentioned on this thread. Roughly perhaps a
>>> refactoring of the syncrep code so as it is possible to wait for
>>> multiple targets some of them being optional,, one modular way in
>>> pg_stat_get_wal_senders to represent the status of a node to user, and
>>> another hook to return to decide which are the nodes to wait for. Some
>>> of the nodes being waited for may be based on conditions for quorum
>>> support. That's a hard problem to do that in a flexible enough way.
>>
>> Hm, I think not-nested quorum and priority are not complicated, and we
>> should support at least both or either simple method in core of
>> postgres.
>> More complicated method like using json-style, or dedicated language
>> would be supported by external module.
>
> So what about the following plan?
>
> [first version]
> Add only synchronous_standby_num which specifies the number of standbys
> that the master must wait for before marking sync replication as completed.
> This version supports simple use cases like "I want to have two synchronous
> standbys".
>
> [second version]
> Add synchronous_replication_method: 'prioriry' and 'quorum'. This version
> additionally supports simple quorum commit case like "I want to ensure
> that WAL is replicated synchronously to at least two standbys from five
> ones listed in s_s_names".
>
> Or
>
> Add something like quorum_replication_num and quorum_standby_names, i.e.,
> the master must wait for at least q_r_num standbys from ones listed in
> q_s_names before marking sync replication as completed. Also the master
> must wait for sync replication according to s_s_num and s_s_num.
> That is, this approach separates 'priority' and 'quorum' to each parameters.
> This increases the number of GUC parameters, but ISTM less confusing, and
> it supports a bit complicated case like "there is one local standby and three
> remote standbys, then I want to ensure that WAL is replicated synchronously
> to the local standby and at least two remote one", e.g.,
>
>   s_s_num = 1, s_s_names = 'local'
>   q_s_num = 2, q_s_names = 'remote1, remote2, remote3'
>
> [third version]
> Add the hooks for more complicated sync replication cases.
>
> I'm thinking that the realistic target for 9.6 might be the first one.
>

Thank you for the suggestion.

I agree with the first version, and have attached an updated patch,
modified so that it supports the simple multiple sync replication you
suggested.
(But test cases are not included yet.)

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Mon, Feb 1, 2016 at 9:28 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> So what about the following plan?
>
> [first version]
> Add only synchronous_standby_num which specifies the number of standbys
> that the master must wait for before marking sync replication as completed.
> This version supports simple use cases like "I want to have two synchronous
> standbys".
>
> [second version]
> Add synchronous_replication_method: 'prioriry' and 'quorum'. This version
> additionally supports simple quorum commit case like "I want to ensure
> that WAL is replicated synchronously to at least two standbys from five
> ones listed in s_s_names".
>
> Or
>
> Add something like quorum_replication_num and quorum_standby_names, i.e.,
> the master must wait for at least q_r_num standbys from ones listed in
> q_s_names before marking sync replication as completed. Also the master
> must wait for sync replication according to s_s_num and s_s_num.
> That is, this approach separates 'priority' and 'quorum' to each parameters.
> This increases the number of GUC parameters, but ISTM less confusing, and
> it supports a bit complicated case like "there is one local standby and three
> remote standbys, then I want to ensure that WAL is replicated synchronously
> to the local standby and at least two remote one", e.g.,
>
>   s_s_num = 1, s_s_names = 'local'
>   q_s_num = 2, q_s_names = 'remote1, remote2, remote3'
>
> [third version]
> Add the hooks for more complicated sync replication cases.

-1.  We're wrapping ourselves around the axle here and ending up with
a design that will not let someone say "the local standby and at least
one remote standby" without writing C code.  I understand nobody likes
the mini-language I proposed and nobody likes a JSON configuration
file either.  I also understand that either of those things would
allow ridiculously complicated configurations that nobody will ever
need in the real world.  But I think "one local and one remote" is a
fairly common case and that you shouldn't need a PhD in
PostgreSQLology to configure it.

Also, to be frank, I think we ought to be putting more effort into
another patch in this same area, specifically Thomas Munro's causal
reads patch.  I think a lot of people today are trying to use
synchronous replication to build load-balancing clusters and avoid the
problem where you write some data and then read back stale data from a
standby server.  Of course, our current synchronous replication
facilities make no such guarantees - his patch does, and I think
that's pretty important.  I'm not saying that we shouldn't do this
too, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Feb 3, 2016 at 7:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 1, 2016 at 9:28 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> So what about the following plan?
>>
>> [first version]
>> Add only synchronous_standby_num which specifies the number of standbys
>> that the master must wait for before marking sync replication as completed.
>> This version supports simple use cases like "I want to have two synchronous
>> standbys".
>>
>> [second version]
>> Add synchronous_replication_method: 'prioriry' and 'quorum'. This version
>> additionally supports simple quorum commit case like "I want to ensure
>> that WAL is replicated synchronously to at least two standbys from five
>> ones listed in s_s_names".
>>
>> Or
>>
>> Add something like quorum_replication_num and quorum_standby_names, i.e.,
>> the master must wait for at least q_r_num standbys from ones listed in
>> q_s_names before marking sync replication as completed. Also the master
>> must wait for sync replication according to s_s_num and s_s_num.
>> That is, this approach separates 'priority' and 'quorum' to each parameters.
>> This increases the number of GUC parameters, but ISTM less confusing, and
>> it supports a bit complicated case like "there is one local standby and three
>> remote standbys, then I want to ensure that WAL is replicated synchronously
>> to the local standby and at least two remote one", e.g.,
>>
>>   s_s_num = 1, s_s_names = 'local'
>>   q_s_num = 2, q_s_names = 'remote1, remote2, remote3'
>>
>> [third version]
>> Add the hooks for more complicated sync replication cases.
>
> -1.  We're wrapping ourselves around the axle here and ending up with
> a design that will not let someone say "the local standby and at least
> one remote standby" without writing C code.  I understand nobody likes
> the mini-language I proposed and nobody likes a JSON configuration
> file either.  I also understand that either of those things would
> allow ridiculously complicated configurations that nobody will ever
> need in the real world.  But I think "one local and one remote" is a
> fairly common case and that you shouldn't need a PhD in
> PostgreSQLology to configure it.

So you disagree with only the third version that I proposed, i.e.,
adding some hooks for sync replication? If yes, and you're OK
with the first and second versions, ISTM that we have almost reached
consensus on the direction of the multiple sync replication feature.
The first version can cover the "one local and one remote sync standby" case,
and the second can cover the "one local and at least one of several remote
standbys" case. I'm thinking to focus on the first version now,
and then we can work on the second to support quorum commit.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Tue, Feb 2, 2016 at 8:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> So you disagree with only third version that I proposed, i.e.,
> adding some hooks for sync replication? If yes and you're OK
> with the first and second versions, ISTM that we almost reached
> consensus on the direction of multiple sync replication feature.
> The first version can cover "one local and one remote sync standbys" case,
> and the second can cover "one local and at least one from several remote
> standbys" case. I'm thinking to focus on the first version now,
> and then we can work on the second to support the quorum commit

Well, I think the only hard part of the third problem is deciding on
what syntax to use.  It seems like a waste of time to me to go to a
bunch of trouble to implement #1 and #2 using one syntax and then have
to invent a whole new syntax for #3.  Seriously, this isn't that hard:
it's not a technical problem.  It's just that we've got a bunch of
people who can't agree on what syntax to use.  IMO, you should just
pick something.  You're presumably the committer for this patch, and I
think you should just decide which of the 47,123 things proposed so
far is best and insist on that.  I trust that you will make a good
decision even if it's different than the decision that I would have
made.

Now, if it's easier to implement a subset of that syntax first and
then extend it later, fine.  But it makes no sense to me to implement
the easy cases without having some idea of how you're going to extend
that to the hard cases.  Then you'll just end up with a mishmash.
Pick something that can be extended to handle all of the plausible
cases, whether it's a mini-language or a JSON blob or a
pg_hba.conf-type file or some other crazy thing that you invent, and
just do it and be done with it.  We've wasted far too much time trying
to reach consensus on this: it's time for you to exercise your vast
dictatorial power.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Feb 3, 2016 at 11:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 2, 2016 at 8:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> So you disagree with only third version that I proposed, i.e.,
>> adding some hooks for sync replication? If yes and you're OK
>> with the first and second versions, ISTM that we almost reached
>> consensus on the direction of multiple sync replication feature.
>> The first version can cover "one local and one remote sync standbys" case,
>> and the second can cover "one local and at least one from several remote
>> standbys" case. I'm thinking to focus on the first version now,
>> and then we can work on the second to support the quorum commit
>
> Well, I think the only hard part of the third problem is deciding on
> what syntax to use.  It seems like a waste of time to me to go to a
> bunch of trouble to implement #1 and #2 using one syntax and then have
> to invent a whole new syntax for #3.  Seriously, this isn't that hard:
> it's not a technical problem.  It's just that we've got a bunch of
> people who can't agree on what syntax to use.  IMO, you should just
> pick something.  You're presumably the committer for this patch, and I
> think you should just decide which of the 47,123 things proposed so
> far is best and insist on that.  I trust that you will make a good
> decision even if it's different than the decision that I would have
> made.

If we use one syntax for every case, the possible approaches that we can
choose from are a mini-language, JSON, etc. Since my previous proposal covers
only very simple cases, extra syntax would need to be supported for more
complicated cases. My plan was to add hooks so that developers could choose
their own syntax, but that might confuse users.

Now I'm thinking that a mini-language is the better choice. JSON has some
good points, but its big problem is that the setting value is likely to be
very long. For example, when the master needs to wait for one local standby
and at least one of three remote standbys in the London data center, the
setting value (synchronous_standby_names) would be

  s_s_names = '{"priority":2, "nodes":["local1", {"quorum":1,
  "nodes":["london1", "london2", "london3"]}]}'

OTOH, the equivalent value in the mini-language is simple and not so long:

  s_s_names = '2[local1, 1(london1, london2, london3)]'

This is why I'm now thinking that the mini-language is better. But it's not
easy to implement the mini-language completely. There seem to be many
problems that we need to resolve. For example, please imagine the case where
the master needs to wait for at least one of two standbys "tokyo1", "tokyo2"
in the Tokyo data center. If the Tokyo data center fails, the master needs
to wait for at least one of two standbys "london1", "london2" in the London
data center instead. This case can be configured as follows in the
mini-language.

  s_s_names = '1[1(tokyo1, tokyo2), 1(london1, london2)]'

One problem here is: what pg_stat_replication.sync_state value should be
shown for each standby? Which standbys should be marked as sync? potential?
some other value like quorum? The current design of pg_stat_replication
doesn't fit complicated sync replication cases, so maybe we need to separate
it into several views. It's almost impossible to resolve all of those
problems right away.

My current plan for 9.6 is to support a minimal subset of the mini-language:
the simple syntax "<number>[name, ...]". "<number>" specifies the number of
sync standbys that the master needs to wait for, and "[name, ...]" specifies
the priorities of the listed standbys. This first version supports neither
quorum commit nor nested sync replication configurations like
"<number>[name, <number>[name, ...]]". It just supports a very simple
"1-level" configuration.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Thom Brown
Date:
On 4 February 2016 at 14:34, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Feb 3, 2016 at 11:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Feb 2, 2016 at 8:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> So you disagree with only third version that I proposed, i.e.,
>>> adding some hooks for sync replication? If yes and you're OK
>>> with the first and second versions, ISTM that we almost reached
>>> consensus on the direction of multiple sync replication feature.
>>> The first version can cover "one local and one remote sync standbys" case,
>>> and the second can cover "one local and at least one from several remote
>>> standbys" case. I'm thinking to focus on the first version now,
>>> and then we can work on the second to support the quorum commit
>>
>> Well, I think the only hard part of the third problem is deciding on
>> what syntax to use.  It seems like a waste of time to me to go to a
>> bunch of trouble to implement #1 and #2 using one syntax and then have
>> to invent a whole new syntax for #3.  Seriously, this isn't that hard:
>> it's not a technical problem.  It's just that we've got a bunch of
>> people who can't agree on what syntax to use.  IMO, you should just
>> pick something.  You're presumably the committer for this patch, and I
>> think you should just decide which of the 47,123 things proposed so
>> far is best and insist on that.  I trust that you will make a good
>> decision even if it's different than the decision that I would have
>> made.
>
> If we use one syntax for every cases, possible approaches that we can choose
> are mini-language, json, etc. Since my previous proposal covers only very
> simple cases, extra syntax needs to be supported for more complicated cases.
> My plan was to add the hooks so that the developers can choose their own
> syntax. But which might confuse users.
>
> Now I'm thinking that mini-language is better choice. A json has some good
> points, but its big problem is that the setting value is likely to be very long.
> For example, when the master needs to wait for one local standby and
> at least one from three remote standbys in London data center, the setting
> value (synchronous_standby_names) would be
>
>   s_s_names = '{"priority":2, "nodes":["local1", {"quorum":1,
> "nodes":["london1", "london2", "london3"]}]}'
>
> OTOH, the value with mini-language is simple and not so long as follows.
>
>   s_s_names = '2[local1, 1(london1, london2, london3)]'
>
> This is why I'm now thinking that mini-language is better. But it's not easy
> to completely implement mini-language. There seems to be many problems
> that we need to resolve. For example, please imagine the case where
> the master needs to wait for at least one from two standbys "tokyo1", "tokyo2"
> in Tokyo data center. If Tokyo data center fails, the master needs to
> wait for at least one from two standbys "london1", "london2" in London
> data center, instead. This case can be configured as follows in mini-language.
>
>   s_s_names = '1[1(tokyo1, tokyo2), 1(london1, london2)]'
>
> One problem here is; what pg_stat_replication.sync_state value should be
> shown for each standbys? Which standby should be marked as sync? potential?
> any other value like quorum? The current design of pg_stat_replication
> doesn't fit complicated sync replication cases, so maybe we need to separate
> it into several views. It's almost impossible to solve all of those problems now.
>
> My current plan for 9.6 is to support the minimal subset of mini-language;
> simple syntax of "<number>[name, ...]". "<number>" specifies the number of
> sync standbys that the master needs to wait for. "[name, ...]" specifies
> the priorities of the listed standbys. This first version supports neither
> quorum commit nor nested sync replication configuration like
> "<number>[name, <number>[name, ...]]". It just supports very simple
> "1-level" configuration.

Whatever the solution, I really don't like the idea of changing the
definition of s_s_names based on the value of another GUC, mainly
because it seems hacky, but also because the name of the GUC stops
making sense.

Thom



Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Thu, Feb 4, 2016 at 9:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Now I'm thinking that mini-language is better choice. A json has some good
> points, but its big problem is that the setting value is likely to be very long.
> For example, when the master needs to wait for one local standby and
> at least one from three remote standbys in London data center, the setting
> value (synchronous_standby_names) would be
>
>   s_s_names = '{"priority":2, "nodes":["local1", {"quorum":1,
> "nodes":["london1", "london2", "london3"]}]}'
>
> OTOH, the value with mini-language is simple and not so long as follows.
>
>   s_s_names = '2[local1, 1(london1, london2, london3)]'

Yeah, that was my thought also.  Another idea which was suggested is
to create a completely new configuration file for this.  Most people
would only have simple stuff in there, of course, but then you could
have the information spread across multiple lines.

I don't in the end care very much about how we solve this problem.
But I'm glad you agree that whatever we do to solve the simple problem
should be a logical subset of what the full solution will eventually
look like, not a completely different design.  I think that's
important.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Thu, Feb 4, 2016 at 7:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't in the end care very much about how we solve this problem.
> But I'm glad you agree that whatever we do to solve the simple problem
> should be a logical subset of what the full solution will eventually
> look like, not a completely different design.  I think that's
> important.

Yes, please let's use the custom language, and let's not care of not
more than 1 level of nesting so as it is possible to represent
pg_stat_replication in a simple way for the user.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Yes, please let's use the custom language, and let's not care of not
> more than 1 level of nesting so as it is possible to represent
> pg_stat_replication in a simple way for the user.

"not" is used twice in this sentence in a way that renders me not able
to be sure that I'm not understanding it not properly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Yes, please let's use the custom language, and let's not care of not
>> more than 1 level of nesting so as it is possible to represent
>> pg_stat_replication in a simple way for the user.
>
> "not" is used twice in this sentence in a way that renders me not able
> to be sure that I'm not understanding it not properly.

4 times here. Score beaten.

Sorry. Perhaps I am tired... I was just wondering if it would be fine
to only support configurations up to one level of nested objects, like
that:
2[node1, node2, node3]
node1, 2[node2, node3], node3
In short, we could restrict things so as we cannot define a group of
nodes within an existing group.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Thu, Feb 4, 2016 at 2:49 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Yes, please let's use the custom language, and let's not care of not
>>> more than 1 level of nesting so as it is possible to represent
>>> pg_stat_replication in a simple way for the user.
>>
>> "not" is used twice in this sentence in a way that renders me not able
>> to be sure that I'm not understanding it not properly.
>
> 4 times here. Score beaten.
>
> Sorry. Perhaps I am tired... I was just wondering if it would be fine
> to only support configurations up to one level of nested objects, like
> that:
> 2[node1, node2, node3]
> node1, 2[node2, node3], node3
> In short, we could restrict things so as we cannot define a group of
> nodes within an existing group.

I see.  Such a restriction doesn't seem likely to me to prevent people
from doing anything actually useful.  But I don't know that it buys
very much either.  It's often not very much simpler to handle 2 levels
than n levels.  However, I ain't writing the code so...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Yes, please let's use the custom language, and let's not care of not
>>> more than 1 level of nesting so as it is possible to represent
>>> pg_stat_replication in a simple way for the user.
>>
>> "not" is used twice in this sentence in a way that renders me not able
>> to be sure that I'm not understanding it not properly.
>
> 4 times here. Score beaten.
>
> Sorry. Perhaps I am tired... I was just wondering if it would be fine
> to only support configurations up to one level of nested objects, like
> that:
> 2[node1, node2, node3]
> node1, 2[node2, node3], node3
> In short, we could restrict things so as we cannot define a group of
> nodes within an existing group.

No, actually, that's stupid. Having up to two nested levels makes more
sense, a quite common case for this feature being something like that:
2{node1,[node2,node3]}
In short, sync confirmation is waited from node1 and (node2 or node3).

Flattening groups of nodes with a new catalog will be necessary to
ease the view of this data to users:
- group name?
- array of members with nodes/groups
- group type: quorum or priority
- number of items to wait for in this group
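
As a rough illustration, the example '2{node1,[node2,node3]}' above could
be flattened into something like this ('main' and 'group1' are made-up
names; nothing here is implemented):

     name   | type     | wait_num | members
    --------+----------+----------+----------------
     main   | priority |        2 | {node1,group1}
     group1 | quorum   |        1 | {node2,node3}
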
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Thu, 4 Feb 2016 23:06:45 +0300, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqTMV5sZkemGf=SWMyA8QpzV2VW9bRrysXtKzuSVk99ocw@mail.gmail.com>
> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > Sorry. Perhaps I am tired... I was just wondering if it would be fine
> > to only support configurations up to one level of nested objects, like
> > that:
> > 2[node1, node2, node3]
> > node1, 2[node2, node3], node3
> > In short, we could restrict things so as we cannot define a group of
> > nodes within an existing group.
> 
> No, actually, that's stupid. Having up to two nested levels makes more
> sense, a quite common case for this feature being something like that:
> 2{node1,[node2,node3]}
> In short, sync confirmation is waited from node1 and (node2 or node3).
> 
> Flattening groups of nodes with a new catalog will be necessary to
> ease the view of this data to users:
> - group name?
> - array of members with nodes/groups
> - group type: quorum or priority
> - number of items to wait for in this group

Though I personally love the format, I don't fully grasp what the
upcoming consensus is, and the discussion looks to be looping
back to the past, so please forgive me for confirming the current
discussion status.


We are coming to agree on a configuration manner, including syntax,
that is compatible with possible future uses; I think this is correct.

(Though I haven't seen it explicitly written upthread,) we regard it
as important that previous settings of s_s_names stay valid as a
1-level priority method. Is this correct?

The most promising syntax is now considered to be n-level
quorum/priority nesting, as in Michael's proposal above. Correct?

But aiming at 9.6, we are to support a (1- or 2-)level quorum *or*
priority setup with a subset of that syntax. I don't think this
is fully agreed yet.

We are not considering an extension or plugin mechanism as an
additional configuration method for this feature, at least as of
9.6. Correct?

I proposed s_s_method for backward compatibility, but there
is a voice that such a way of changing the semantics of s_s_names
is confusing. I can sympathize with that. If so, do we have
another variable (named standbys_definition or the like?) which
is to be set as an alternative to s_s_names? Or do we take another way?


Sorry for the maybe-noise in advance.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>> Yes, please let's use the custom language, and let's not care of not
>>>> more than 1 level of nesting so as it is possible to represent
>>>> pg_stat_replication in a simple way for the user.
>>>
>>> "not" is used twice in this sentence in a way that renders me not able
>>> to be sure that I'm not understanding it not properly.
>>
>> 4 times here. Score beaten.
>>
>> Sorry. Perhaps I am tired... I was just wondering if it would be fine
>> to only support configurations up to one level of nested objects, like
>> that:
>> 2[node1, node2, node3]
>> node1, 2[node2, node3], node3
>> In short, we could restrict things so as we cannot define a group of
>> nodes within an existing group.
>
> No, actually, that's stupid. Having up to two nested levels makes more
> sense, a quite common case for this feature being something like that:
> 2{node1,[node2,node3]}
> In short, sync confirmation is waited from node1 and (node2 or node3).
>
> Flattening groups of nodes with a new catalog will be necessary to
> ease the view of this data to users:
> - group name?
> - array of members with nodes/groups
> - group type: quorum or priority
> - number of items to wait for in this group

So, here are some thoughts to make that more user-friendly. I think
that the critical issue here is to properly flatten the meta data in
the custom language and represent it properly in a new catalog,
without messing up too much with the existing pg_stat_replication that
people are now used to for 5 releases since 9.0. So, I would think
that we will need to have a new catalog, say
pg_stat_replication_groups with the following things:
- One line of this catalog represents the status of a group or of a single node.
- The status of a node/group is either sync or potential, if a
node/group is specified more than once, it may be possible that it
would be sync and potential depending on where it is defined, in which
case setting its status to 'sync' has the most sense. If it is in sync
state I guess.
- Move sync_priority and sync_state, actually an equivalent from
pg_stat_replication into this new catalog, because those represent the
status of a node or group of nodes.
- group name, and by that I think that we had perhaps better make
mandatory the need to append a name with a quorum or priority group.
The group at the highest level is forcibly named as 'top', 'main', or
whatever if not directly specified by the user. If the entry is
directly a node, use the application_name.
- Type of group, quorum or priority
- Elements in this group, an element can be a group name or a node
name, aka application_name. If group is of type priority, the elements
are listed in increasing order. So the elements with lower priority
get first, etc. We could have one column listing explicitly a list of
integers that map with the elements of a group but it does not seem
worth it, what users would like to know is what are the nodes that are
prioritized. This covers the former 'priority' field of
pg_stat_replication.

We may have a good idea of how to define a custom language, still we
are going to need to design a clean interface at catalog level more or
less close to what is written here. If we can get a clean interface,
the custom language implemented, and TAP tests that take advantage of
this user interface to check the node/group statuses, I guess that we
would be in good shape for this patch.
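
For instance, a TAP test could then assert on a plain query against such a
catalog, along these lines (the catalog and column names are just the
hypothetical ones sketched above):

    SELECT name, group_type, wait_num, members
      FROM pg_stat_replication_groups
     ORDER BY name;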

Anyway that's not a small project, and perhaps I am over-complicating
the whole thing.

Thoughts?
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Joshua Berkus
Date:
> We may have a good idea of how to define a custom language, still we
> are going to need to design a clean interface at catalog level more or
> less close to what is written here. If we can get a clean interface,
> the custom language implemented, and TAP tests that take advantage of
> this user interface to check the node/group statuses, I guess that we
> would be in good shape for this patch.
> 
> Anyway that's not a small project, and perhaps I am over-complicating
> the whole thing.

Yes.  The more I look at this, the worse the idea of custom syntax looks.  Yes, I realize there are drawbacks to using JSON, but this is worse.

Further, there's a lot of horse-cart inversion here.  This proposal involves letting the syntax for sync_list configuration determine the feature set for N-sync.  That's backwards; we should decide the total list of features we want to support, and then adopt a syntax which will make it possible to have them.

-- 
Josh Berkus
Red Hat OSAS
(opinions are my own)



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
>>>> <michael.paquier@gmail.com> wrote:
>>>>> Yes, please let's use the custom language, and let's not care of not
>>>>> more than 1 level of nesting so as it is possible to represent
>>>>> pg_stat_replication in a simple way for the user.
>>>>
>>>> "not" is used twice in this sentence in a way that renders me not able
>>>> to be sure that I'm not understanding it not properly.
>>>
>>> 4 times here. Score beaten.
>>>
>>> Sorry. Perhaps I am tired... I was just wondering if it would be fine
>>> to only support configurations up to one level of nested objects, like
>>> that:
>>> 2[node1, node2, node3]
>>> node1, 2[node2, node3], node3
>>> In short, we could restrict things so as we cannot define a group of
>>> nodes within an existing group.
>>
>> No, actually, that's stupid. Having up to two nested levels makes more
>> sense, a quite common case for this feature being something like that:
>> 2{node1,[node2,node3]}
>> In short, sync confirmation is waited from node1 and (node2 or node3).
>>
>> Flattening groups of nodes with a new catalog will be necessary to
>> ease the view of this data to users:
>> - group name?
>> - array of members with nodes/groups
>> - group type: quorum or priority
>> - number of items to wait for in this group
>
> So, here are some thoughts to make that more user-friendly. I think
> that the critical issue here is to properly flatten the meta data in
> the custom language and represent it properly in a new catalog,
> without messing up too much with the existing pg_stat_replication that
> people are now used to for 5 releases since 9.0. So, I would think
> that we will need to have a new catalog, say
> pg_stat_replication_groups with the following things:
> - One line of this catalog represents the status of a group or of a single node.
> - The status of a node/group is either sync or potential, if a
> node/group is specified more than once, it may be possible that it
> would be sync and potential depending on where it is defined, in which
> case setting its status to 'sync' has the most sense. If it is in sync
> state I guess.
> - Move sync_priority and sync_state, actually an equivalent from
> pg_stat_replication into this new catalog, because those represent the
> status of a node or group of nodes.
> - group name, and by that I think that we had perhaps better make
> mandatory the need to append a name with a quorum or priority group.
> The group at the highest level is forcibly named as 'top', 'main', or
> whatever if not directly specified by the user. If the entry is
> directly a node, use the application_name.
> - Type of group, quorum or priority
> - Elements in this group, an element can be a group name or a node
> name, aka application_name. If group is of type priority, the elements
> are listed in increasing order. So the elements with lower priority
> get first, etc. We could have one column listing explicitly a list of
> integers that map with the elements of a group but it does not seem
> worth it, what users would like to know is what are the nodes that are
> prioritized. This covers the former 'priority' field of
> pg_stat_replication.
>
> We may have a good idea of how to define a custom language, still we
> are going to need to design a clean interface at catalog level more or
> less close to what is written here. If we can get a clean interface,
> the custom language implemented, and TAP tests that take advantage of
> this user interface to check the node/group statuses, I guess that we
> would be in good shape for this patch.
>
> Anyway that's not a small project, and perhaps I am over-complicating
> the whole thing.
>

I agree with adding a new system catalog so that users can easily check
replication status. And a group name will be needed for this.
What about adding the group name with ":" immediately after a set of
standbys, as follows?

2[local, 2[london1, london2, london3]:london, (tokyo1, tokyo2):tokyo]

Also, regarding sync replication under this configuration, the view
I'm thinking of has the following definition.

=# \d pg_synchronous_replication
      Column       |  Type   | Modifiers
-------------------+---------+-----------
 name              | text    |
 sync_type         | text    |
 wait_num          | integer |
 sync_priority     | integer |
 sync_state        | text    |
 member            | text[]  |
 level             | integer |
 write_location    | pg_lsn  |
 flush_location    | pg_lsn  |
 apply_location    | pg_lsn  |

- "name" : node name or group name, or "main" meaning top level node.
- "sync_type" : 'priority' or 'quorum' for group node, otherwise NULL.
- "wait_num" : number of nodes/groups to wait for in this group.
- "sync_priority" : priority of node/group in this group. "main" node has "0".                         - the standby is
inquorum group always has
 
priority 1.                         - the standby is in priority group has
priority according to definition order.
- "sync_state" : 'sync' or 'potential' or 'quorum'.                        - the standby is in quorum group is always
'quorum'.                       - the standby is in priority group is 'sync'
 
/ 'potential'.
- "member" : array of members for group node, otherwise NULL.
- "level" : nested level. "main" node is level 0.
- "write/flush/apply_location" : group/node calculated LSN according
to configuration.

When sync replication is set as above, the new system view shows,

=# select * from pg_stat_replication_group;
  name   | sync_type | wait_num | sync_priority | sync_state |          member           | level | write_location | flush_location | apply_location
---------+-----------+----------+---------------+------------+---------------------------+-------+----------------+----------------+----------------
 main    | priority  |        2 |             0 | sync       | {local,london,tokyo}      |     0 |                |                |
 local   |           |        0 |             1 | sync       |                           |     1 |                |                |
 london  | quorum    |        2 |             2 | potential  | {london1,london2,london3} |     1 |                |                |
 london1 |           |        0 |             1 | potential  |                           |     2 |                |                |
 london2 |           |        0 |             2 | potential  |                           |     2 |                |                |
 london3 |           |        0 |             3 | potential  |                           |     2 |                |                |
 tokyo   | quorum    |        1 |             3 | potential  | {tokyo1,tokyo2}           |     1 |                |                |
 tokyo1  |           |        0 |             1 | quorum     |                           |     2 |                |                |
 tokyo2  |           |        0 |             1 | quorum     |                           |     2 |                |                |
(9 rows)

Thoughts?

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Fri, Feb 5, 2016 at 12:19 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> I agree with adding a new system catalog so that users can easily check
> replication status. And a group name will be needed for this.
> What about adding the group name with ":" immediately after a set of
> standbys, as follows?

This way is fine for me.

> 2[local, 2[london1, london2, london3]:london, (tokyo1, tokyo2):tokyo]
>
> Also, regarding sync replication under this configuration, the view
> I'm thinking of has the following definition.
>
> =# \d pg_synchronous_replication
>      Column          |  Type   | Modifiers
> -------------------------+-----------+-----------
>  name                | text      |
>  sync_type         | text      |
>  wait_num          | integer  |
>  sync_priority     | integer   |
>  sync_state        | text      |
>  member            | text[]     |
>  level                 | integer  |
>  write_location    | pg_lsn  |
>  flush_location    | pg_lsn  |
>  apply_location   | pg_lsn   |
>
> - "name" : node name or group name, or "main" meaning top level node.

Check.

> - "sync_type" : 'priority' or 'quorum' for group node, otherwise NULL.

That would be one or the other.

> - "wait_num" : number of nodes/groups to wait for in this group.

Check. This is taken directly from the meta data.

> - "sync_priority" : priority of node/group in this group. "main" node has "0".
>                           - the standby is in quorum group always has
> priority 1.
>                           - the standby is in priority group has
> priority according to definition order.

This is a bit confusing if the same node or group is in multiple
groups. My previous suggestion was to list the elements of the group
in increasing order of priority. That's an important point.

> - "sync_state" : 'sync' or 'potential' or 'quorum'.
>                          - the standby is in quorum group is always 'quorum'.
>                          - the standby is in priority group is 'sync'
> / 'potential'.

potential and quorum are the same thing, no? The only difference is
based on the group type here.

> - "member" : array of members for group node, otherwise NULL.

This can be NULL only when the entry is a node.

> - "level" : nested level. "main" node is level 0.

Not sure this one is necessary.

> - "write/flush/apply_location" : group/node calculated LSN according
> to configuration.

This does not need to be part of this catalog; it is a representation
of data that belongs to the WAL sender.
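
For reference, those per-walsender LSNs are already visible today, e.g.:

    SELECT application_name, write_location, flush_location, replay_location
      FROM pg_stat_replication;
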
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
kharagesuraj
Date:
Hello,

I have tested the v7 patch, but I think you forgot to remove some
debug points from src/backend/replication/syncrep.c:

+    for (i = 0; i < num_sync; i++)
+    {
+        elog(WARNING, "sync_standbys[%d] = %d", i, sync_standbys[i]);
+    }
+    elog(WARNING, "num_sync = %d, s_s_num = %d", num_sync,
+         synchronous_standby_num);

Please correct my understanding if I am wrong.

Regards
Suraj Kharage 








Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
>>>> <michael.paquier@gmail.com> wrote:
>>>>> Yes, please let's use the custom language, and let's not care of not
>>>>> more than 1 level of nesting so as it is possible to represent
>>>>> pg_stat_replication in a simple way for the user.
>>>>
>>>> "not" is used twice in this sentence in a way that renders me not able
>>>> to be sure that I'm not understanding it not properly.
>>>
>>> 4 times here. Score beaten.
>>>
>>> Sorry. Perhaps I am tired... I was just wondering if it would be fine
>>> to only support configurations up to one level of nested objects, like
>>> that:
>>> 2[node1, node2, node3]
>>> node1, 2[node2, node3], node3
>>> In short, we could restrict things so as we cannot define a group of
>>> nodes within an existing group.
>>
>> No, actually, that's stupid. Having up to two nested levels makes more
>> sense, a quite common case for this feature being something like that:
>> 2{node1,[node2,node3]}
>> In short, sync confirmation is waited from node1 and (node2 or node3).
>>
>> Flattening groups of nodes with a new catalog will be necessary to
>> ease the view of this data to users:
>> - group name?
>> - array of members with nodes/groups
>> - group type: quorum or priority
>> - number of items to wait for in this group
>
> So, here are some thoughts to make that more user-friendly. I think
> that the critical issue here is to properly flatten the meta data in
> the custom language and represent it properly in a new catalog,
> without messing up too much with the existing pg_stat_replication that
> people are now used to for 5 releases since 9.0. So, I would think
> that we will need to have a new catalog, say
> pg_stat_replication_groups with the following things:
> - One line of this catalog represents the status of a group or of a single node.
> - The status of a node/group is either sync or potential, if a
> node/group is specified more than once, it may be possible that it
> would be sync and potential depending on where it is defined, in which
> case setting its status to 'sync' has the most sense. If it is in sync
> state I guess.
> - Move sync_priority and sync_state, actually an equivalent from
> pg_stat_replication into this new catalog, because those represent the
> status of a node or group of nodes.
> - group name, and by that I think that we had perhaps better make
> mandatory the need to append a name with a quorum or priority group.
> The group at the highest level is forcibly named as 'top', 'main', or
> whatever if not directly specified by the user. If the entry is
> directly a node, use the application_name.
> - Type of group, quorum or priority
> - Elements in this group, an element can be a group name or a node
> name, aka application_name. If group is of type priority, the elements
> are listed in increasing order. So the elements with lower priority
> get first, etc. We could have one column listing explicitly a list of
> integers that map with the elements of a group but it does not seem
> worth it, what users would like to know is what are the nodes that are
> prioritized. This covers the former 'priority' field of
> pg_stat_replication.
>
> We may have a good idea of how to define a custom language, still we
> are going to need to design a clean interface at catalog level more or
> less close to what is written here. If we can get a clean interface,
> the custom language implemented, and TAP tests that take advantage of
> this user interface to check the node/group statuses, I guess that we
> would be in good shape for this patch.
>
> Anyway that's not a small project, and perhaps I am over-complicating
> the whole thing.
>
> Thoughts?

I agree that we would need something like such new view in the future,
however it seems too late to work on that for 9.6 unfortunately.
There is only one CommitFest left. Let's focus on very simple case, i.e.,
1-level priority list, now, then we can extend it to cover other cases.

If we can commit the simple version too early and there is enough
time before the date of feature freeze, of course I'm happy to review
the extended version like you proposed, for 9.6.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
kharagesuraj
Date:

Hello,

>> I agree with first version, and attached the updated patch which are
>> modified so that it supports simple multiple sync replication you
>> suggested.
>> (but test cases are not included yet.)

I have tried some basic built-in test cases for multisync rep.

I have created one patch over Michael's patch
(http://www.postgresql.org/message-id/CAB7nPqTEqou=xrYrGSgA13QW1xxsSD6tFHz-Sm_J3EgDvSOCHw@mail.gmail.com).

Still it is in progress.

Please have a look, correct me if I am wrong, and suggest remaining test cases.

Regards

Suraj Kharage



Attachment: recovery_test_suite_with_multisync.patch (36K)



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:


On Tue, Feb 9, 2016 at 12:16 PM, kharagesuraj <suraj.kharage@nttdata.com> wrote:

Hello,

>> I agree with first version, and attached the updated patch which are
>> modified so that it supports simple multiple sync replication you
>> suggested.
>> (but test cases are not included yet.)

I have tried some basic built-in test cases for multisync rep.

I have created one patch over Michael's patch
(http://www.postgresql.org/message-id/CAB7nPqTEqou=xrYrGSgA13QW1xxsSD6tFHz-Sm_J3EgDvSOCHw@mail.gmail.com).

Still it is in progress.

Please have a look, correct me if I am wrong, and suggest remaining test cases.


So the interesting part of this patch is 006_sync_rep.pl. I think that you had better build something on top of my patch as a separate patch. This would make things clearer.

+my $result = $node_master->psql('postgres', "select application_name, sync_state from pg_stat_replication;");
+print "$result \n";
+is($result, "standby_1|sync\nstandby_2|sync\nstandby_3|potential", 'checked for sync standbys state initially');
Now regarding the test, you visibly got the idea, though I think that we'd want to tweak the postgresql.conf parameters a bit and re-run those queries a couple of times; that's cheaper than re-creating new cluster nodes all the time. So just create a base cluster, then switch s_s_names a bit and query pg_stat_replication (you are already doing the latter).

Also, please attach patches directly to your emails. When something is uploaded on nabble it is located only there and not within postgresql.org, which would be annoying if nabble disappears at some point. You would also want to use an email client directly and interact with the community mailing lists that way instead of going through nabble's forum-like interface (never used it, not really willing to use it, but I guess it is similar to that).

I am attaching what you posted on this email for the archive's sake.
--
Michael
Attachment

Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Tue, 9 Feb 2016 00:48:57 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwHnTKmd90Vu19Swu0C+2mnWxvAH=1FE=-xUbo3s94pRRg@mail.gmail.com>
> On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> >> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier
> >> <michael.paquier@gmail.com> wrote:
> >>> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >>>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
> >>>> <michael.paquier@gmail.com> wrote:
> >>>>> Yes, please let's use the custom language, and let's not care of not
> >>>>> more than 1 level of nesting so as it is possible to represent
> >>>>> pg_stat_replication in a simple way for the user.
> >>>>
> >>>> "not" is used twice in this sentence in a way that renders me not able
> >>>> to be sure that I'm not understanding it not properly.
> >>>
> >>> 4 times here. Score beaten.
> >>>
> >>> Sorry. Perhaps I am tired... I was just wondering if it would be fine
> >>> to only support configurations up to one level of nested objects, like
> >>> that:
> >>> 2[node1, node2, node3]
> >>> node1, 2[node2, node3], node3
> >>> In short, we could restrict things so as we cannot define a group of
> >>> nodes within an existing group.
> >>
> >> No, actually, that's stupid. Having up to two nested levels makes more
> >> sense, a quite common case for this feature being something like that:
> >> 2{node1,[node2,node3]}
> >> In short, sync confirmation is waited from node1 and (node2 or node3).
> >>
> >> Flattening groups of nodes with a new catalog will be necessary to
> >> ease the view of this data to users:
> >> - group name?
> >> - array of members with nodes/groups
> >> - group type: quorum or priority
> >> - number of items to wait for in this group
> >
> > So, here are some thoughts to make that more user-friendly. I think
> > that the critical issue here is to properly flatten the meta data in
> > the custom language and represent it properly in a new catalog,
> > without messing up too much with the existing pg_stat_replication that
> > people are now used to for 5 releases since 9.0. So, I would think
> > that we will need to have a new catalog, say
> > pg_stat_replication_groups with the following things:
> > - One line of this catalog represents the status of a group or of a single node.
> > - The status of a node/group is either sync or potential, if a
> > node/group is specified more than once, it may be possible that it
> > would be sync and potential depending on where it is defined, in which
> > case setting its status to 'sync' has the most sense. If it is in sync
> > state I guess.
> > - Move sync_priority and sync_state, actually an equivalent from
> > pg_stat_replication into this new catalog, because those represent the
> > status of a node or group of nodes.
> > - group name, and by that I think that we had perhaps better make
> > mandatory the need to append a name with a quorum or priority group.
> > The group at the highest level is forcibly named as 'top', 'main', or
> > whatever if not directly specified by the user. If the entry is
> > directly a node, use the application_name.
> > - Type of group, quorum or priority
> > - Elements in this group, an element can be a group name or a node
> > name, aka application_name. If group is of type priority, the elements
> > are listed in increasing order. So the elements with lower priority
> > get first, etc. We could have one column listing explicitly a list of
> > integers that map with the elements of a group but it does not seem
> > worth it, what users would like to know is what are the nodes that are
> > prioritized. This covers the former 'priority' field of
> > pg_stat_replication.
> >
> > We may have a good idea of how to define a custom language, still we
> > are going to need to design a clean interface at catalog level more or
> > less close to what is written here. If we can get a clean interface,
> > the custom language implemented, and TAP tests that take advantage of
> > this user interface to check the node/group statuses, I guess that we
> > would be in good shape for this patch.
> >
> > Anyway that's not a small project, and perhaps I am over-complicating
> > the whole thing.
> >
> > Thoughts?
> 
> I agree that we would need something like such new view in the future,
> however it seems too late to work on that for 9.6 unfortunately.
> There is only one CommitFest left. Let's focus on very simple case, i.e.,
> 1-level priority list, now, then we can extend it to cover other cases.
> 
> If we can commit the simple version too early and there is enough
> time before the date of feature freeze, of course I'm happy to review
> the extended version like you proposed, for 9.6.

I agree with Fujii-san. There would be many convenient gadgets
around this and they are completely welcome, but having the
fundamental functionality in 9.6 seems to be far more beneficial
for most of us.

As long as the extensible syntax is fixed, the internal structures can
be gradually extended along with syntactical enhancements. More than
three levels of definition, and group names, are syntactically
reserved and are allowed to be absent for now. JSON could be added,
but it is too complicated for simple cases.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Amit Langote
Date:
Hi Suraj,

On 2016/02/09 12:16, kharagesuraj wrote:
> Hello,
> 
> 
>>> I agree with first version, and attached the updated patch which are
>>> modified so that it supports simple multiple sync replication you
>>> suggested.
>>> (but test cases are not included yet.)
> 
> I have tried some basic built-in test cases for multisync rep.
> I have created one patch over Michael's patch
> (http://www.postgresql.org/message-id/CAB7nPqTEqou=xrYrGSgA13QW1xxsSD6tFHz-Sm_J3EgDvSOCHw@mail.gmail.com).
> Still it is in progress.
> Please have a look, correct me if I am wrong, and suggest remaining test cases.
> 
> recovery_test_suite_with_multisync.patch (36K)
<http://postgresql.nabble.com/attachment/5886503/0/recovery_test_suite_with_multisync.patch>

Thanks for creating the patch. Sorry to nitpick, but as has been brought up
before, it's better to send patches as email attachments (that is, not as
links to external sites).

Also, it would be helpful if your patch were submitted as a diff on top
of Michael's patch. That is, include only the stuff specific to testing
the multiple sync feature and let the rest be taken care of by Michael's
base patch.

Thanks,
Amit





Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Tue, Feb 9, 2016 at 1:16 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Tue, 9 Feb 2016 00:48:57 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwHnTKmd90Vu19Swu0C+2mnWxvAH=1FE=-xUbo3s94pRRg@mail.gmail.com>
>> On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> > On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier
>> > <michael.paquier@gmail.com> wrote:
>> >> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier
>> >> <michael.paquier@gmail.com> wrote:
>> >>> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> >>>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
>> >>>> <michael.paquier@gmail.com> wrote:
>> >>>>> Yes, please let's use the custom language, and let's not care of not
>> >>>>> more than 1 level of nesting so as it is possible to represent
>> >>>>> pg_stat_replication in a simple way for the user.
>> >>>>
>> >>>> "not" is used twice in this sentence in a way that renders me not able
>> >>>> to be sure that I'm not understanding it not properly.
>> >>>
>> >>> 4 times here. Score beaten.
>> >>>
>> >>> Sorry. Perhaps I am tired... I was just wondering if it would be fine
>> >>> to only support configurations up to one level of nested objects, like
>> >>> that:
>> >>> 2[node1, node2, node3]
>> >>> node1, 2[node2, node3], node3
>> >>> In short, we could restrict things so as we cannot define a group of
>> >>> nodes within an existing group.
>> >>
>> >> No, actually, that's stupid. Having up to two nested levels makes more
>> >> sense, a quite common case for this feature being something like that:
>> >> 2{node1,[node2,node3]}
>> >> In short, sync confirmation is waited from node1 and (node2 or node3).
>> >>
>> >> Flattening groups of nodes with a new catalog will be necessary to
>> >> ease the view of this data to users:
>> >> - group name?
>> >> - array of members with nodes/groups
>> >> - group type: quorum or priority
>> >> - number of items to wait for in this group
>> >
>> > So, here are some thoughts to make that more user-friendly. I think
>> > that the critical issue here is to properly flatten the meta data in
>> > the custom language and represent it properly in a new catalog,
>> > without messing up too much with the existing pg_stat_replication that
>> > people are now used to for 5 releases since 9.0. So, I would think
>> > that we will need to have a new catalog, say
>> > pg_stat_replication_groups with the following things:
>> > - One line of this catalog represents the status of a group or of a single node.
>> > - The status of a node/group is either sync or potential, if a
>> > node/group is specified more than once, it may be possible that it
>> > would be sync and potential depending on where it is defined, in which
>> > case setting its status to 'sync' has the most sense. If it is in sync
>> > state I guess.
>> > - Move sync_priority and sync_state, actually an equivalent from
>> > pg_stat_replication into this new catalog, because those represent the
>> > status of a node or group of nodes.
>> > - group name, and by that I think that we had perhaps better make
>> > mandatory the need to append a name with a quorum or priority group.
>> > The group at the highest level is forcibly named as 'top', 'main', or
>> > whatever if not directly specified by the user. If the entry is
>> > directly a node, use the application_name.
>> > - Type of group, quorum or priority
>> > - Elements in this group, an element can be a group name or a node
>> > name, aka application_name. If group is of type priority, the elements
>> > are listed in increasing order. So the elements with lower priority
>> > get first, etc. We could have one column listing explicitly a list of
>> > integers that map with the elements of a group but it does not seem
>> > worth it, what users would like to know is what are the nodes that are
>> > prioritized. This covers the former 'priority' field of
>> > pg_stat_replication.
>> >
>> > We may have a good idea of how to define a custom language, still we
>> > are going to need to design a clean interface at catalog level more or
>> > less close to what is written here. If we can get a clean interface,
>> > the custom language implemented, and TAP tests that take advantage of
>> > this user interface to check the node/group statuses, I guess that we
>> > would be in good shape for this patch.
>> >
>> > Anyway that's not a small project, and perhaps I am over-complicating
>> > the whole thing.
>> >
>> > Thoughts?
>>
>> I agree that we would need something like such new view in the future,
>> however it seems too late to work on that for 9.6 unfortunately.
>> There is only one CommitFest left. Let's focus on very simple case, i.e.,
>> 1-level priority list, now, then we can extend it to cover other cases.
>>
>> If we can commit the simple version too early and there is enough
>> time before the date of feature freeze, of course I'm happy to review
>> the extended version like you proposed, for 9.6.
>
> I agree with Fujii-san. There would be many convenient gadgets
> around this and they are completely welcome, but having the
> fundamental functionality in 9.6 seems to be far more beneficial
> for most of us.

Hm. Rushing features in because we need them now is not really
community-like. I'd rather not have us taking decisions like that
knowing that we may pay a certain price in the long-term, while it
pays in the short term, aka the 9.6 release. However, having a base in
place for the mini-language would give enough room for future
improvements, so I am fine with having only 1-level of nesting, with
{} and [] supported. This can as well be simply represented within
pg_stat_replication because we'd have basically only one group of
nodes for now (if I got the idea correctly), and the status of each
entry in pg_stat_replication would just need to reflect either
potential or sync, which is something that users are now used to.

So, if I got the vibe correctly, we would basically just allow that in
a first shot:
N{node_list}, to define a priority group
N[node_list], to define a quorum group
There can be only one group, and elements in a node list cannot be a
group. No need for group names either.
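
Under this grammar, the two permitted forms could be set like this (s1, s2
and s3 are hypothetical standby names):

    -- a priority group: wait for the two highest-priority standbys
    ALTER SYSTEM SET synchronous_standby_names = '2{s1, s2, s3}';
    -- or a quorum group: wait for any two of the three
    ALTER SYSTEM SET synchronous_standby_names = '2[s1, s2, s3]';
    SELECT pg_reload_conf();
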
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Feb 3, 2016 at 7:33 AM, Robert Haas wrote:
> Also, to be frank, I think we ought to be putting more effort into
> another patch in this same area, specifically Thomas Munro's causal
> reads patch.  I think a lot of people today are trying to use
> synchronous replication to build load-balancing clusters and avoid the
> problem where you write some data and then read back stale data from a
> standby server.  Of course, our current synchronous replication
> facilities make no such guarantees - his patch does, and I think
> that's pretty important.  I'm not saying that we shouldn't do this
> too, of course.

Yeah, sure. Each one of those patches is trying to solve a different
problem where Postgres is deficient, here we'd like to be sure a
commit WAL record is correctly flushed on multiple standbys, while the
patch of Thomas is trying to ensure that there is no need to scan for
the replay position of a standby using some GUC parameters and a
validation/sanity layer in syncrep.c to do that. Surely the patch of
this thread has got more attention than Thomas', and both of them have
merits and try to address real problems. FWIW, the patch of Thomas is
a topic that I find rather interesting, and I am planning to look at
it as well, perhaps for next CF or even before that. We'll see how
other things move on.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Tue, Feb 9, 2016 at 10:32 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Feb 3, 2016 at 7:33 AM, Robert Haas wrote:
>> Also, to be frank, I think we ought to be putting more effort into
>> another patch in this same area, specifically Thomas Munro's causal
>> reads patch.  I think a lot of people today are trying to use
>> synchronous replication to build load-balancing clusters and avoid the
>> problem where you write some data and then read back stale data from a
>> standby server.  Of course, our current synchronous replication
>> facilities make no such guarantees - his patch does, and I think
>> that's pretty important.  I'm not saying that we shouldn't do this
>> too, of course.
>
> Yeah, sure. Each one of those patches is trying to solve a different
> problem where Postgres is deficient, here we'd like to be sure a
> commit WAL record is correctly flushed on multiple standbys, while the
> patch of Thomas is trying to ensure that there is no need to scan for
> the replay position of a standby using some GUC parameters and a
> validation/sanity layer in syncrep.c to do that. Surely the patch of
> this thread has got more attention than Thomas', and both of them have
> merits and try to address real problems. FWIW, the patch of Thomas is
> a topic that I find rather interesting, and I am planning to look at
> it as well, perhaps for next CF or even before that. We'll see how
> other things move on.

Attached is the first version of the dedicated language patch (documentation patch is not yet done).

This patch supports only the 1-level priority method, but the feature
will be expanded later by adding the quorum method or more than one
level of nesting, so the patch is implemented with extensibility in mind.
I've also implemented the new system view we discussed on this thread,
but that feature is not included in this patch (because it's not
necessary yet).

== Syntax ==
s_s_names can have two types of syntax, as follows:

1. s_s_names = 'node1, node2, node3'
2. s_s_names = '2[node1, node2, node3]'

Syntax #1 is for backward compatibility, and implies that the master
server waits for only 1 server.
Syntax #2 is the new syntax using the dedicated language.

In the #2 setting above, the node1 standby has the highest priority
(the lowest priority value) and the node3 standby has the lowest.
The master server will make COMMIT wait until at least the 2
highest-priority standbys have sent an ACK to the master.
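
As a sketch of the intended behavior (not actual patch output), with all
three standbys connected, the reported state should be:

    SELECT application_name, sync_priority, sync_state
      FROM pg_stat_replication ORDER BY sync_priority;
    --  node1 | 1 | sync
    --  node2 | 2 | sync
    --  node3 | 3 | potential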

== Memory Structure ==
Previously, the master server kept the value of s_s_names as a string
and used it when determining standby priorities.
This patch changes that: the master server now has a new memory
structure (called SyncGroupNode) in order to handle multiple (and, in
the future, nested) standby nodes flexibly.
All information in a SyncGroupNode is set while parsing s_s_names.

The memory structure is,

struct    SyncGroupNode
{
   /* Common information */
   int        type;
   char    *name;
   SyncGroupNode    *next; /* same group next name node */

   /* For group node */
   int sync_method; /* priority */
   int    wait_num;
   SyncGroupNode    *member; /* member of its group */
   bool (*SyncRepGetSyncedLsnsFn) (SyncGroupNode *group, XLogRecPtr *write_pos,
                                   XLogRecPtr *flush_pos);
   int (*SyncRepGetSyncStandbysFn) (SyncGroupNode *group, int *list);
};

A SyncGroupNode can be one of two types, a name node or a group node,
and it has a pointer to the next name/group node in the same group as
well as a list of group members.
A name node represents a synchronous standby.
A group node represents a group of name nodes; it holds the list of
group members, the synchronous method, and the number of nodes to wait for.
The member list is a one-way linked list, in s_s_names definition order.
E.g., in the case of the #2 setting above, the member list would be:

"main".member -> "node1".next -> "node2".next -> "node3".next -> NULL

The topmost node is always the "main" group node; i.e., in this
version of the patch, only one group (the "main" group) is created,
containing only name nodes (not group nodes).
A group node has two function pointers:

* SyncRepGetSyncedLsnsFn
This function decides the group's write/flush LSNs at that moment.
For example, in the case of the priority method, the lowest LSNs of the
standbys that are considered synchronous should be selected.
If there are not enough synchronous standbys to decide the LSNs, this
function returns false.

* SyncRepGetSyncStandbysFn
This function obtains an array of walsnd positions for the standby
members that are considered synchronous.

This implementation might not be good in some respects, so please give
me feedback. I will create a new commitfest entry for this patch in CF5.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Feb 10, 2016 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Feb 9, 2016 at 10:32 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Feb 3, 2016 at 7:33 AM, Robert Haas wrote:
>>> Also, to be frank, I think we ought to be putting more effort into
>>> another patch in this same area, specifically Thomas Munro's causal
>>> reads patch.  I think a lot of people today are trying to use
>>> synchronous replication to build load-balancing clusters and avoid the
>>> problem where you write some data and then read back stale data from a
>>> standby server.  Of course, our current synchronous replication
>>> facilities make no such guarantees - his patch does, and I think
>>> that's pretty important.  I'm not saying that we shouldn't do this
>>> too, of course.
>>
>> Yeah, sure. Each one of those patches is trying to solve a different
>> problem where Postgres is deficient, here we'd like to be sure a
>> commit WAL record is correctly flushed on multiple standbys, while the
>> patch of Thomas is trying to ensure that there is no need to scan for
>> the replay position of a standby using some GUC parameters and a
>> validation/sanity layer in syncrep.c to do that. Surely the patch of
>> this thread has got more attention than Thomas', and both of them have
>> merits and try to address real problems. FWIW, the patch of Thomas is
>> a topic that I find rather interesting, and I am planning to look at
>> it as well, perhaps for next CF or even before that. We'll see how
>> other things move on.
>
> Attached is the first version of the dedicated language patch (documentation patch is not yet done).

Thanks for the patch! Will review it.

I think that it's time to write the documentation patch.

Though I've not read the patch yet, I found that your patch
changed s_s_names so that it rejects non-alphabet characters
like *, according to my simple test. It should accept any
application_name which we can use.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Feb 10, 2016 at 2:57 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Feb 10, 2016 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Attached is the first version of the dedicated language patch (documentation patch is not yet done).
>
> Thanks for the patch! Will review it.
>
> I think that it's time to write the documentation patch.
>
> Though I've not read the patch yet, I found that your patch
> changed s_s_names so that it rejects non-alphabet characters
> like *, according to my simple test. It should accept any
> application_name which we can use.

Cool. Planning to look at it as well. Could you also submit a
regression test based on the recovery infrastructure, as a separate
patch? There is a version of such a test upthread, but it would be
good to extract it properly.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Tue, 9 Feb 2016 13:31:46 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSJgDLLsVk_Et-O=NBfJNqx3GbHszCYGvuTLRxHaZV3xQ@mail.gmail.com>
> On Tue, Feb 9, 2016 at 1:16 PM, Kyotaro HORIGUCHI
> >> > Anyway that's not a small project, and perhaps I am over-complicating
> >> > the whole thing.
> >> >
> >> > Thoughts?
> >>
> >> I agree that we would need something like such new view in the future,
> >> however it seems too late to work on that for 9.6 unfortunately.
> >> There is only one CommitFest left. Let's focus on very simple case, i.e.,
> >> 1-level priority list, now, then we can extend it to cover other cases.
> >>
> >> If we can commit the simple version too early and there is enough
> >> time before the date of feature freeze, of course I'm happy to review
> >> the extended version like you proposed, for 9.6.
> >
> > I agree to Fujii-san. There would be many of convenient gadgets
> > around this and they are completely welcome, but having
> > fundamental functionality in 9.6 seems to be far benetifical for
> > most of us.
> 
> Hm. Rushing features in because we need them now is not really
> community-like. I'd rather not have us taking decisions like that
> knowing that we may pay a certain price in the long-term, while it
> pays in the short term, aka the 9.6 release. However, having a base in
> place for the mini-language would give enough room for future
> improvements, so I am fine with having only 1-level of nesting, with
> {} and [] supported. This can as well be simply represented within
> pg_stat_replication because we'd have basically only one group of
> nodes for now (if I got the idea correctly), the and status of each
> entry in pg_stat_replication would just need to reflect either
> potential or sync, which is something that now users are used to.

I agree that we should be more prudent about 'stiff',
hard-to-modify-later things. But once we decide to use the []{}
format at the beginning (I believe) for this feature, it is
surely extensible enough, and 1 level of replication sets is
sufficient to cover many new cases and keeps the implementation
simple. The internal structure can be evolutionary, in contrast to its
user interface. I don't think such a way of development is
un-community-like in cases like this.

Anyway thank you very much for understanding.

> So, if I got the vibe correctly, we would basically just allow that in
> a first shot:
> N{node_list}, to define a priority group
> N[node_list], to define a quorum group
> There can be only one group, and elements in a node list cannot be a
> group. No need of group names either.
> -- 

That's quite reasonable for the first release of this feature. We
can/should consider the extensibility of the implementation of this
feature through review.
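
For illustration, the first shot quoted above would allow settings such
as the following (the standby names are hypothetical):

    # priority group: wait for the 2 highest-priority standbys of the 3
    synchronous_standby_names = '2{tokyo, london, ny}'
    # quorum group: wait for any 2 of the 3
    synchronous_standby_names = '2[tokyo, london, ny]'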

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Wed, Feb 10, 2016 at 9:18 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Feb 10, 2016 at 2:57 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Wed, Feb 10, 2016 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Attached first version dedicated language patch (document patch is not yet.)
>>
>> Thanks for the patch! Will review it.
>>
>> I think that it's time to write the documentation patch.
>>
>> Though I've not read the patch yet, I found that your patch
>> changed s_s_names so that it rejects non-alphabet character
>> like *, according to my simple test. It should accept any
>> application_name which we can use.
>
> Cool. Planning to look at it as well. Could you as well submit a
> regression test based on the recovery infrastructure and submit it as
> a separate patch? There is a version upthread of such a test but it
> would be good to extract it properly.

Yes, I will implement regression test patch and documentation patch as well.

Attached is the latest version patch, supporting s_s_names = '*'.
Slightly unlike the current behaviour, s_s_names can have only one '*' character;
e.g., the following settings will get a syntax error.

s_s_names = '*, node1,node2'
s_s_names = '2[node1, *, node2]'

When we use the '*' character as an s_s_names element, we must set
s_s_names as follows.

s_s_names = '*'
s_s_names = '2[*]'

BTW, we've discussed the mini-language syntax.
IIRC, the syntax uses [] and () like:
'N[node1, node2, ...]', to define priority standbys.
'N(node1, node2, ...)', to define quorum standbys.
And the current patch behaves so.

Which type of parentheses should be used to make this syntax clearer?
Or should other characters be used, such as <> or //?

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello, 

At Wed, 10 Feb 2016 02:57:54 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwHR1MNpAgRMh9T0oy0OnydkGaymcNgVOE-1VLZ8Z9twjA@mail.gmail.com>
> On Wed, Feb 10, 2016 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Tue, Feb 9, 2016 at 10:32 PM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> >> On Wed, Feb 3, 2016 at 7:33 AM, Robert Haas wrote:
> >>> Also, to be frank, I think we ought to be putting more effort into
> >>> another patch in this same area, specifically Thomas Munro's causal
> >>> reads patch.  I think a lot of people today are trying to use
> >>> synchronous replication to build load-balancing clusters and avoid the
> >>> problem where you write some data and then read back stale data from a
> >>> standby server.  Of course, our current synchronous replication
> >>> facilities make no such guarantees - his patch does, and I think
> >>> that's pretty important.  I'm not saying that we shouldn't do this
> >>> too, of course.
> >>
> >> Yeah, sure. Each one of those patches is trying to solve a different
> >> problem where Postgres is deficient, here we'd like to be sure a
> >> commit WAL record is correctly flushed on multiple standbys, while the
> >> patch of Thomas is trying to ensure that there is no need to scan for
> >> the replay position of a standby using some GUC parameters and a
> >> validation/sanity layer in syncrep.c to do that. Surely the patch of
> >> this thread has got more attention than Thomas', and both of them have
> >> merits and try to address real problems. FWIW, the patch of Thomas is
> >> a topic that I find rather interesting, and I am planning to look at
> >> it as well, perhaps for next CF or even before that. We'll see how
> >> other things move on.
> >
> > Attached first version dedicated language patch (document patch is not yet.)
> 
> Thanks for the patch! Will review it.
> 
> I think that it's time to write the documentation patch.
> 
> Though I've not read the patch yet, I found that your patch
> changed s_s_names so that it rejects non-alphabet character
> like *, according to my simple test. It should accept any
> application_name which we can use.

Thanks for the quick response. At a glance, I'd like to offer
some random suggestions, mainly on writing conventions.

===
Running postgres with s_s_names = '*' raises an error, as Fujii-san
said, and it yields the following message.

| $ postgres 
| FATAL:  syntax error: unexpected character "*"

Mmm.. It would be tough to find out what has happened..


===

check_synchronous_standby_names frees the parsed SyncRepStandbyNames
immediately, but no reason is explained there. The following
comment seems to be saying something related to this, but it
doesn't explain the reason for freeing it.

+ /*
+  * Any additional validation of standby names should go here.
+  *
+  * Don't attempt to set WALSender priority because this is executed by
+  * postmaster at startup, not WALSender, so the application_name is not
+  * yet correctly set.
+  */


In addition to that, I'd like to see a description like
'syncgroup_yyparse sets the global SyncRepStandbyNames as a side
effect' around it.

===
malloc/free are used in create_name_node and other functions used
in the scanner, but syncgroup_gram.y is said to use
palloc/pfree. Maybe they should use the same memory
allocation/freeing functions.

===
The variable SyncRepStandbyNames holds a list of
SyncGroupNode*. This is somewhat confusing. How about
SyncRepStandbys?


===
+static void
+SyncRepClearStandbyGroupList(SyncGroupNode *group)
+{
+    SyncGroupNode *n = group->member;

The name 'n' is a bit confusing. I believe one-letter variables
should be used only following the implicit (and ancient?)
convention, or in pretty short-lived and obvious cases. 'name'
or 'group_name' might be better instead. There's similar usage of
'n' in other places.


===
+ * Find active walsender position of WalSnd by name. Returns index of walsnds
+ * array if found, otherwise return -1.

I didn't get what 'walsender position' means within this
comment. And, as discussed upthread, there can be multiple
walsenders with the same name. So this might be better written like this:

> * Finds the first active synchronous walsender with the given name
> * in WalSndCtl->walsnds and returns its index. Returns
> * -1 if not found.

===
+ * Get both synced LSNS: write and flush, using its group function and check
+ * whether each LSN has advanced to, or not.

This is a question for all: which should we use, synced, synched or
synchronized? Maybe we should use the non-abbreviated spelling
unless it makes the description too long to read easily.

> * Return true if we have enough synchronized standbys and the 'safe'
> * written and flushed LSNs, which are LSNs assured in all standbys
> * considered should be synchronized.

# Please rewrite me.

===
+SyncRepSyncedLsnAdvancedTo(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
+{
+    XLogRecPtr    cur_write_pos;
+    XLogRecPtr    cur_flush_pos;
+    bool        ret;

The names cur_*_pos are a bit confusing. They hold the LSNs reached
by all of the standbys chosen as synchronized. So how about
safe_*_pos? And 'ret' is not the return value of this function,
so it could have a more specific name, such as... satisfied? Or
something else..


===
+SyncRepSyncedLsnAdvancedTo(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
...
+    /* Check whether each LSN has advanced to */
+    if (ret)
+    {
...
+        return true;
+    }
+
+    return false;

This might be a matter of taste, but it would be simpler if written
with the reverse condition.
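
A sketch of that reverse-condition form, reusing the variables from the
quoted hunk (the flush-side check is my assumption):

    if (!ret)
        return false;

    /* Check whether each LSN has advanced to the synced positions. */
    return (MyWalSnd->write >= cur_write_pos &&
            MyWalSnd->flush >= cur_flush_pos);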

===
+ SyncRepSyncedLsnAdvancedTo(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
...
+  ret = SyncRepStandbyNames->SyncRepGetSyncedLsnsFn(SyncRepStandbyNames,
+                            &cur_write_pos,
+                            &cur_flush_pos);
...
+    if (MyWalSnd->write >= cur_write_pos)

I suppose SyncRepGetSyncedLsnsFn, or SyncRepGetSyncedLsnsPriority,
can return InvalidXLogRecPtr as cur_*_pos even when it returns
true. And I suppose comparison of LSN values with
InvalidXLogRecPtr is not well-defined. In any case, the condition goes
wrong when cur_write_pos = InvalidXLogRecPtr (but ret = true).
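
A minimal guard of the kind being suggested here, using the existing
XLogRecPtrIsInvalid() macro from xlogdefs.h (a sketch, not the patch's
code):

    if (!ret ||
        XLogRecPtrIsInvalid(cur_write_pos) ||
        XLogRecPtrIsInvalid(cur_flush_pos))
        return false;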

===
+ * Obtain a array containing positions of standbys of specified group
+ * currently considered as synchronous up to wait_num of its group.
+ * Caller is respnsible for allocating the data obtained.

# Anyone please reedit my rewriting below.. Perhaps my writing is
# quite unreadable..

> * Return the positions of the first group->wait_num
> * synchronized standbys in group->member list into
> * sync_list. sync_list is assumed to have enough space for
> * at least group->wait_num elements.

===
+bool
+SyncRepGetSyncedLsnsPriority(SyncGroupNode *group, XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
+{
...
+    for(n = group->member; n != NULL; n = n->next)

group->member holds two or more items, so a better name would be
group->members or member_list.


===
+  /* We already got enough synchronous standbys, return */
+  if (num == group->wait_num)

As a convention for safety, this kind of comparison should use
inequality operators.

>  if (num >= group->wait_num)

===
At a glance, SyncRepGetSyncedLsnsPriority and
SyncRepGetSyncStandbysPriority do almost the same thing, and both
run loops over the group members. Couldn't they run at once?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Wed, 10 Feb 2016 11:25:49 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCHytB88ZdC0899J7PLNTKWTg0gczC2M7dqLmK71vdY0w@mail.gmail.com>
> On Wed, Feb 10, 2016 at 9:18 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Wed, Feb 10, 2016 at 2:57 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> >> On Wed, Feb 10, 2016 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >>> Attached first version dedicated language patch (document patch is not yet.)
> >>
> >> Thanks for the patch! Will review it.
> >>
> >> I think that it's time to write the documentation patch.
> >>
> >> Though I've not read the patch yet, I found that your patch
> >> changed s_s_names so that it rejects non-alphabet character
> >> like *, according to my simple test. It should accept any
> >> application_name which we can use.
> >
> > Cool. Planning to look at it as well. Could you as well submit a
> > regression test based on the recovery infrastructure and submit it as
> > a separate patch? There is a version upthread of such a test but it
> > would be good to extract it properly.
> 
> Yes, I will implement regression test patch and documentation patch as well.
> 
> Attached latest version patch supporting s_s_names = '*'.
> Unlike currently behaviour a bit, s_s_names can have only one '*' character.
> e.g, The following setting will get syntax error.
> 
> s_s_names = '*, node1,node2'
> s_s_names = `2[node1, *, node2]`

We could use the setting s_s_names = 'node1, node2, *' as an
extended representation of the old s_s_names. It tries node1 and node2
first, and then any name if they fail. Similarly, '2[node1,
node2, *]' is also meaningful.

> when we use '*' character as s_s_names element, we must set s_s_names
> like follows.
> 
> s_s_names = '*'
> s_s_names = '2[*]'
> 
> BTW, we've discussed about mini language syntax.
> IIRC, the syntax uses [] and () like,
> 'N[node1, node2, ...]', to define priority standbys.
> 'N(node1, node2, ...)', to define quorum standbys.
> And current patch behaves so.
> 
> Which type of parentheses should be used for this syntax to be more clarity?
> Or other character should be used such as <>, // ?

I believed that [] and {} were used respectively for no particular
reason. I think a symmetrical pair of characters is preferable for
readability. The candidate pairs among ASCII characters are:

(), {}, [], <>

{} might be a bit difficult to distinguish from [] on unclear
consoles :p

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:


On Wed, Feb 10, 2016 at 11:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Yes, I will implement regression test patch and documentation patch as well.

Cool, now that we have a clear picture of where we want to move, that would be an excellent thing to have. Having the docs in place is clearly mandatory.

> Attached latest version patch supporting s_s_names = '*'.
> Unlike currently behaviour a bit, s_s_names can have only one '*' character.
> e.g, The following setting will get syntax error.
>
> s_s_names = '*, node1,node2'
> s_s_names = `2[node1, *, node2]`
>
> when we use '*' character as s_s_names element, we must set s_s_names
> like follows.
>
> s_s_names = '*'
> s_s_names = '2[*]'
>
> BTW, we've discussed about mini language syntax.
> IIRC, the syntax uses [] and () like,
> 'N[node1, node2, ...]', to define priority standbys.
> 'N(node1, node2, ...)', to define quorum standbys.
> And current patch behaves so.
>
> Which type of parentheses should be used for this syntax to be more clarity?
> Or other character should be used such as <>, // ?

I am personally fine with () and [] as you mention; we could even consider {}. Each one of them has a different meaning mathematically..

I have not entered into a detailed review yet (waiting for the docs), but the patch looks brittle. I have been able to crash the server just by querying pg_stat_replication:
* thread #1: tid = 0x0000, 0x0000000105eb36c2 postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at walsender.c:2783, stop reason = signal SIGSTOP
  * frame #0: 0x0000000105eb36c2 postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at walsender.c:2783
    frame #1: 0x0000000105d4277d postgres`ExecMakeTableFunctionResult(funcexpr=0x00007fea128f3838, econtext=0x00007fea128f1b58, argContext=0x00007fea128c8ea8, expectedDesc=0x00007fea128f4710, randomAccess='\0') + 1005 at execQual.c:2211
    frame #2: 0x0000000105d70c24 postgres`FunctionNext(node=0x00007fea128f2f78) + 180 at nodeFunctionscan.c:95
* thread #1: tid = 0x0000, 0x0000000105eb36c2 postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at walsender.c:2783, stop reason = signal SIGSTOP
    frame #0: 0x0000000105eb36c2 postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at walsender.c:2783
   2780        /*
   2781         * Get the currently active synchronous standby.
   2782         */
-> 2783        sync_standbys = (int *) palloc(sizeof(int) * SyncRepStandbyNames->wait_num);
   2784        LWLockAcquire(SyncRepLock, LW_SHARED);
   2785        num_sync = SyncRepGetSyncStandbysPriority(SyncRepStandbyNames, sync_standbys);
   2786        LWLockRelease(SyncRepLock);
(lldb) p SyncRepStandbyNames
(SyncGroupNode *) $0 = 0x0000000000000000
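
A minimal sketch of the kind of NULL check that would avoid this crash,
keeping the quoted code's names (the fallback to zero sync standbys is
my assumption):

    int *sync_standbys = NULL;
    int  num_sync = 0;

    if (SyncRepStandbyNames != NULL)
    {
        sync_standbys = (int *) palloc(sizeof(int) * SyncRepStandbyNames->wait_num);
        LWLockAcquire(SyncRepLock, LW_SHARED);
        num_sync = SyncRepGetSyncStandbysPriority(SyncRepStandbyNames, sync_standbys);
        LWLockRelease(SyncRepLock);
    }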

+sync_node_group:
+       sync_list                           { $$ = create_group_node(1, $1); }
+   |   sync_element_ast                    { $$ = create_group_node(1, $1);}
+   |   INT '[' sync_list ']'               { $$ = create_group_node($1, $3);}
+   |   INT '[' sync_element_ast ']'        { $$ = create_group_node($1, $3); }
We may want to be careful with the use of '[' in application_name. I am not much thrilled with forbidding the use of []() in application_name, so we may want to recommend users to use a backslash in s_s_names when a group is defined.

+void
+yyerror(const char *message)
+{
+    ereport(ERROR,
+       (errcode(ERRCODE_SYNTAX_ERROR),
+           errmsg_internal("%s", message)));
+}
whitespace errors here.
--
Michael

Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Feb 10, 2016 at 3:13 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Feb 10, 2016 at 11:25 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> I am personally fine with () and [] as you mention, we could even consider
> {}, each one of them has a different meaning mathematically..
>
> I am not entered into a detailed review yet (waiting for the docs), but the
> patch looks brittle. I have been able to crash the server just by querying
> pg_stat_replication:
> * thread #1: tid = 0x0000, 0x0000000105eb36c2
> postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at
> walsender.c:2783, stop reason = signal SIGSTOP
>   * frame #0: 0x0000000105eb36c2
> postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at
> walsender.c:2783
>     frame #1: 0x0000000105d4277d
> postgres`ExecMakeTableFunctionResult(funcexpr=0x00007fea128f3838,
> econtext=0x00007fea128f1b58, argContext=0x00007fea128c8ea8,
> expectedDesc=0x00007fea128f4710, randomAccess='\0') + 1005 at
> execQual.c:2211
>     frame #2: 0x0000000105d70c24
> postgres`FunctionNext(node=0x00007fea128f2f78) + 180 at
> nodeFunctionscan.c:95
> * thread #1: tid = 0x0000, 0x0000000105eb36c2
> postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at
> walsender.c:2783, stop reason = signal SIGSTOP
>     frame #0: 0x0000000105eb36c2
> postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at
> walsender.c:2783
>    2780        /*
>    2781         * Get the currently active synchronous standby.
>    2782         */
> -> 2783        sync_standbys = (int *) palloc(sizeof(int) *
> SyncRepStandbyNames->wait_num);
>    2784        LWLockAcquire(SyncRepLock, LW_SHARED);
>    2785        num_sync =
> SyncRepGetSyncStandbysPriority(SyncRepStandbyNames, sync_standbys);
>    2786        LWLockRelease(SyncRepLock);
> (lldb) p SyncRepStandbyNames
> (SyncGroupNode *) $0 = 0x0000000000000000
>
> +sync_node_group:
> +       sync_list                           { $$ = create_group_node(1, $1);
> }
> +   |   sync_element_ast                    { $$ = create_group_node(1,
> $1);}
> +   |   INT '[' sync_list ']'               { $$ = create_group_node($1,
> $3);}
> +   |   INT '[' sync_element_ast ']'        { $$ = create_group_node($1,
> $3); }
> We may want to be careful with the use of '[' in application_name. I am not
> much thrilled with forbidding the use of []() in application_name, so we may
> want to recommend user to use a backslash when using s_s_names when a group
> is defined.
>
> +void
> +yyerror(const char *message)
> +{
> +    ereport(ERROR,
> +       (errcode(ERRCODE_SYNTAX_ERROR),
> +           errmsg_internal("%s", message)));
> +}
> whitespace errors here.

+#define MAX_WALSENDER_NAME 8192
+
 typedef enum WalSndState
 {
     WALSNDSTATE_STARTUP = 0,
@@ -62,6 +64,11 @@ typedef struct WalSnd
      * SyncRepLock.
      */
     int            sync_standby_priority;
+
+    /*
+     * Corresponding standby's application_name.
+     */
+    const char       name[MAX_WALSENDER_NAME];
 } WalSnd;
NAMEDATALEN instead?
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Wed, 10 Feb 2016 15:22:44 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqRk4ZjoQfs4rmF6Di1zp=b4eA=hk0L4GFzUj47GwhgM7g@mail.gmail.com>
> On Wed, Feb 10, 2016 at 3:13 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Wed, Feb 10, 2016 at 11:25 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> > wrote:
> > I am personally fine with () and [] as you mention, we could even consider
> > {}, each one of them has a different meaning mathematically..
> >
> > I am not entered into a detailed review yet (waiting for the docs), but the
> > patch looks brittle. I have been able to crash the server just by querying
> > pg_stat_replication:
> > * thread #1: tid = 0x0000, 0x0000000105eb36c2
> > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at
> > walsender.c:2783, stop reason = signal SIGSTOP
> >   * frame #0: 0x0000000105eb36c2
> > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at
> > walsender.c:2783
> >     frame #1: 0x0000000105d4277d
> > postgres`ExecMakeTableFunctionResult(funcexpr=0x00007fea128f3838,
> > econtext=0x00007fea128f1b58, argContext=0x00007fea128c8ea8,
> > expectedDesc=0x00007fea128f4710, randomAccess='\0') + 1005 at
> > execQual.c:2211
> >     frame #2: 0x0000000105d70c24
> > postgres`FunctionNext(node=0x00007fea128f2f78) + 180 at
> > nodeFunctionscan.c:95
> > * thread #1: tid = 0x0000, 0x0000000105eb36c2
> > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at
> > walsender.c:2783, stop reason = signal SIGSTOP
> >     frame #0: 0x0000000105eb36c2
> > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at
> > walsender.c:2783
> >    2780        /*
> >    2781         * Get the currently active synchronous standby.
> >    2782         */
> > -> 2783        sync_standbys = (int *) palloc(sizeof(int) *
> > SyncRepStandbyNames->wait_num);
> >    2784        LWLockAcquire(SyncRepLock, LW_SHARED);
> >    2785        num_sync =
> > SyncRepGetSyncStandbysPriority(SyncRepStandbyNames, sync_standbys);
> >    2786        LWLockRelease(SyncRepLock);
> > (lldb) p SyncRepStandbyNames
> > (SyncGroupNode *) $0 = 0x0000000000000000
> >
> > +sync_node_group:
> > +       sync_list                           { $$ = create_group_node(1, $1);
> > }
> > +   |   sync_element_ast                    { $$ = create_group_node(1,
> > $1);}
> > +   |   INT '[' sync_list ']'               { $$ = create_group_node($1,
> > $3);}
> > +   |   INT '[' sync_element_ast ']'        { $$ = create_group_node($1,
> > $3); }
> > We may want to be careful with the use of '[' in application_name. I am not
> > much thrilled with forbidding the use of []() in application_name, so we may
> > want to recommend user to use a backslash when using s_s_names when a group
> > is defined.

Mmmm. I found that application_name can contain
commas. Furthermore, there seems to be no limitation on the
characters in the name.

postgres=# set application_name='ho,ge';
postgres=# select application_name from pg_stat_activity;
 application_name
------------------
 ho,ge

check_application_name() allows all characters in the range
between 32 and 126 in ASCII. All other characters are replaced
with '?'.
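
That replacement boils down to something like this (a sketch of the
idea, not the actual guc.c code):

    char *p;

    for (p = appname; *p != '\0'; p++)
    {
        /* clamp anything outside printable ASCII to '?' */
        if (*p < 32 || *p > 126)
            *p = '?';
    }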

> > +void
> > +yyerror(const char *message)
> > +{
> > +    ereport(ERROR,
> > +       (errcode(ERRCODE_SYNTAX_ERROR),
> > +           errmsg_internal("%s", message)));
> > +}
> > whitespace errors here.
> 
> +#define MAX_WALSENDER_NAME 8192
> +
>  typedef enum WalSndState
>  {
>      WALSNDSTATE_STARTUP = 0,
> @@ -62,6 +64,11 @@ typedef struct WalSnd
>       * SyncRepLock.
>       */
>      int            sync_standby_priority;
> +
> +    /*
> +     * Corresponding standby's application_name.
> +     */
> +    const char       name[MAX_WALSENDER_NAME];
>  } WalSnd;
> NAMEDATALEN instead?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Feb 10, 2016 at 5:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> Hello,
>
> At Wed, 10 Feb 2016 15:22:44 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqRk4ZjoQfs4rmF6Di1zp=b4eA=hk0L4GFzUj47GwhgM7g@mail.gmail.com>
> > On Wed, Feb 10, 2016 at 3:13 PM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> > > On Wed, Feb 10, 2016 at 11:25 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> > > wrote:
> > > I am personally fine with () and [] as you mention, we could even consider
> > > {}, each one of them has a different meaning mathematically..
> > >
> > > I am not entered into a detailed review yet (waiting for the docs), but the
> > > patch looks brittle. I have been able to crash the server just by querying
> > > pg_stat_replication:
> > > * thread #1: tid = 0x0000, 0x0000000105eb36c2
> > > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at
> > > walsender.c:2783, stop reason = signal SIGSTOP
> > >   * frame #0: 0x0000000105eb36c2
> > > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at
> > > walsender.c:2783
> > >     frame #1: 0x0000000105d4277d
> > > postgres`ExecMakeTableFunctionResult(funcexpr=0x00007fea128f3838,
> > > econtext=0x00007fea128f1b58, argContext=0x00007fea128c8ea8,
> > > expectedDesc=0x00007fea128f4710, randomAccess='\0') + 1005 at
> > > execQual.c:2211
> > >     frame #2: 0x0000000105d70c24
> > > postgres`FunctionNext(node=0x00007fea128f2f78) + 180 at
> > > nodeFunctionscan.c:95
> > > * thread #1: tid = 0x0000, 0x0000000105eb36c2
> > > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at
> > > walsender.c:2783, stop reason = signal SIGSTOP
> > >     frame #0: 0x0000000105eb36c2
> > > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at
> > > walsender.c:2783
> > >    2780        /*
> > >    2781         * Get the currently active synchronous standby.
> > >    2782         */
> > > -> 2783        sync_standbys = (int *) palloc(sizeof(int) *
> > > SyncRepStandbyNames->wait_num);
> > >    2784        LWLockAcquire(SyncRepLock, LW_SHARED);
> > >    2785        num_sync =
> > > SyncRepGetSyncStandbysPriority(SyncRepStandbyNames, sync_standbys);
> > >    2786        LWLockRelease(SyncRepLock);
> > > (lldb) p SyncRepStandbyNames
> > > (SyncGroupNode *) $0 = 0x0000000000000000
> > >
> > > +sync_node_group:
> > > +       sync_list                           { $$ = create_group_node(1, $1);
> > > }
> > > +   |   sync_element_ast                    { $$ = create_group_node(1,
> > > $1);}
> > > +   |   INT '[' sync_list ']'               { $$ = create_group_node($1,
> > > $3);}
> > > +   |   INT '[' sync_element_ast ']'        { $$ = create_group_node($1,
> > > $3); }
> > > We may want to be careful with the use of '[' in application_name. I am not
> > > much thrilled with forbidding the use of []() in application_name, so we may
> > > want to recommend user to use a backslash when using s_s_names when a group
> > > is defined.
>
> Mmmm. I found that application_name can contain
> commas. Furthermore, there seems to be no limitation for
> character in the name.
>
> postgres=# set application_name='ho,ge';
> postgres=# select application_name from pg_stat_activity;
>  application_name
> ------------------
>  ho,ge
>
> check_application_name() allows all characters in the range
> between 32 to 126 in ascii. All other characters are replaced
> with '?'.

Actually I was thinking about that a couple of hours ago. If the
application_name of a node has a comma, it cannot become a sync
replica, no? Wouldn't we need special handling in s_s_names, like
'\,' making a comma part of an application name? Or should we just ban
commas from the list of supported characters in application names?
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Fri, Feb 5, 2016 at 3:36 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> So, here are some thoughts to make that more user-friendly. I think
> that the critical issue here is to properly flatten the meta data in
> the custom language and represent it properly in a new catalog,
> without messing up too much with the existing pg_stat_replication that
> people are now used to for 5 releases since 9.0.

Putting the metadata in a catalog doesn't seem great because that only
can ever work on the master.  Maybe there's no need to configure this
on the slaves and therefore it's OK, but I feel nervous about putting
cluster configuration in catalogs.  Another reason for that is that if
synchronous replication is broken, then you need a way to change the
catalog, which involves committing a write transaction; there's a
danger that your efforts to do this will be tripped up by the broken
synchronous replication configuration.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Fri, Feb 12, 2016 at 2:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 5, 2016 at 3:36 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> So, here are some thoughts to make that more user-friendly. I think
>> that the critical issue here is to properly flatten the meta data in
>> the custom language and represent it properly in a new catalog,
>> without messing up too much with the existing pg_stat_replication that
>> people are now used to for 5 releases since 9.0.
>
> Putting the metadata in a catalog doesn't seem great because that only
> can ever work on the master.  Maybe there's no need to configure this
> on the slaves and therefore it's OK, but I feel nervous about putting
> cluster configuration in catalogs.  Another reason for that is that if
> synchronous replication is broken, then you need a way to change the
> catalog, which involves committing a write transaction; there's a
> danger that your efforts to do this will be tripped up by the broken
> synchronous replication configuration.

I was referring to a catalog view that parses the information related
to groups of s_s_names in a flattened way to show each group's sync
status. Perhaps my words should have been clearer.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Thu, Feb 11, 2016 at 5:40 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Feb 12, 2016 at 2:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Feb 5, 2016 at 3:36 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> So, here are some thoughts to make that more user-friendly. I think
>>> that the critical issue here is to properly flatten the meta data in
>>> the custom language and represent it properly in a new catalog,
>>> without messing up too much with the existing pg_stat_replication that
>>> people are now used to for 5 releases since 9.0.
>>
>> Putting the metadata in a catalog doesn't seem great because that only
>> can ever work on the master.  Maybe there's no need to configure this
>> on the slaves and therefore it's OK, but I feel nervous about putting
>> cluster configuration in catalogs.  Another reason for that is that if
>> synchronous replication is broken, then you need a way to change the
>> catalog, which involves committing a write transaction; there's a
>> danger that your efforts to do this will be tripped up by the broken
>> synchronous replication configuration.
>
> I was referring to a catalog view that parses the information related
> to groups of s_s_names in a flattened way to show each group sync
> status. Perhaps my words should have been clearer.

Ah.  Well, that's different, then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Wed, 10 Feb 2016 18:36:43 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqTHmuuDdKWmoaY1ZAi-gRnT_HRdHGyiqpNfFFr15qc5uA@mail.gmail.com>
> On Wed, Feb 10, 2016 at 5:34 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > > +sync_node_group:
> > > > +       sync_list                           { $$ = create_group_node(1, $1);
> > > > }
> > > > +   |   sync_element_ast                    { $$ = create_group_node(1,
> > > > $1);}
> > > > +   |   INT '[' sync_list ']'               { $$ = create_group_node($1,
> > > > $3);}
> > > > +   |   INT '[' sync_element_ast ']'        { $$ = create_group_node($1,
> > > > $3); }
> > > > We may want to be careful with the use of '[' in application_name. I am not
> > > > much thrilled with forbidding the use of []() in application_name, so we may
> > > > want to recommend user to use a backslash when using s_s_names when a group
> > > > is defined.
> >
> > Mmmm. I found that application_name can contain
> > commas. Furthermore, there seems to be no limitation for
> > character in the name.
> >
> > postgres=# set application_name='ho,ge';
> > postgres=# select application_name from pg_stat_activity;
> >  application_name
> > ------------------
> >  ho,ge
> >
> > check_application_name() allows all characters in the range
> > between 32 to 126 in ascii. All other characters are replaced
> > with '?'.
> 
> Actually I was thinking about that a couple of hours ago. If the
> application_name of a node has a comma, it cannot become a sync
> replica, no? Wouldn't we need a special handling in s_s_names like
> '\,' make a comma part of an application name? Or just ban commas from
> the list of supported characters in the application name?

Surprisingly, yes. The list is handled as an identifier list and
parsed by SplitIdentifierString, thus it can accept double-quoted
names.

s_s_names='abc, def, " abc,""def"'

Result list is ["abc", "def", " abc,\"def"]

Simply supporting the same notation addresses the problem and
accepts strings like the following.

s_s_names='2["comma,name", "foo[bar,baz]"]'


It is currently an undocumented behavior, but I doubt the
necessity of an explicit mention.
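
For reference, this is roughly how the splitting goes on the server
side; SplitIdentifierString's signature is as in varlena.c, and the
surrounding lines are only a sketch:

    char   *rawstring = pstrdup(standby_name_list);    /* hypothetical copy of the GUC */
    List   *elemlist;

    /* splits on ',', honoring double-quoted names with "" escapes */
    if (!SplitIdentifierString(rawstring, ',', &elemlist))
    {
        /* syntax error in list */
        pfree(rawstring);
        return false;
    }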

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Mon, Feb 15, 2016 at 2:11 PM, Kyotaro HORIGUCHI wrote:
> Surprizingly yes. The list is handled as an identifier list and
> parsed by SplitIdentifierString thus it can accept double-quoted
> names.

Good point. I was not aware of this trick.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Mon, Feb 15, 2016 at 2:54 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Feb 15, 2016 at 2:11 PM, Kyotaro HORIGUCHI wrote:
>> Surprizingly yes. The list is handled as an identifier list and
>> parsed by SplitIdentifierString thus it can accept double-quoted
>> names.
>

Attached is the latest version patch, which has only the feature logic
so far. I'm writing the documentation patch for this feature now, so
this version doesn't have the documentation and regression test
patches.

> | $ postgres
> | FATAL:  syntax error: unexpected character "*"
> Mmm.. It should be tough to find what has happened..

I'm trying to implement a better error message, but that change is not
included in this version of the patch yet.

> malloc/free are used in create_name_node and other functions to
> be used in scanner, but syncgroup_gram.y is said to use
> palloc/pfree. Maybe they should use the same memory
> allocation/freeing functions.

The setting is like this: I think we use the malloc/free functions
when we allocate/free memory for the SyncRepStandbys variables; OTOH,
we use the palloc/pfree functions while parsing SyncRepStandbyString.
Am I missing something?

> I suppose SyncRepGetSyncedLsnsFn, or SyncRepGetSyncedLsnsPriority
> can return InvalidXLogRecPtr as cur_*_pos even when it returns
> true. And, I suppose comparison of LSN values with
> InvalidXLogRecPtr is not well-defined. Anyway the condition goes
> wrong when cur_write_pos = InvalidXLogRecPtr (but ret = true).

In this version of the patch, it's not possible to return
InvalidXLogRecPtr together with got_lsns = true (got_lsns was
formerly ret), so we can be sure that we got valid LSNs when
got_lsns = true.

> At a glance, SyncRepGetSyncedLsnsPriority and
> SyncRepGetSyncStandbysPriority does almost the same thing and both
> runs loops over group members. Couldn't they run at once?

Yeah, I've optimized that logic.

> We may want to be careful with the use of '[' in application_name.
> I am not much thrilled with forbidding the use of []() in application_name, so we may
> want to recommend user to use a backslash when using s_s_names when a
> group is defined.
> s_s_names='abc, def, " abc,""def"'
>
> Result list is ["abc", "def", " abc,\"def"]
>
> Simplly supporting the same notation addresses the problem and
> accepts strings like the following.
>
> s_s_names='2["comma,name", "foo[bar,baz]"]'

I've changed the s_s_names parser so that it can handle the 4 special
characters (',', ' ', '[', ']') and can handle double-quoted strings
accurately, the same as what SplitIdentifierString does.
We cannot use the 4 special characters without using a double-quoted
string. Also, if we use the " (double-quote) character inside a
double-quoted string, we should write "" (two double-quotes).
For example, if application_name = 'hoge " bar', then
s_s_names = '"hoge "" bar"' would be matched.

Other given comments are fixed.

Remaining tasks are:
- Document patch.
- Regression test patch.
- Syntax error message for s_s_names improvement.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
"Kharage, Suraj"
Date:
Hello,

>Remaining tasks are;
>- Document patch.
>- Regression test patch.
>- Syntax error message for s_s_names improvement.

Please find attached a patch with a regression test for multi-sync
replication. I have created this patch on top of Michael's
recovery-test-suite patch. Please review it.

Regards
Suraj kharage

Attachment

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Feb 16, 2016 at 4:19 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Feb 15, 2016 at 2:54 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Mon, Feb 15, 2016 at 2:11 PM, Kyotaro HORIGUCHI wrote:
>>> Surprizingly yes. The list is handled as an identifier list and
>>> parsed by SplitIdentifierString thus it can accept double-quoted
>>> names.
>>
>
> Attached latest version patch which has only feature logic so far.
> I'm writing document patch about this feature now, so this version
> patch doesn't have document and regression test patch.

Thanks for updating the patch!

When I changed s_s_names to 'hoge*' and reloaded the configuration file,
the server crashed unexpectedly with the following error message.
This is obviously a bug.
   FATAL:  syntax error

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Mon, 22 Feb 2016 22:52:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwENujogaQvcc=u0tffNfFGtwXNb1yFcphdTYCJdG1_j1A@mail.gmail.com>
> On Tue, Feb 16, 2016 at 4:19 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Mon, Feb 15, 2016 at 2:54 PM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> >> On Mon, Feb 15, 2016 at 2:11 PM, Kyotaro HORIGUCHI wrote:
> >>> Surprizingly yes. The list is handled as an identifier list and
> >>> parsed by SplitIdentifierString thus it can accept double-quoted
> >>> names.
> >>
> >
> > Attached latest version patch which has only feature logic so far.
> > I'm writing document patch about this feature now, so this version
> > patch doesn't have document and regression test patch.
> 
> Thanks for updating the patch!
> 
> When I changed s_s_names to 'hoge*' and reloaded the configuration file,
> the server crashed unexpectedly with the following error message.
> This is obviously a bug.
> 
>     FATAL:  syntax error

I had a glance at the lexer part of the new patch. It would be
better to design the lexer from scratch according to the
required behavior.

The documentation for the syntax says the following:

http://www.postgresql.org/docs/current/static/runtime-config-logging.html

> application_name (string)
> 
> The application_name can be any string of less than NAMEDATALEN
> characters (64 characters in a standard build). <snip> Only
> printable ASCII characters may be used in the application_name
> value. Other characters will be replaced with question marks (?).

And according to what the functions mentioned so far do, an
application_name is treated as follows, I suppose.

- check_application_name() currently allows [\x20-\x7e], which differs
  from the definition of SQL identifiers.

- SplitIdentifierString() and syncrep code

  - allows any byte except a double quote in double-quoted
    representation. A double-quote just after a delimiter can open
    quoted representation.

  - Non-quoted name can contain any character including double
    quotes except ',' and white spaces.

  - The syncrep code does case-insensitive matching with the
    application_name.

So, to preserve or follow the current behavior, except for the last
point, the following pattern definitions would do. The lexer/grammar
for the new format of s_s_names could be simpler than it is now.

space            [ \n\r\f\t\v] /* See the definition of isspace(3) */
whitespace        {space}+
dquote            \"
app_name_chars    [\x21-\x2b\x2d-\x7e]   /* excluding ' ', ',' */
app_name_indq_chars [\x20\x21\x23-\x7e]  /* excluding '"'  */
app_name_dq_chars ({app_name_indq_chars}|{dquote}{dquote})
delimiter         {whitespace}*,{whitespace}*
app_name  ({app_name_chars}+|{dquote}{app_name_dq_chars}+{dquote})
s_s_names {app_name}({delimiter}{app_name})*
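
For example, under those patterns each of the following values lexes as
a two-name list (illustrative inputs only):

    node1, node2
    "no,de1",node2
    "a ""quoted"" name" , node2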

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

Ok, I think we should concentrate on the parser part for now.

At Tue, 23 Feb 2016 17:44:44 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20160223.174444.178687579.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello,
> 
> At Mon, 22 Feb 2016 22:52:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwENujogaQvcc=u0tffNfFGtwXNb1yFcphdTYCJdG1_j1A@mail.gmail.com>
> > Thanks for updating the patch!
> > 
> > When I changed s_s_names to 'hoge*' and reloaded the configuration file,
> > the server crashed unexpectedly with the following error message.
> > This is obviously a bug.
> > 
> >     FATAL:  syntax error
> 
> I had a glance on the lexer part in the new patch.  It'd be
> better to design the lexer from the beginning according to the
> required behavior.
> 
> The documentation for the syntax is saying as the following,
> 
> http://www.postgresql.org/docs/current/static/runtime-config-logging.html
> 
> > application_name (string)
> > 
> > The application_name can be any string of less than NAMEDATALEN
> > characters (64 characters in a standard build). <snip> Only
> > printable ASCII characters may be used in the application_name
> > value. Other characters will be replaced with question marks (?).
> 
> And according to what some functions mentioned so far do, totally
> an application_name is treated as follwoing, I suppose.
> 
> - check_application_name() currently allows [\x20-\x7e], which
>   differs from the definition of the SQL identifiers.
> 
> - SplitIdentifierString() and syncrep code
> 
>   - allows any byte except a double quote in double-quoted
>    representation. A double-quote just after a delimiter can open
>    quoted representation.
> 
>   - Non-quoted name can contain any character including double
>     quotes except ',' and white spaces.
> 
>   - The syncrep code does case-insensitive matching with the
>    application_name.
> 
> So, to preserve or following the current behavior expct the last
> one, the following pattern definitions would do. The
> lexer/grammer for the new format of s_s_names could be simpler
> than what it is.
> 
> space            [ \n\r\f\t\v] /* See the definition of isspace(3) */
> whitespace        {space}+
> dquote            \"
> app_name_chars    [\x21-\x2b\x2d-\x7e]   /* excluding ' ', ',' */
> app_name_indq_chars [\x20\x21\x23-\x7e]  /* excluding '"'  */
> app_name_dq_chars ({app_name_indq_chars}|{dquote}{dquote})
> delimiter         {whitespace}*,{whitespace}*
> app_name  ({app_name_chars}+|{dquote}{app_name_dq_chars}+{dquote})
> s_s_names {app_name}({delimiter}{app_name})*


So I made a hasty independent parser for the syntax, including the
group names, for the convenience of separate testing. The parser
takes input from stdin and prints the resulting structure.

It can take the old s_s_names format and the new list format. We
haven't discussed how to add group names, but I added them as
"<grpname>" just before the number of synchronous standbys of []
and {} lists.

Is this usable for further discussion?

The sources can be compiled with the following command line:

$ bison -v test.y; flex -l test.l; gcc -g -DYYDEBUG=1 -DYYERROR_VERBOSE -o ltest test.tab.c

and it produces output like the following.

[horiguti@drain tmp]$ echo '123[1,3,<x>3{a,b,e},4,*]' | ./ltest

TYPE: PRIO_LIST
GROUPNAME: <none>
NSYNC: 123
NEST: 2
CHILDREN {
  {
    TYPE: HOSTNAME
    HOSTNAME: 1
    QUOTED: No
    NEST: 1
  }
  {
    TYPE: HOSTNAME
    HOSTNAME: 3
    QUOTED: No
    NEST: 0
  }
  TYPE: QUORUM_LIST
  GROUPNAME: x
  NSYNC: 3
  NEST: 1
  CHILDREN {
    {
      TYPE: HOSTNAME
      HOSTNAME: a
      QUOTED: No
      NEST: 0
    }
    {
      TYPE: HOSTNAME
      HOSTNAME: b
      QUOTED: No
      NEST: 0
    }
    {
      TYPE: HOSTNAME
      HOSTNAME: e
      QUOTED: No
      NEST: 0
    }
  }
  {
    TYPE: HOSTNAME
    HOSTNAME: 4
    QUOTED: No
    NEST: 0
  }
  {
    TYPE: HOSTNAME
    HOSTNAME: *
    QUOTED: No
    NEST: 0
  }
}


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
%{
#include <stdio.h>
#include <stdlib.h>

%}

%option noyywrap

%x DQNAME
%x APPNAME

space            [ \t\n\r\f]
whitespace        {space}+

dquote            \"
app_name_chars    [\x21-\x2b\x2d-\x3b\x3d\x3f-\x5a\x5c\x5e-\x7a\x7c\x7e]
app_name_indq_chars [\x20\x21\x23-\x7e]
app_name    {app_name_chars}+
app_name_dq ({app_name_indq_chars}|{dquote}{dquote})+
delimiter         {whitespace}*,{whitespace}*
app_name_start {app_name_chars}
any_app \*|({dquote}\*{dquote})
xdstart {dquote}
xdstop  {dquote}
self    [\[\]\{\}<>]
%%
{xdstart} { BEGIN(DQNAME); }
<DQNAME>{xdstop} { BEGIN(INITIAL); }
<DQNAME>{app_name_dq} {
    static char name[64];
    int i, j;

    for (i = j = 0 ; j < 63 && yytext[i] ; i++, j++)
    {
        if (yytext[i] == '"')
        {
            if (yytext[i+1] == '"')
                name[j] = '"';
            else
                fprintf(stderr, "illegal quote escape");
            i++;
        }
        else
            name[j] = yytext[i];
    }
    name[j] = 0;
    yylval.str = strdup(name);
    return QUOTED_NAME;
}
{app_name_start} { BEGIN(APPNAME); yyless(0); }
<APPNAME>{app_name} {
    char *p;

    yylval.str = strdup(yytext);
    for (p = yylval.str ; *p ; p++)
    {
        if (*p >= 'A' && *p <= 'Z')
            *p = *p + ('a' - 'A');
    }
    BEGIN(INITIAL);
    return NAME_OR_NUMBER;
}
{delimiter} { return DELIMITER;}
{self} { return yytext[0];}

%%

//int main(void)
//{
//    int r;
//
//    while(r = yylex()) {
//    fprintf(stderr, "#%d:(%s)#", r, yylval.str);
//    yylval.str = "";
//    }
//}
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
//#define YYDEBUG 1
typedef enum treeelemtype
{
    TE_HOSTNAME,
    TE_PRIORITY_LIST,
    TE_QUORUM_LIST
} treeelemtype;

struct syncdef;
typedef struct syncdef
{
    treeelemtype type;      /* hostname or list? */
    char   *name;           /* hostname or group name */
    int     quoted;         /* was the name double-quoted? */
    int     nsync;          /* number of standbys to wait for */
    int     nest;           /* nesting depth of this subtree */
    struct syncdef *elems;  /* first child of a list */
    struct syncdef *next;   /* next sibling */
} syncdef;

void yyerror(const char *s);
int yylex(void);
int depth = 0;
syncdef *defroot = NULL;
syncdef *curr = NULL;
%}


%union
{
    char    *str;
    int      ival;
    syncdef *syncdef;
}
%token <str> NAME_OR_NUMBER
%token <str> QUOTED_NAME
%token DELIMITER

%type <syncdef> qlist plist name_list name_elem name_elem_nonlist
%type <syncdef> old_list s_s_names list_maybe_with_name
%type <str> group_name

%%
s_s_names:
    old_list
    {
        syncdef *t = (syncdef*)malloc(sizeof(syncdef));

        t->type = TE_PRIORITY_LIST;
        t->name = NULL;
        t->quoted = 0;
        t->nsync = 1;
        t->elems = $1;
        t->next = NULL;
        defroot = $$ = t;
    }
    | list_maybe_with_name
    {
        defroot = $$ = $1;
    }
    ;

old_list:
    name_elem_nonlist
    {
        $$ = $1;
    }
    | old_list DELIMITER name_elem_nonlist
    {
        syncdef *p = $1;

        while (p->next) p = p->next;
        p->next = $3;
    }
    ;

list_maybe_with_name:
    plist      { $$ = $1; }
    | qlist    { $$ = $1; }
    | '<' group_name '>' plist
    {
        $4->name = $2;
        $$ = $4;
    }
    | '<' group_name '>' qlist
    {
        $4->name = $2;
        $$ = $4;
    }
    ;

group_name:
    NAME_OR_NUMBER    { $$ = strdup($1); }
    | QUOTED_NAME     { $$ = strdup($1); }
    ;

plist:
    NAME_OR_NUMBER '[' name_list ']'
    {
        syncdef *t;
        int n = atoi($1);

        if (n == 0)
        {
            yyerror("prefix number is 0 or non-integer");
            return 1;
        }
        if ($3->nest > 1)
        {
            yyerror("Up to 2 levels of nesting is supported");
            return 1;
        }
        for (t = $3 ; t ; t = t->next)
        {
            if (t->type == TE_HOSTNAME && t->next && strcmp(t->name, "*") == 0)
            {
                yyerror("\"*\" is allowed only at the end of priority list");
                return 1;
            }
        }
        t = (syncdef*)malloc(sizeof(syncdef));
        t->type = TE_PRIORITY_LIST;
        t->nsync = n;
        t->name = NULL;
        t->quoted = 0;
        t->nest = $3->nest + 1;
        t->elems = $3;
        t->next = NULL;
        $$ = t;
    }
    ;

qlist:
    NAME_OR_NUMBER '{' name_list '}'
    {
        syncdef *t;
        int n = atoi($1);

        if (n == 0)
        {
            yyerror("prefix number is 0 or non-integer");
            return 1;
        }
        if ($3->nest > 1)
        {
            yyerror("Up to 2 levels of nesting is supported");
            return 1;
        }
        for (t = $3 ; t ; t = t->next)
        {
            if (t->type == TE_HOSTNAME && strcmp(t->name, "*") == 0)
            {
                yyerror("\"*\" is not allowed in quorum list");
                return 1;
            }
        }
        t = (syncdef*)malloc(sizeof(syncdef));
        t->type = TE_QUORUM_LIST;
        t->nsync = n;
        t->name = NULL;
        t->quoted = 0;
        t->nest = $3->nest + 1;
        t->elems = $3;
        t->next = NULL;
        $$ = t;
    }
    ;

name_list:
    name_elem
    {
        $$ = $1;
    }
    | name_list DELIMITER name_elem
    {
        syncdef *p = $1;

        if (p->nest < $3->nest)
            p->nest = $3->nest;
        while (p->next) p = p->next;
        p->next = $3;
        $$ = $1;
    }
    ;

name_elem:
    name_elem_nonlist        { $$ = $1; }
    | list_maybe_with_name   { $$ = $1; }
    ;

name_elem_nonlist:
    NAME_OR_NUMBER
    {
        syncdef *t = (syncdef*)malloc(sizeof(syncdef));

        t->type = TE_HOSTNAME;
        t->nsync = 0;
        t->name = strdup($1);
        t->quoted = 0;
        t->nest = 0;
        t->elems = NULL;
        t->next = NULL;
        $$ = t;
    }
    | QUOTED_NAME
    {
        syncdef *t = (syncdef*)malloc(sizeof(syncdef));

        t->type = TE_HOSTNAME;
        t->nsync = 0;
        t->name = strdup($1);
        t->quoted = 1;
        t->nest = 0;
        t->elems = NULL;
        t->next = NULL;
        $$ = t;
    }
    ;
%%
void
indent(int level)
{
    int i;

    for (i = 0 ; i < level * 2 ; i++)
        putc(' ', stdout);
}

void
dump_def(syncdef *def, int level)
{
    char *typelabel[] = {"HOSTNAME", "PRIO_LIST", "QUORUM_LIST"};
    syncdef *p;

    if (def == NULL)
        return;
    switch (def->type)
    {
    case TE_HOSTNAME:
        indent(level);
        puts("{");
        indent(level+1);
        printf("TYPE: %s\n", typelabel[def->type]);
        indent(level+1);
        printf("HOSTNAME: %s\n", def->name);
        indent(level+1);
        printf("QUOTED: %s\n", def->quoted ? "Yes" : "No");
        indent(level+1);
        printf("NEST: %d\n", def->nest);
        indent(level);
        puts("}");
        if (def->next)
            dump_def(def->next, level);
        break;
    case TE_PRIORITY_LIST:
    case TE_QUORUM_LIST:
        indent(level);
        printf("TYPE: %s\n", typelabel[def->type]);
        indent(level);
        printf("GROUPNAME: %s\n", def->name ? def->name : "<none>");
        indent(level);
        printf("NSYNC: %d\n", def->nsync);
        indent(level);
        printf("NEST: %d\n", def->nest);
        indent(level);
        puts("CHILDREN {");
        level++;
        dump_def(def->elems, level);
        level--;
        indent(level);
        puts("}");
        if (def->next)
            dump_def(def->next, level);
        break;
    default:
        fprintf(stderr, "Unknown type?\n");
        exit(1);
    }
    level--;
}


int main(void)
{
//    yydebug = 1;
    if (!yyparse())
        dump_def(defroot, 0);
}

void yyerror(const char* s)
{
    fprintf(stderr, "Error: %s\n", s);
}

#include "lex.yy.c"

Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Wed, Feb 24, 2016 at 5:37 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> Ok, I think we should concentrate the parser part for now.
>
> At Tue, 23 Feb 2016 17:44:44 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20160223.174444.178687579.horiguchi.kyotaro@lab.ntt.co.jp>
 
>> Hello,
>>
>> At Mon, 22 Feb 2016 22:52:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwENujogaQvcc=u0tffNfFGtwXNb1yFcphdTYCJdG1_j1A@mail.gmail.com>
>> > Thanks for updating the patch!
>> >
>> > When I changed s_s_names to 'hoge*' and reloaded the configuration file,
>> > the server crashed unexpectedly with the following error message.
>> > This is obviously a bug.
>> >
>> >     FATAL:  syntax error
>>
>> I had a glance on the lexer part in the new patch.  It'd be
>> better to design the lexer from the beginning according to the
>> required behavior.
>>
>> The documentation for the syntax is saying as the following,
>>
>> http://www.postgresql.org/docs/current/static/runtime-config-logging.html
>>
>> > application_name (string)
>> >
>> > The application_name can be any string of less than NAMEDATALEN
>> > characters (64 characters in a standard build). <snip> Only
>> > printable ASCII characters may be used in the application_name
>> > value. Other characters will be replaced with question marks (?).
>>
>> And according to what some functions mentioned so far do, totally
>> an application_name is treated as follwoing, I suppose.
>>
>> - check_application_name() currently allows [\x20-\x7e], which
>>   differs from the definition of the SQL identifiers.
>>
>> - SplitIdentifierString() and syncrep code
>>
>>   - allows any byte except a double quote in double-quoted
>>    representation. A double-quote just after a delimiter can open
>>    quoted representation.
>>
>>   - Non-quoted name can contain any character including double
>>     quotes except ',' and white spaces.
>>
>>   - The syncrep code does case-insensitive matching with the
>>    application_name.
>>
>> So, to preserve or following the current behavior expct the last
>> one, the following pattern definitions would do. The
>> lexer/grammer for the new format of s_s_names could be simpler
>> than what it is.
>>
>> space                 [ \n\r\f\t\v] /* See the definition of isspace(3) */
>> whitespace            {space}+
>> dquote            \"
>> app_name_chars    [\x21-\x2b\x2d-\x7e]   /* excluding ' ', ',' */
>> app_name_indq_chars [\x20\x21\x23-\x7e]  /* excluding '"'  */
>> app_name_dq_chars ({app_name_indq_chars}|{dquote}{dquote})
>> delimiter         {whitespace}*,{whitespace}*
>> app_name  ({app_name_chars}+|{dquote}{app_name_dq_chars}+{dquote})
>> s_s_names {app_name}({delimiter}{app_name})*
>
>
> So I made a hasty independent parser for the syntax including the
> group names for the convenience of separate testing.  The parser
> takes input from stdin and prints the result structure.
>
> It can take the old s_s_names format and the new list format. We haven't
> discussed how to add group names but I added it as "<grpname>"
> just before the # of synchronous standbys of [] and {} lists.
>
> Is this usable for further discussions?

Thank you for your suggestion.

Another option is to add the group name with ":" immediately after the set
of standbys, as I said earlier.
<http://www.postgresql.org/message-id/CAD21AoA9UqcbTnDKi0osd0yhN4FPgTrg6wuZeTtvpSYy2LqL5Q@mail.gmail.com>

s_s_names with group name would be as follows.
s_s_names = '2[local, 2[london1, london2, london3]:london, (tokyo1,
tokyo2):tokyo]'
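
To spell out the intended reading of that example ('[...]' being a
priority group, '(...)' a quorum group, and a missing count defaulting
to 1):

  2[ local,                                -- wait for 2 of these 3 entries
     2[london1, london2, london3]:london,  -- group "london": first 2 of 3, in listed order
     (tokyo1, tokyo2):tokyo ]              -- group "tokyo": any 1 of 2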

Thoughts?

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Wed, 24 Feb 2016 18:01:59 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCetS5BMcTpXXtMwG0hyszZgNn=zK1U73GcWTgJ-Wn3pQ@mail.gmail.com>
> On Wed, Feb 24, 2016 at 5:37 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Hello,
> >
> > Ok, I think we should concentrate the parser part for now.
> >
> > At Tue, 23 Feb 2016 17:44:44 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160223.174444.178687579.horiguchi.kyotaro@lab.ntt.co.jp>
> >> Hello,
...
> >> So, to preserve or follow the current behavior except the last
> >> one, the following pattern definitions would do. The
> >> lexer/grammar for the new format of s_s_names could be simpler
> >> than what it is.
> >>
> >> space                 [ \n\r\f\t\v] /* See the definition of isspace(3) */
> >> whitespace            {space}+
> >> dquote            \"
> >> app_name_chars    [\x21-\x2b\x2d-\x7e]   /* excluding ' ', ',' */
> >> app_name_indq_chars [\x20\x21\x23-\x7e]  /* excluding '"'  */
> >> app_name_dq_chars ({app_name_indq_chars}|{dquote}{dquote})
> >> delimiter         {whitespace}*,{whitespace}*
> >> app_name  ({app_name_chars}+|{dquote}{app_name_dq_chars}+{dquote})
> >> s_s_names {app_name}({delimiter}{app_name})*
> >
> >
> > So I made a hasty independent parser for the syntax including the
> > group names for the convenience of separate testing.  The parser
> > takes input from stdin and prints the result structure.
> >
> > It can take the old s_s_names format and the new list format. We haven't
> > discussed how to add group names but I added it as "<grpname>"
> > just before the # of synchronous standbys of [] and {} lists.
> >
> > Is this usable for further discussions?
> 
> Thank you for your suggestion.
> 
> Another option is to add the group name with ":" immediately after the set
> of standbys, as I said earlier.
> <http://www.postgresql.org/message-id/CAD21AoA9UqcbTnDKi0osd0yhN4FPgTrg6wuZeTtvpSYy2LqL5Q@mail.gmail.com>
> 
> s_s_names with group name would be as follows.
> s_s_names = '2[local, 2[london1, london2, london3]:london, (tokyo1,
> tokyo2):tokyo]'
> 
> Thoughts?

I have no problem with it. The attached new sample parser does
so.

By the way, your parser also complains about an example I've seen
somewhere upthread, "1[2,3,4]". This is because '2', '3' and '4'
are regarded as INT, not NAME. Whether a sequence of digits is a
prefix number of a list or a host name cannot be identified until
some following characters have been read. So my previous test.l defined
NAME_OR_INTEGER and it is distinguished on the grammar side to
resolve this problem.

If you want them identified on the lexer side, it has to look
forward, as <NAME_OR_PREFIX>{prefix} in the attached test.l
does. This makes the lexer a bit more complex but in contrast makes
test.y simpler. The attached test.l and test.y got refactored but the .l
gets a bit tricky..
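
To make the grammar-side approach concrete, here is a stripped-down,
self-contained illustration (not the rules from my test.y; the lexer is
hand-rolled only to keep it short). Digit runs are always returned as
NAME_OR_INTEGER and the grammar decides from context whether they are a
prefix count or a host name, so "1[2,3,4]" parses:

%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

int yylex(void);
void yyerror(const char *s);
%}

%union { char *str; }
%token <str> NAME_OR_INTEGER

%%

list: NAME_OR_INTEGER '[' names ']'
		{ printf("NSYNC: %d\n", atoi($1)); }	/* digits before '[' act as a count */
;
names: name
	| names ',' name
;
name: NAME_OR_INTEGER
		{ printf("NAME: %s\n", $1); }			/* elsewhere the same token is a name */
;

%%

int
yylex(void)
{
	int c;

	while ((c = getchar()) == ' ' || c == '\t' || c == '\n')
		;
	if (c == EOF)
		return 0;
	if (isalnum(c))
	{
		static char buf[64];
		int len = 0;

		do {
			if (len < 63)
				buf[len++] = c;
			c = getchar();
		} while (c != EOF && isalnum(c));
		ungetc(c, stdin);
		buf[len] = '\0';
		yylval.str = strdup(buf);
		return NAME_OR_INTEGER;
	}
	return c;	/* '[', ']' and ',' are returned as themselves */
}

void
yyerror(const char *s)
{
	fprintf(stderr, "Error: %s\n", s);
}

int
main(void)
{
	/* $ bison t.y && cc t.tab.c && echo '1[2,3,4]' | ./a.out */
	return yyparse();
}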

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>		/* for strdup() */
#include <ctype.h>		/* for isdigit() */

%}

%option noyywrap

%x DQNAME
%x NAME_OR_PREFIX
%x APPNAME
%x GRPCLOSED

space            [ \t\n\r\f]
whitespace        {space}+

dquote            \"
app_name_chars    [\x21-\x27\x2a\x2b\x2d-\x5a\x5c\x5e-\x7a\x7c\x7e]
app_name_indq_chars [\x20\x21\x23-\x7e]
app_name    {app_name_chars}+
app_name_dq ({app_name_indq_chars}|{dquote}{dquote})+
delimiter         {whitespace}*,{whitespace}*
app_name_start {app_name_chars}
any_app \*|({dquote}\*{dquote})
xdstart {dquote}
xdstop  {dquote}
openlist [\[\(]
prefix [0-9]+{whitespace}*{openlist}
closelist [\]\)]
%%
{xdstart} { BEGIN(DQNAME); }
<DQNAME>{xdstop} { BEGIN(INITIAL); }
<DQNAME>{app_name_dq} {
	appname *name = (appname *)malloc(sizeof(appname));
	int i, j;

	for (i = j = 0 ; j < 63 && yytext[i] ; i++, j++)
	{
		if (yytext[i] == '"')
		{
			if (yytext[i+1] == '"')
				name->str[j] = '"';
			else
				fprintf(stderr, "illegal quote escape\n");
			i++;
		}
		else
			name->str[j] = yytext[i];
	}
	name->str[j] = 0;
	name->quoted = 1;

	yylval.name = name;
	return NAME;
}
{app_name_start} { BEGIN(NAME_OR_PREFIX); yyless(0); }
<NAME_OR_PREFIX>{app_name} {
	appname *name = (appname *)malloc(sizeof(appname));
	char *p;

	name->quoted = 0;
	strncpy(name->str, yytext, 63);
	name->str[63] = 0;
	for (p = name->str ; *p ; p++)
	{
		if (*p >= 'A' && *p <= 'Z')
			*p = *p + ('a' - 'A');
	}
	yylval.name = name;
	BEGIN(INITIAL);
	return NAME;
}
<NAME_OR_PREFIX>{prefix} {
	static char prefix[16];
	int i, l;

	/* find the last digit */
	for (l = 0 ; l < 16 && isdigit(yytext[l]) ; l++);
	if (l > 15)
		fprintf(stderr, "too long prefix number for lists\n");
	for (i = 0 ; i < l ; i++)
		prefix[i] = yytext[i];
	prefix[i] = 0;
	yylval.str = strdup(prefix);

	/* prefix ends with a left bracket or paren, so go backward by 1
	   char for further reading */
	yyless(yyleng - 1);
	BEGIN(INITIAL);
	return PREFIX;
}
<GRPCLOSED>{whitespace}*. {
	BEGIN(INITIAL);
	if (yytext[yyleng - 1] == ':')
		return yytext[yyleng - 1];
	yyless(0);
}
{delimiter} { return DELIMITER; }
{openlist} {
	yylval.character = yytext[0];
	return OPENLIST;
}
{closelist} {
	BEGIN(GRPCLOSED);
	yylval.character = yytext[0];
	return CLOSELIST;
}

%%

//int main(void)
//{
//    int r;
//
//    while(r = yylex()) {
//    fprintf(stderr, "#%d:(%s)#", r, yylval.str);
//    yylval.str = "";
//    }
//}
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
//#define YYDEBUG 1
typedef enum treeelemtype
{
	TE_HOSTNAME, TE_PRIORITY_LIST, TE_QUORUM_LIST
} treeelemtype;

struct syncdef;
typedef struct syncdef
{
	treeelemtype type;
	char *name;
	int quoted;
	int nsync;
	int nest;

	struct syncdef *elems;
	struct syncdef *next;
} syncdef;

typedef struct
{
	int quoted;
	char str[64];
} appname;

void yyerror(const char *s);
int yylex(void);
int depth = 0;
syncdef *defroot = NULL;
syncdef *curr = NULL;
%}

%union
{
	char      character;
	char     *str;
	appname  *name;
	int       ival;
	syncdef  *syncdef;
}
%token <str> PREFIX
%token <name> NAME
%token <character> OPENLIST CLOSELIST
%token DELIMITER 

%type <syncdef> group_list name_list name_elem name_elem_nonlist
%type <syncdef> old_list s_s_names
%type <name> opt_groupname
%type <str> opt_prefix

%%
s_s_names:
	old_list
	{
		syncdef *t = (syncdef *)malloc(sizeof(syncdef));

		t->type = TE_PRIORITY_LIST;
		t->name = NULL;
		t->quoted = 0;
		t->nsync = 1;
		t->elems = $1;
		t->next = NULL;
		defroot = $$ = t;
	}
	| group_list
	{
		defroot = $$ = $1;
	}
;

old_list:
	name_elem_nonlist
	{
		$$ = $1;
	}
	| old_list DELIMITER name_elem_nonlist
	{
		syncdef *p = $1;

		while (p->next) p = p->next;
		p->next = $3;
	}
;

group_list:
	opt_prefix OPENLIST name_list CLOSELIST opt_groupname
	{
		syncdef *t;
		char *p = $1;
		int n = atoi($1);

		if (n == 0)
		{
			yyerror("prefix number is 0 or non-integer");
			return 1;
		}
		if ($3->nest > 1)
		{
			yyerror("Up to 2 levels of nesting is supported");
			return 1;
		}
		for (t = $3 ; t ; t = t->next)
		{
			if (t->type == TE_HOSTNAME && t->next && strcmp(t->name, "*") == 0)
			{
				yyerror("\"*\" is allowed only at the end of a priority list");
				return 1;
			}
		}
		if (($2 == '[' && $4 != ']') || ($2 == '(' && $4 != ')'))
		{
			yyerror("Unmatched group parentheses");
			return 1;
		}

		t = (syncdef *)malloc(sizeof(syncdef));
		t->type = ($2 == '[' ? TE_PRIORITY_LIST : TE_QUORUM_LIST);
		t->nsync = n;
		t->name = $5->str;
		t->quoted = $5->quoted;
		t->nest = $3->nest + 1;
		t->elems = $3;
		t->next = NULL;
		$$ = t;
	}
;

opt_prefix:
	PREFIX			{ $$ = $1; }
	| /* EMPTY */	{ $$ = "1"; }
;

opt_groupname:
	':' NAME		{ $$ = $2; }
	| /* EMPTY */
	{
		appname *name = (appname *)malloc(sizeof(appname));	/* was sizeof(name), an undersized allocation */

		name->str[0] = 0;
		name->quoted = 0;
		$$ = name;
	}
;

name_list:
	name_elem
	{
		$$ = $1;
	}
	| name_list DELIMITER name_elem
	{
		syncdef *p = $1;

		if (p->nest < $3->nest)
			p->nest = $3->nest;
		while (p->next) p = p->next;
		p->next = $3;
		$$ = $1;
	}
;

name_elem:
	name_elem_nonlist	{ $$ = $1; }
	| group_list		{ $$ = $1; }
;

name_elem_nonlist:
	NAME
	{
		syncdef *t = (syncdef *)malloc(sizeof(syncdef));

		t->type = TE_HOSTNAME;
		t->nsync = 0;
		t->name = strdup($1->str);
		t->quoted = $1->quoted;
		t->nest = 0;
		t->elems = NULL;
		t->next = NULL;
		$$ = t;
	}
;
%%
void
indent(int level)
{
	int i;

	for (i = 0 ; i < level * 2 ; i++)
		putc(' ', stdout);
}

void
dump_def(syncdef *def, int level)
{
	char *typelabel[] = {"HOSTNAME", "PRIO_LIST", "QUORUM_LIST"};
	syncdef *p;

	if (def == NULL)
		return;

	switch (def->type)
	{
	case TE_HOSTNAME:
		indent(level); puts("{");
		indent(level + 1); printf("TYPE: %s\n", typelabel[def->type]);
		indent(level + 1); printf("HOSTNAME: %s\n", def->name);
		indent(level + 1); printf("QUOTED: %s\n", def->quoted ? "Yes" : "No");
		indent(level + 1); printf("NEST: %d\n", def->nest);
		indent(level); puts("}");
		if (def->next)
			dump_def(def->next, level);
		break;
	case TE_PRIORITY_LIST:
	case TE_QUORUM_LIST:
		indent(level); printf("TYPE: %s\n", typelabel[def->type]);
		indent(level); printf("GROUPNAME: %s\n", def->name ? def->name : "<none>");
		indent(level); printf("NSYNC: %d\n", def->nsync);
		indent(level); printf("NEST: %d\n", def->nest);
		indent(level); puts("CHILDREN {");
		level++;
		dump_def(def->elems, level);
		level--;
		indent(level); puts("}");
		if (def->next)
			dump_def(def->next, level);
		break;
	default:
		fprintf(stderr, "Unknown type?\n");
		exit(1);
	}
	level--;
}


int main(void)
{
//	yydebug = 1;
	if (!yyparse())
		dump_def(defroot, 0);
}

void yyerror(const char* s)
{
	fprintf(stderr, "Error: %s\n", s);
}

#include "lex.yy.c"

Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
Attached latest patch includes document patch.

> When I changed s_s_names to 'hoge*' and reloaded the configuration file,
> the server crashed unexpectedly with the following error message.
> This is obviously a bug.

Fixed.

>   - allows any byte except a double quote in double-quoted
>    representation. A double-quote just after a delimiter can open
>    quoted representation.

No, a double quote is also allowed in the double-quoted representation,
written as two double quotes.
If s_s_names = '"node""hoge"' then the standby name will be 'node"hoge'.

>
> I have no problem with it. The attached new sample parser does
> so.
>
> By the way, your parser also complains for an example I've seen
> somewhere upthread "1[2,3,4]". This is because '2', '3' and '4'
> are regarded as INT, not NAME. Whether a sequence of digits is a
> prefix number of a list or a host name cannot be identified until
> reading some following characters. So my previous test.l defined
> NAME_OR_INTEGER and it is distinguished in the grammar side to
> resolve this problem.
>
> If you want them identified in the lexer side, it should do
> looking-forward as <NAME_OR_PREFIX>{prefix} in the attached
> test.l does. This makes the lexer a bit complex but in contrast
> test.y simpler. The test.l, test.y attached got refactored but .l
> gets a bit tricky..

I think that the lexer can pass both INT and NAME as char* to the parser,
and the parser can then regard them as an integer or a char*.
It would be simpler.
Thoughts?

Thank you for giving the lexer and parser example, but I'm not sure that it
makes things easier.
It seems to make things more complex.

The attached patch handles the parameter in a similar way to how postgres
parses SQL.
Please have a look at it and give me feedback.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Fri, Feb 26, 2016 at 1:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached latest patch includes document patch.
>
>> When I changed s_s_names to 'hoge*' and reloaded the configuration file,
>> the server crashed unexpectedly with the following error message.
>> This is obviously a bug.
>
> Fixed.
>
>>   - allows any byte except a double quote in double-quoted
>>    representation. A double-quote just after a delimiter can open
>>    quoted representation.
>
> No, a double quote is also allowed in the double-quoted representation,
> written as two double quotes.
> If s_s_names = '"node""hoge"' then the standby name will be 'node"hoge'.
>
>>
>> I have no problem with it. The attached new sample parser does
>> so.
>>
>> By the way, your parser also complains for an example I've seen
>> somewhere upthread "1[2,3,4]". This is because '2', '3' and '4'
>> are regarded as INT, not NAME. Whether a sequence of digits is a
>> prefix number of a list or a host name cannot be identified until
>> reading some following characters. So my previous test.l defined
>> NAME_OR_INTEGER and it is distinguished in the grammar side to
>> resolve this problem.
>>
>> If you want them identified in the lexer side, it should do
>> looking-forward as <NAME_OR_PREFIX>{prefix} in the attached
>> test.l does. This makes the lexer a bit complex but in contrast
>> test.y simpler. The test.l, test.y attached got refactored but .l
>> gets a bit tricky..
>
> I think that the lexer can pass both INT and NAME as char* to the parser,
> and the parser can then regard them as an integer or a char*.
> It would be simpler.
> Thoughts?
>
> Thank you for giving the lexer and parser example, but I'm not sure that it
> makes things easier.
> It seems to make things more complex.
>
> The attached patch handles the parameter in a similar way to how postgres
> parses SQL.
> Please have a look at it and give me feedback.
>

The previous patch could not parse a one-character standby name correctly.
Attached is the latest patch.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello, Thanks for the new patch.


At Fri, 26 Feb 2016 08:52:54 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAZKFVu8-MVhkJ3ywAiJmb=P-HSbJTGi=gK1La73KjS6Q@mail.gmail.com>
> The previous patch could not parse a one-character standby name correctly.
> Attached is the latest patch.

I haven't looked at it in detail but it won't work as you
expected. flex complains as follows for the v12 patch.

syncgroup_scanner.l:80: warning, rule cannot be matched
syncgroup_scanner.l:84: warning, rule cannot be matched

They are warnings about the patterns [1-9][0-9]* and {asterisk},
because both are also matched by {node_name}+. The latter does no harm
(the pattern is just useless) but the former will make '1[a,b,c]'
fail.
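
The mechanism is flex's match resolution: the longest match wins, and on
a tie the earlier rule wins, so a later rule whose every match is also
matched by an earlier rule can never fire. A minimal illustration (my
own, not from the patch) that reproduces the warning when run through
flex:

%option noyywrap
%%
[^ ,\[\]]+	{ /* node_name-like rule; also matches "123" */ }
[1-9][0-9]*	{ /* flex warns: rule cannot be matched */ }
%%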

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Fri, 26 Feb 2016 10:38:22 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20160226.103822.12680005.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello, Thanks for the new patch.
> 
> 
> At Fri, 26 Feb 2016 08:52:54 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAZKFVu8-MVhkJ3ywAiJmb=P-HSbJTGi=gK1La73KjS6Q@mail.gmail.com>
> > The previous patch could not parse a one-character standby name correctly.
> > Attached is the latest patch.
> 
> I haven't looked at it in detail but it won't work as you
> expected. flex complains as follows for the v12 patch.
> 
> syncgroup_scanner.l:80: warning, rule cannot be matched
> syncgroup_scanner.l:84: warning, rule cannot be matched

Making it independent from the postgres body, then compiling it with
-DYYDEBUG and setting yydebug = 1, would give you valuable information
and make testing of the parser far easier.

| $ flex test2.l; bison -v test2.y; gcc -g -DYYDEBUG -o ltest2 test2.tab.c
| $ echo '1[aa,bb,cc]' | ./ltest2
| Starting parse
| Entering state 0
| Reading a token: Next token is token NAME ()
| Shifting token NAME ()
| ...
| Entering state 4
| Next token is token '[' ()
| syntax error at or near "[" in "(null)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
%{
//#include "postgres.h"

/* No reason to constrain amount of data slurped */
#define YY_READ_BUF_SIZE 16777216

#define BUFSIZE 8192

/* Handles to the buffer that the lexer uses internally */
static YY_BUFFER_STATE scanbufhandle;

/* Functions for handling double quoted string */
static void init_xd_string(void);
static void addlit_xd_string(char *ytext, int yleng);
static void addlitchar_xd_string(unsigned char ychar);

char  *scanbuf;
char *xd_string;
int    xd_size; /* actual size of xd_string */
int    xd_len; /* string length of xd_string  */
%}
%option 8bit
/* %option never-interactive*/
/* %option nounput*/
/* %option noinput*/
%option noyywrap
%option warn
/* %option prefix="syncgroup_yy" */

/*
 * <xd> delimited identifiers (double-quoted identifiers)
 */
%x xd

space        [ \t\n\r\f]
non_newline    [^\n\r]
whitespace    ({space}+)
self        [\[\]\,]
asterisk    \*

/*
 * Basically all ascii characters except for {self} and {whitespace} are
 * allowed to be used for a node name. These special characters can be
 * used if double-quoted.
 */
/* excluding ' ', '\"', '*', ',', '[', ']' */
node_name    [\x21\x23-\x29\x29-\x2b\x2d-\x5a\x5c\x5e-\x7e]
/* excluding '\"' */
dquoted_name    [\x20\x21\x23-\x7e]

/* Double-quoted string */
dquote        \"
xdstart        {dquote}
xddouble    {dquote}{dquote}
xdstop        {dquote}
xdinside    {dquoted_name}+

%%
{whitespace}    { /* ignore */ }

{xdstart}	{
			init_xd_string();
			BEGIN(xd);
	}
<xd>{xddouble}	{
			addlitchar_xd_string('\"');
	}
<xd>{xdinside}	{
			addlit_xd_string(yytext, yyleng);
	}
<xd>{xdstop}	{
			xd_string[xd_len] = '\0';
			yylval.str = xd_string;
			BEGIN(INITIAL);
			return NAME;
	}

{node_name}+	{
			yylval.str = strdup(yytext);
			return NAME;
	}
[1-9][0-9]*	{
			yylval.str = yytext;
			return NUM;
	}
{asterisk}	{
			yylval.str = strdup(yytext);
			return AST;
	}
{self}		{
			return yytext[0];
	}
.	{
//			ereport(ERROR,
//				(errcode(ERRCODE_SYNTAX_ERROR),
//					errmsg("syntax error: unexpected character \"%s\"", yytext)));
			fprintf(stderr, "syntax error: unexpected character \"%s\"", yytext);
			exit(1);
	}
 
%%

void
yyerror(const char *message)
{
//	ereport(ERROR,
//		(errcode(ERRCODE_SYNTAX_ERROR),
//			errmsg("%s at or near \"%s\" in \"%s\"", message,
//				   yytext, scanbuf)));
	fprintf(stderr, "%s at or near \"%s\" in \"%s\"", message, yytext, scanbuf);
	exit(1);
}

void
syncgroup_scanner_init(const char *str)
{
	Size		slen = strlen(str);

	/*
	 * Might be left over after ereport()
	 */
	if (YY_CURRENT_BUFFER)
		yy_delete_buffer(YY_CURRENT_BUFFER);

	/*
	 * Make a scan buffer with special termination needed by flex.
	 */
	scanbuf = (char *) palloc(slen + 2);
	memcpy(scanbuf, str, slen);
	scanbuf[slen] = scanbuf[slen + 1] = YY_END_OF_BUFFER_CHAR;
	scanbufhandle = yy_scan_buffer(scanbuf, slen + 2);
}

void
syncgroup_scanner_finish(void)
{
	yy_delete_buffer(scanbufhandle);
	scanbufhandle = NULL;
}

static void
init_xd_string()
{
	xd_string = palloc(sizeof(char) * BUFSIZE);
	xd_size = BUFSIZE;
	xd_len = 0;
}

static void
addlit_xd_string(char *ytext, int yleng)
{
	/* enlarge buffer if needed */
	if ((xd_len + yleng) > xd_size)
		xd_string = repalloc(xd_string, xd_size + BUFSIZE);

	memcpy(xd_string + xd_len, ytext, yleng);
	xd_len += yleng;
}

static void
addlitchar_xd_string(unsigned char ychar)
{
	/* enlarge buffer if needed */
	if ((xd_len + 1) > xd_size)
		xd_string = repalloc(xd_string, xd_size + BUFSIZE);

	xd_string[xd_len] = ychar;
	xd_len += 1;
}
%{
/*-------------------------------------------------------------------------
 *
 * syncgroup_gram.y		- Parser for synchronous replication group
 *
 * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *		src/backend/replication/syncgroup_gram.y
 *
 *-------------------------------------------------------------------------
 */

//#include "postgres.h"

//#include "replication/syncrep.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define palloc malloc
#define repalloc realloc
#define pfree free

#define SYNC_REP_GROUP_MAIN            0x01
#define SYNC_REP_GROUP_NAME            0x02
#define SYNC_REP_GROUP_GROUP        0x04

#define SYNC_REP_METHOD_PRIORITY    0

struct SyncGroupNode;
typedef struct SyncGroupNode SyncGroupNode;

struct SyncGroupNode
{
	/* Common information */
	int		type;
	char	*name;
	SyncGroupNode	*next;		/* Same group, next name node */

	/* For group node */
	int		sync_method;	/* priority */
	int		wait_num;
	SyncGroupNode	*members;	/* members of its group */
};

static SyncGroupNode *create_name_node(char *name);
static SyncGroupNode *add_node(SyncGroupNode *node_list, SyncGroupNode *node);
static SyncGroupNode *create_group_node(char *wait_num, SyncGroupNode *node_list);
static void yyerror(const char *message);
typedef int Size;

/*
 * Bison doesn't allocate anything that needs to live across parser calls,
 * so we can easily have it use palloc instead of malloc.  This prevents
 * memory leaks if we error out during parsing.  Note this only works with
 * bison >= 2.0.  However, in bison 1.875 the default is to use alloca()
 * if possible, so there's not really much problem anyhow, at least if
 * you're building with gcc.
 */
#define YYMALLOC palloc
#define YYFREE   pfree

SyncGroupNode *SyncRepStandbys;
%}

%expect 0
/*%name-prefix="syncgroup_yy"*/

%union
{
	char			*str;
	SyncGroupNode	*expr;
}

%token <str> NAME NUM
%token <str> AST

%type <expr> result sync_list sync_list_ast sync_element sync_element_ast
             sync_node_group sync_group_old sync_group

%start result

%%
result:
	sync_node_group						{ SyncRepStandbys = $1; }
;
sync_node_group:
	sync_group_old						{ $$ = $1; }
	| sync_group						{ $$ = $1; }
;
sync_group_old:
	sync_list							{ $$ = create_group_node("1", $1); }
	| sync_list_ast						{ $$ = create_group_node("1", $1); }
;
sync_group:
	NUM '[' sync_list ']'				{ $$ = create_group_node($1, $3); }
	| NUM '[' sync_list_ast ']'			{ $$ = create_group_node($1, $3); }
;
sync_list:
	sync_element						{ $$ = $1; }
	| sync_list ',' sync_element		{ $$ = add_node($1, $3); }
;
sync_list_ast:
	sync_element_ast					{ $$ = $1; }
	| sync_list ',' sync_element_ast	{ $$ = add_node($1, $3); }
;
sync_element:
	NAME								{ $$ = create_name_node($1); }
	| NUM								{ $$ = create_name_node($1); }
;
sync_element_ast:
	AST									{ $$ = create_name_node($1); }
;
%%

static SyncGroupNode *
create_name_node(char *name)
{
	SyncGroupNode *name_node = (SyncGroupNode *) malloc(sizeof(SyncGroupNode));

	/* Common information */
	name_node->type = SYNC_REP_GROUP_NAME;
	name_node->name = strdup(name);
	name_node->next = NULL;

	/* For GROUP node */
	name_node->sync_method = 0;
	name_node->wait_num = 0;
	name_node->members = NULL;
//	name_node->SyncRepGetSyncedLsnsFn = NULL;
//	name_node->SyncRepGetSyncStandbysFn = NULL;

	return name_node;
}

static SyncGroupNode *
create_group_node(char *wait_num, SyncGroupNode *node_list)
{
	SyncGroupNode *group_node = (SyncGroupNode *) malloc(sizeof(SyncGroupNode));

	/* For NAME node */
	group_node->type = SYNC_REP_GROUP_GROUP | SYNC_REP_GROUP_MAIN;
	group_node->name = "main";
	group_node->next = NULL;

	/* For GROUP node */
	group_node->sync_method = SYNC_REP_METHOD_PRIORITY;
	group_node->wait_num = atoi(wait_num);
	group_node->members = node_list;
//	group_node->SyncRepGetSyncedLsnsFn = SyncRepGetSyncedLsnsUsingPriority;
//	group_node->SyncRepGetSyncStandbysFn = SyncRepGetSyncStandbysUsingPriority;

	return group_node;
}

static SyncGroupNode *
add_node(SyncGroupNode *node_list, SyncGroupNode *node)
{
	SyncGroupNode *tmp = node_list;

	/* Add node to the tail of node_list */
	while (tmp->next != NULL)
		tmp = tmp->next;
	tmp->next = node;
	return node_list;
}

void
indent(int level)
{
	int i;

	for (i = 0 ; i < level * 2 ; i++)
		putc(' ', stdout);
}

static void
dump_syncgroupnode(SyncGroupNode *def, int level)
{
	char *typelabel[] = {"MAIN", "NAME", "GROUP"};
	SyncGroupNode *p;

	if (def == NULL)
		return;

	switch (def->type)
	{
	case SYNC_REP_GROUP_NAME:
		indent(level); puts("{");
		indent(level + 1); printf("NODE_TYPE: SYNC_REP_GROUP_NAME\n");
		indent(level + 1); printf("NAME: %s\n", def->name);
		indent(level); puts("}");
		if (def->next)
			dump_syncgroupnode(def->next, level);
		break;
	case SYNC_REP_GROUP_GROUP | SYNC_REP_GROUP_MAIN:
		indent(level); puts("{");
		indent(level + 1); printf("NODE_TYPE: SYNC_REP_GROUP_GROUP | SYNC_REP_GROUP_MAIN\n");
		indent(level + 1); printf("NAME: %s\n", def->name);
		indent(level + 1); printf("SYNC_METHOD: PRIORITY\n");
		indent(level + 1); printf("WAIT_NUM: %d\n", def->wait_num);
		indent(level + 1);
		if (def->members)
			dump_syncgroupnode(def->members, level + 1);
		indent(level); puts("}");
		if (def->next)
			dump_syncgroupnode(def->next, level);
		break;
	default:
		fprintf(stderr, "ERR\n");
		exit(1);
	}
	level--;
}

int main(void)
{
	yydebug = 1;
	yyparse();
	dump_syncgroupnode(SyncRepStandbys, 0);
}

//#include "syncgroup_scanner.c"
#include "lex.yy.c"


Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Fri, Feb 26, 2016 at 10:53 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Fri, 26 Feb 2016 10:38:22 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20160226.103822.12680005.horiguchi.kyotaro@lab.ntt.co.jp> 
>> Hello, Thanks for the new patch.
>>
>>
>> At Fri, 26 Feb 2016 08:52:54 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAZKFVu8-MVhkJ3ywAiJmb=P-HSbJTGi=gK1La73KjS6Q@mail.gmail.com>
> > The previous patch could not parse a one-character standby name correctly.
> > Attached is the latest patch.
>>
>> I haven't looked at it in detail but it won't work as you
>> expected. flex complains as follows for the v12 patch.
>>
>> syncgroup_scanner.l:80: warning, rule cannot be matched
>> syncgroup_scanner.l:84: warning, rule cannot be matched
>
> Making it independent from the postgres body, then compiling it with
> -DYYDEBUG and setting yydebug = 1, would give you valuable information
> and make testing of the parser far easier.

Thank you for your suggestion.
Attached latest version patch.

The changes from previous version are,
- Fix parser, lexer bugs.
- Add regression test patch based on patch Suraji submitted.

Please review it.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Sorry, I misread the previous patch. It actually worked.


At Sun, 28 Feb 2016 04:04:37 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoB69-tNLVzKRZ0Opzsr6LcLY36GJ2tHGohW33Btq3yRsw@mail.gmail.com>
> The changes from previous version are,
> - Fix parser, lexer bugs.
> - Add regression test patch based on patch Suraji submitted.

Thank you for the new patch. The parser looks to work mostly as
expected, but the following warnings were seen on build.

> In file included from syncgroup_gram.y:138:0:
> syncgroup_scanner.l:23:12: warning: ‘xd_size’ defined but not used [-Wunused-variable]
>  static int xd_size; /* actual size of xd_string */
>             ^
> syncgroup_scanner.l:24:12: warning: ‘xd_len’ defined but not used [-Wunused-variable]
>  static int xd_len; /* string length of xd_string  */


Some random comments follow.


Comments for the lexer part.

===
> +node_name    [^\ \,\[\]]

This accepts 'abc^Id' as a name, which is wrong behavior (but
such application names are not allowed anyway. If you assume so,
I'd like to see a comment for that.). And the excessive escaping
makes it a bit hard to read.  The pattern can be written more
precisely as the following. (but I don't know whether it is
generally easy to read..)

| node_name    [\x20-\x7f]{-}[ \[\],]

===
The pattern name {node_name} gives me a bit of
uneasiness. node_name_cont or name_chars would be preferable.

===
> [1-9][0-9]* {

I see no necessity to inhibit 0-prefixed integers as NUM. Would
you mind allowing [0-9]+ there?

===
addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned
char ychar) require different character types. Is there any reason
for that?

===
I personally don't like addlit*string() things for such simple
syntax but itself is acceptable enough for me. However it uses
StringInfo to hold double-quoted names, which pallocs 1024 bytes
of memory chunk for every double-quoted name. The chunks are
finally stacked up left uncollected until the current
memorycontext is deleted or reset (It is deleted just after
finishing config file processing). In addition to that, setting
s_s_names runs the parser twice. It seems to me too greedy and
seems that static char [NAMEDATALEN] is enough using the v12 way
without palloc/repalloc.
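
For illustration, a minimal sketch of the static-buffer coding I have in
mind (mine, not code from any posted patch):

#ifndef NAMEDATALEN
#define NAMEDATALEN 64			/* matches a standard build */
#endif

static char xd_string[NAMEDATALEN];
static int	xd_len;

static void
init_xd_string(void)
{
	xd_len = 0;
	xd_string[0] = '\0';
}

static void
addlitchar_xd_string(char ychar)
{
	/* application_name is limited to NAMEDATALEN - 1 bytes anyway */
	if (xd_len < NAMEDATALEN - 1)
		xd_string[xd_len++] = ychar;
	xd_string[xd_len] = '\0';
}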


Comments for parser part.

===
The rule "result" in syncgruop_gram.y sets malloced chunk to
SyncRepStandbys ignoring exiting content so repetitive setting to
the gud s_s_names causes a memory leak. Using
SyncRepClearStandbyGroupList would be enough.

===
The meaning of SyncGroupNode.type seems obscure. The member seems
to be consulted to decide how to treat the node, but the following
code will break that assumption.

> group_node->type = SYNC_REP_GROUP_GROUP | SYNC_REP_GROUP_MAIN;

It seems to me that *_MAIN is an equivalent of *_GROUP &&
sync_method = *_PRIORITY. If so, *_MAIN is useless. The reader of
SyncGroupNode need not see whether it was in the traditional
s_s_names format or in the new one.

===
Bare names in s_s_names are down-cased and double-quoted ones are
not. The parser in this patch does neither.

===
xd_stringdup() doesn't make a copy of the string, despite its
name. That's error-prone.

===
I found that the name SyncGroupName.wait_num is not
intuitive. How about sync_num, sync_member_num or
sync_standby_num? If the last is preferable, .members also should
be .standbys .


Comment for the quorum commit body part.
===
I am quite uncomfortable with the existence of
WalSnd.sync_standby_priority. It represented the priority in the
old linear s_s_names format but nested groups or even a
single-level quorum list obviously doesn't fit it. Can we get rid
of sync_standby_priority, even though we realize at most
n-priority for now?

===
The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to
have specific code for every prioritizing method (which are
priority, quorum, nested and so on). Is there any reason to use it as
a callback of SyncGroupNode?



Others - random commnets
===
SyncRepClearStandbyGroupList is defined in syncrep.c but the
other related functions are defined in syncgroup_gram.y. It would
be better to place them together.

===
SyncRepStandbys are to be in multilevel and the struct is
naturally allowed to be so but SyncRepClearStandbyGroupList
assumes it in single level. Make the function free multilevel
structures, or explicitly inhibit multilevel using an assertion.

===
-           errdetail("The transaction has already committed locally, but might not have been replicated to the
standby.")));
+           errdetail("The transaction has already committed locally, but might not have been replicated to the
standby(s).")));

The message doesn't contain a specific number of standbys so just
using the plural seems to be enough for me. And besides, the message
should describe the situation more precisely. Word correction is
left to anyone else:)

+           errdetail("The transaction has already committed locally, but might not have been replicated to some of the
requiredstandbys.")));
 

===
+ * Check whether specified standby is active, which means not only having
+ * pid but also having any priority.

"active" means not only defined priority but also have informed
WAL flush position.

+ * Check whether specified standby is active, which means not only having
+ * pid but also having any priority and valid flush position reported.

===
If there's no reason for SyncRepStandbyIsSync not to take WalSnd
directly, taking walsnd is simpler.

static bool SyncRepStandbyIsSync(volatile WalSnd *walsnd);

===
>  * Update the LSNs on each queue based upon our latest state. This
>  * implements a simple policy of first-valid-standby-releases-waiter.
>  *
>  * Other policies are possible, which would change what we do here and what
>  * perhaps also which information we store as well.
>  */
> void
> SyncRepReleaseWaiters(void)

This comment looks wrong for the new code.

===
>    * Select low priority standbys from walsnds array. If there are same
>    * priority standbys, first defined standby is selected. It's possible
>    * to have same priority different standbys, so we can not break loop
>    * even when standby having target_prioirty priority is found.

"low priority" here seems to be a mistake of "high priority
standbys" or "standbys with low priority value".

>    * Returns the list of standbys in sync up to the number that
>    * required to satisfy synchronous_standby_names. If there
>    * are standbys with the same priority values, the first
>    * defined ones are selected. It's possible for multiple
>    * standbys to have a same priority value when multiple
>    * walreceiver gives the same name, so we do not break the
>    * inner loop just by finding a standby with the
>    * target_priority.

===
>   /* Got enough synchronous stnadby */

"staneby" => "standbys"

===
This is a comment from the aspect of abstractness of objects.
The callers of SyncRepGetSyncStandbysUsingPriority() need to care
about the inside of SyncGroupNode but what the function should just
return seems to be the list of walsnds elements. The element number is
useless when the SyncGroupNode nests.

> int
> SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)

This might need to expose 'volatile WalSnd*' (only pointer type)
outside of walsender.

Or it should return the list of index number of
*WalSndCtl->walsnds*.

===
The dependency definition seems to be wrong in the Makefile, so
editing related files won't trigger the appropriate
recompilation. syncgroup_gram.h and syncgroup_gram.c are generated
at once from the .y file, and syncgroup_gram.o is generated from
syncgroup_gram.c and syncgroup_scanner.c.

-syncgroup_gram.o: syncgroup_scanner.c
-
-syncgroup_gram.h: syncgroup_gram.c ;
+syncgroup_gram.o: syncgroup_scanner.c syncgroup_gram.c

===
In pg_stat_get_wal_senders, the num_sync looks to have a chance
to be used uninitialized, but I don't know why the compiler doesn't
complain about it.

regards,


-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Thomas Munro
Date:
On Sun, Feb 28, 2016 at 8:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached latest version patch.
>
> The changes from previous version are,
> - Fix parser, lexer bugs.
> - Add regression test patch based on patch Suraji submitted.
>
> Please review it.
>
> [000_multi_sync_replication_v13.patch]

Hi Masahiko,

I have a couple of small suggestions for the documentation and comments:

+        Specifies a standby names that can support
<firstterm>synchronous replication</> using
+        either two types of syntax; comma-separated list or dedicated
language, as
+        described in <xref linkend="synchronous-replication">.
+        Transcations waiting for commit will be allowed to proceed after the
+        specified number of standby servers confirms receipt of their data.

Suggestion: Specifies the standby names that can support
<firstterm>synchronous replication</> using either of two syntaxes: a
comma-separated list, or a more flexible syntax described in <xref
linkend="synchronous-replication">.  Transactions waiting for commit
will be allowed to proceed after a configurable subset of standby
servers confirms receipt of their data.  For the simple
comma-separated list syntax, it is one server.

+        If the current any of synchronous standbys disconnects for
whatever reason,

s/the current any of/any of the current/

+        no mechanism to enforce uniqueness. For each specified standby name,
+        only the specified count of standbys will be chosen to be synchronous
+        standbys, though exactly which one is indeterminate, the rest will
+        represent potential synchronous standbys.

s/one/ones/
s/indeterminate, the/indeterminate.  The/

+    made by a transcation have been transferred to one or more
synchronous standby
+    server. This extends that standard levelof durability

s/transcation/transaction/
s/that standard levelof/the standard level of/
    offered by a transaction commit. This level of protection is referred
    to as 2-safe replication in computer science theory.
 

Is this still called "2-safe" or does this patch make it "N-safe",
"group-safe", or something else?

-    The minimum wait time is the roundtrip time between primary to standby.
+    The minimum wait time is the roundtrip time between primary to standbys.

Suggestion: The minimum wait time is the roundtrip time between the
primary and the slowest synchronous standby.

+    Multiple synchronous replication is set up by setting <xref
linkend="guc-synchronous-standby-names">
+    using dedicated language. The syntax of dedicated language is following.

Suggestion:  Multiple synchronous replication is set up by setting
<xref linkend="guc-synchronous-standby-names"> using the following
syntax.

+    Using dedicated language, we can define a synchronous group with
a number N.
+    synchronous group can have some members which are consdiered as
synchronous standby using comma-separated list.
+    Any standby name is accepted at any position of its list, but '*'
is accepted at only tailing of the standby list.
+    The leading N is a number which specifies that how many standbys
the master server waits to commit for. This number
+    must be less than actual number of members of its group.
+    The listed standby are given highest priority from left defined
starting with 1.

Suggestion: This syntax allows us to define a synchronous group that
will wait for at least N standbys, and a comma-separated list of group
members.  The special value <literal>*</> is accepted at the tail of
the member list, and matches any standby.  The number N must not be
greater than the number of members listed in the group, unless
<literal>*</> is used.  Priority is given to servers in the order that
they appear in the list.  The first named server has the highest
priority.

+    All ASCII characters except for special characters(',', '"',
'[', ']', ' ') are allowed as standby name.
+    When these special characters are used as standby name, whole
standby name string need to be written in
+    double-quoted representation.

Suggestion:  ... are allowed in unquoted standby names.  To use these
special characters, the standby name should be enclosed in double
quotes.

+ * In 9.5 we support the possibility to have multiple synchronous standbys,

s/9.5/9.6/

+ * as defined in synchronous_standby_names. Before on standby can become a

s/ on / a /

+ * Waiters will be released from the queue once the number of standbys
+ * specified in synchronous_standby_names have caught.

s/caught/processed the commit record/

+ * Check whether specified standby is active, which means not only having
+ * pid but also having any priority.

s/having any priority/having a non-zero priority (meaning it is
configured as potential sync standby)./

- announce_next_takeover = true;

By removing this, haven't we lost the ability to announce takeover
more than once per walsender?  I'm not sure exactly where this should
go now but the walsender needs to detect its own transition from
potential to sync state.  Also, that message, where it appears below
should probably be tweaked slightly s/the/a/, so "standby \"%s\" is
now a synchronous standby with priority %u", not "... the synchronous
standby ...".
/*
+ * Return true if we have enough synchrononized standbys and the 'safe' written
+ * flushed LSNs, which are LSNs assured in all standbys considered should be
+ * synchronized.
+ */

Suggestion:  Return true if we have enough synchronous standbys.  If
true, also store the 'safe' write and flush position in the output
parameters write_pos and flush_pos, but only if the standby managed by
this walsender is one of the standbys that has reached each safe
position respectively.

+ /* Check whether each LSN has advanced to */

Suggestion: /* Check whether this standby has reached the safe positions. */

+/*
+ * Decide synced LSNs at this moment using priority method.
+ * If there are not active standbys enough to determine LSNs, return false.

s/not active standbys enough/not enough active standbys/

+/*
+ * Return the positions of the first group->wait_num synchronized standbys
+ * in group->member list into sync_list. sync_list is assumed to have enough
+ * space for at least group->wait_num elements.
+ */

s/Return/Write/
s/sychronized/synchronous/
Then add: "Return the number found."

+int
+SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, int *sync_list)
+{
+ int target_priority = 1; /* lowest priority is 1 */

1 is actually the *highest* priority standby.

+ /*
+ * Select low priority standbys from walsnds array. If there are same
+ * priority standbys, first defined standby is selected. It's possible
+ * to have same priority different standbys, so we can not break loop
+ * even when standby having target_prioirty priority is found.

s/target_prioirty/target_priority/

+ /* Got enough synchronous stnadby */

s/stnadby/standbys/

+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ (errmsg_internal("The number of group memebers must be less than its
group waits."))));

I'm not sure what the right error code is, but this isn't an syntax
error.  Maybe ERRCODE_CONFIG_FILE_ERROR or
ERRCODE_INVALID_PARAMETER_VALUE?  Suggestion for the message:  "the
configured number of synchronous standbys exceeds the length of the
group of standby names: %d"

+ /*
+ * syncgroup_yyparse sets the global SyncRepStandbys as side effect.
+ * But this function is required to just check, so frees SyncRepStandbyNanes

s/SyncRepStandbyNanes/SyncRepStandbys/ ???

+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ (errmsg_internal("Invalid syntax. synchronous_standby_names parse
returned %d",
+  parse_rc))));

Looking at other error messages I see that they always start with
lower case and then put extra details after ':' rather than using a
'.'.  Maybe this could be "could not parse synchronous_standby_names:
error code %d"?

+#define MAX_WALSENDER_NAME 8192

Seems to be unused.

Thanks!

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
Hi,

Thank you so much for reviewing this patch!

All review comments regarding document and comment are fixed.
Attached latest v14 patch.

> This accepts 'abc^Id' as a name, which is wrong behavior (but
> such application names are not allowed anyway. If you assume so,
> I'd like to see a comment for that.).

'abc^Id' is accepted as application_name, no?
postgres(1)=# set application_name to 'abc^Id';
SET
postgres(1)=# show application_name ;
 application_name
------------------
 abc^Id
(1 row)

> addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned
> char ychar) require different character types. Is there any reason
> for that?

Because addlit_xd_string() is for adding string(char *) to xd_string,
OTOH addlit_xd_char() is for adding just one character to xd_string.

> I personally don't like addlit*string() things for such simple
> syntax but itself is acceptable enough for me. However it uses
> StringInfo to hold double-quoted names, which pallocs 1024 bytes
> of memory chunk for every double-quoted name. The chunks are
> finally stacked up left uncollected until the current
> memorycontext is deleted or reset (It is deleted just after
> finishing config file processing). In addition to that, setting
> s_s_names runs the parser twice. It seems to me too greedy and
> seems that static char [NAMEDATALEN] is enough using the v12 way
> without palloc/repalloc.

I thought that the length of a group name could be more than NAMEDATALEN, so
I used StringInfo.
Is it not necessary?

> I found that the name SyncGroupName.wait_num is not
>> intuitive. How about sync_num, sync_member_num or
> sync_standby_num? If the last is preferable, .members also should
> be .standbys .

Thanks, sync_num is preferable to me.

===
> I am quite uncomfortable with the existence of
>> WalSnd.sync_standby_priority. It represented the priority in the
>> old linear s_s_names format but nested groups or even a
>> single-level quorum list obviously doesn't fit it. Can we get rid
>> of sync_standby_priority, even though we realize at most
>> n-priority for now?

We could get rid of sync_standby_priority.
But if so, we will not be able to see the next sync standby in
pg_stat_replication system view.
Regarding each node's priority, I was thinking that standbys in a quorum
list have the same priority, and in a nested group each standby is given
a priority starting from 1.

===
> The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to
> have specific code for every prioritizing method (which are
>> priority, quorum, nested and so on). Is there any reason to use it as
> a callback of SyncGroupNode?

The reason the current code is like this is that it currently supports
only the priority method.
For the first version of this feature, I'd like to keep the implementation simple.

Aside from this, of course I'm planning to have specific code for nested design.
- The group can have some name nodes or group nodes.
- The group can use either 2 types of method: priority or quorum.
- The group has SyncRepGetSyncedLsnFn() and SyncRepGetStandbysFn()
  - SyncRepGetSyncedLsnsFn() function recursively determine synced LSN
at that moment using group's method.
  - SyncRepGetStandbysFn() function returns standbys of its group,
which are considered as sync using group's method.

For example, s_s_name  = '3(a, b, 2[c,d]::group1)', SyncRepStandbys
memory structure will be,

"main(quorum)" --- "a"
                        |
                        -- "b"
                        |
                        -- "group1(priority)" --- "c"
                                                     |
                                                     -- "d"

When determining synced LSNs, we need to compute group1's LSN using the
priority method first, and then we can determine main's LSN using
the quorum method with the "a" LSNs, "b" LSNs and "group1" LSNs.
So the SyncRepGetSyncedLsnsUsingPriority() function would be,

bool
SyncRepGetSyncedLsnsUsingPriority(*group, *write_lsn, *flush_lsn)
{
    sync_num = group->SynRepGetSyncstandbysFn(group, sync_list);

    if (sync_num < group->sync_num)
        return false;

    for (each member of sync_list)
    {
        if (member->type == group node)
            call SyncRepGetSyncedLsnsFn(member, w, f) and store w and
f into lsn_list.
        else
            Store name node LSNs into lsn_list.
    }

    Determine synced LSNs of this group using lsn_list and priority method.
    Store synced LSNs into write_lsn and flush_lsn.
    return true;
}

> SyncRepClearStandbyGroupList is defined in syncrep.c but the
> other related functions are defined in syncgroup_gram.y. It would
> be better to place them together.

SyncRepClearStandbyGroupList() is used by
check_synchronous_standby_names(), so I put this function in syncrep.c.

> SyncRepStandbys are to be in multilevel and the struct is
> naturally allowed to be so but SyncRepClearStandbyGroupList
> assumes it in single level.

Because I think that we don't need to fully support the
nested style in the first version.
We have to carefully design this feature while considering
expandability, but an overkill implementation could be a cause of crashes.
Considering the remaining time for 9.6, I feel we could implement the quorum
method at best.

> This is a comment from the aspect of abstractness of objects.
> The callers of SyncRepGetSyncStandbysUsingPriority() need to care
>> about the inside of SyncGroupNode but what the function should just
>> return seems to be the list of walsnds elements. The element number is
>> useless when the SyncGroupNode nests.
> > int
> > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)
> This might need to expose 'volatile WalSnd*' (only pointer type)
> outside of walsender.
> Or it should return the list of index number of
> *WalSndCtl->walsnds*.

SyncRepGetSyncStandbysUsingPriority() already returns the list of
index numbers of "WalSndCtl->walsnd" as sync_list, no?
As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need to care about
the inside of SyncGroupNode in my design.
Selecting sync nodes from its group doesn't depend on the type of node.
What SyncRepGetSyncStandbyFn() should do is to select sync nodes from
*its* group.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Thu, Mar 3, 2016 at 11:30 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Hi,
>
> Thank you so much for reviewing this patch!
>
> All review comments regarding document and comment are fixed.
> Attached latest v14 patch.
>
>> This accepts 'abc^Id' as a name, which is wrong behavior (but
>> such application names are not allowed anyway. If you assume so,
>> I'd like to see a comment for that.).
>
> 'abc^Id' is accepted as application_name, no?
> postgres(1)=# set application_name to 'abc^Id';
> SET
> postgres(1)=# show application_name ;
>  application_name
> ------------------
>  abc^Id
> (1 row)
>
>> addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned
>> char ychar) require different character types. Is there any reason
>> for that?
>
> Because addlit_xd_string() is for adding string(char *) to xd_string,
> OTOH addlit_xd_char() is for adding just one character to xd_string.
>
>> I personally don't like addlit*string() things for such simple
>> syntax but itself is acceptable enough for me. However it uses
>> StringInfo to hold double-quoted names, which pallocs 1024 bytes
>> of memory chunk for every double-quoted name. The chunks are
>> finally stacked up left uncollected until the current
>> memorycontext is deleted or reset (It is deleted just after
>> finishing config file processing). In addition to that, setting
>> s_s_names runs the parser twice. It seems to me too greedy and
>> seems that static char [NAMEDATALEN] is enough using the v12 way
>> without palloc/repalloc.
>
> I thought that the length of a group name could be more than NAMEDATALEN, so
> I used StringInfo.
> Is it not necessary?
>
>> I found that the name SyncGroupName.wait_num is not
>> intuitive. How about sync_num, sync_member_num or
>> sync_standby_num? If the last is preferable, .members also should
>> be .standbys .
>
> Thanks, sync_num is preferable to me.
>
> ===
>> I am quite uncomfortable with the existence of
>> WalSnd.sync_standby_priority. It represented the priority in the
>> old linear s_s_names format but nested groups or even a
>> single-level quorum list obviously doesn't fit it. Can we get rid
>> of sync_standby_priority, even though we realize at most
>> n-priority for now?
>
> We could get rid of sync_standby_priority.
> But if so, we will not be able to see the next sync standby in
> pg_stat_replication system view.
> Regarding each node's priority, I was thinking that standbys in a quorum
> list have the same priority, and in a nested group each standby is given
> a priority starting from 1.
>
> ===
>> The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to
>> have specific code for every prioritizing method (which are
>> priority, quorum, nested and so on). Is there any reason to use it as
>> a callback of SyncGroupNode?
>
> The reason the current code is like this is that it currently supports
> only the priority method.
> For the first version of this feature, I'd like to keep the implementation simple.
>
> Aside from this, of course I'm planning to have specific code for nested design.
> - The group can have some name nodes or group nodes.
> - The group can use either 2 types of method: priority or quorum.
> - The group has SyncRepGetSyncedLsnFn() and SyncRepGetStandbysFn()
>   - SyncRepGetSyncedLsnsFn() function recursively determine synced LSN
> at that moment using group's method.
>   - SyncRepGetStandbysFn() function returns standbys of its group,
> which are considered as sync using group's method.
>
> For example, s_s_name  = '3(a, b, 2[c,d]::group1)', SyncRepStandbys
> memory structure will be,
>
> "main(quorum)" --- "a"
>                         |
>                         -- "b"
>                         |
>                         -- "group1(priority)" --- "c"
>                                                      |
>                                                      -- "d"
>
> When determining synced LSNs, we need to compute group1's LSN using the
> priority method first, and then we can determine main's LSN using
> the quorum method with the "a" LSNs, "b" LSNs and "group1" LSNs.
> So the SyncRepGetSyncedLsnsUsingPriority() function would be,
>
> bool
> SyncRepGetSyncedLsnsUsingPriority(*group, *write_lsn, *flush_lsn)
> {
>     sync_num = group->SynRepGetSyncstandbysFn(group, sync_list);
>
>     if (sync_num < group->sync_num)
>         return false;
>
>     for (each member of sync_list)
>     {
>         if (member->type == group node)
>             call SyncRepGetSyncedLsnsFn(member, w, f) and store w and
> f into lsn_list.
>         else
>             Store name node LSNs into lsn_list.
>     }
>
>     Determine synced LSNs of this group using lsn_list and priority method.
>     Store synced LSNs into write_lsn and flush_lsn.
>     return true;
> }
>
>> SyncRepClearStandbyGroupList is defined in syncrep.c but the
>> other related functions are defined in syncgroup_gram.y. It would
>> be better to place them together.
>
> SyncRepClearStandbyGroupList() is used by
> check_synchronous_standby_names(), so I put this function in syncrep.c.
>
>> SyncRepStandbys are to be in multilevel and the struct is
>> naturally allowed to be so but SyncRepClearStandbyGroupList
>> assumes it in single level.
>
> Because I think that we don't need to fully support the
> nested style in the first version.
> We have to carefully design this feature while considering
> expandability, but an overkill implementation could be a cause of crashes.
> Considering the remaining time for 9.6, I feel we could implement the quorum
> method at best.
>
>> This is a comment from the aspect of abstractness of objects.
>> The callers of SyncRepGetSyncStandbysUsingPriority() need to care
>> about the inside of SyncGroupNode but what the function should just
>> return seems to be the list of walsnds elements. The element number is
>> useless when the SyncGroupNode nests.
>> > int
>> > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)
>> This might need to expose 'volatile WalSnd*' (only pointer type)
>> outside of walsender.
>> Or it should return the list of index number of
>> *WalSndCtl->walsnds*.
>
> SyncRepGetSyncStandbysUsingPriority() already returns the list of
> index numbers of "WalSndCtl->walsnd" as sync_list, no?
> As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need to care about
> the inside of SyncGroupNode in my design.
> Selecting sync nodes from its group doesn't depend on the type of node.
> What SyncRepGetSyncStandbyFn() should do is to select sync nodes from
> *its* group.
>

The previous patch had a bug in the GUC parameter handling.
Attached is an updated version.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Thomas Munro
Date:
On Fri, Mar 4, 2016 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> The previous patch had a bug in the GUC parameter handling.
> Attached is an updated version.

I spotted a couple of typos:

+    used.  Priority is given to servers in the order that the appear
in the list.

s/the appear/they appear/

-    The minimum wait time is the roundtrip time between primary to standby.
+    The minimum wait time is the roundtrip time between the primary and the
+    almost synchronous standby.

s/almost/slowest/

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

Sorry in advance for the long, hard-to-read writing..

At Thu, 3 Mar 2016 23:30:49 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoD3XGZtuvgc5uKJdvcoJP5S0rvGQQCJLRL4rLsruRch5Q@mail.gmail.com>
> Hi,
> 
> Thank you so much for reviewing this patch!
> 
> All review comments regarding document and comment are fixed.
> Attached latest v14 patch.
> 
> > This accepts 'abc^Id' as a name, which is wrong behavior (but
> > such appliction names are not allowed anyway. If you assume so,
> > I'd like to see a comment for that.).
> 
> 'abc^Id' is accepted as application_name, no?
> postgres(1)=# set application_name to 'abc^Id';
> SET
> postgres(1)=# show application_name ;
>  application_name
> ------------------
>  abc^Id
> (1 row)

Sorry, I implicitly used "^" to mean the Ctrl key, so "^I" is
Ctrl-I, that is, a horizontal tab (0x09). The following psql
session shows the behavior.

=# set application_name to E'abc\td';
=# show application_name ;
 application_name
------------------
 ab?d
(1 row)

The <tab> is replaced with a literal '?' at the time of GUC
assignment.

> > addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned
> > char ychar) requires differnt character types. Is there any reason
> > for that?
> 
> Because addlit_xd_string() is for adding string(char *) to xd_string,
> OTOH addlit_xd_char() is for adding just one character to xd_string.

Umm. My question might have been a bit beside the point.

addlitchar_xd_string(unsigned char ychar) ends up calling
appendStringInfoChar(..., ychar). On the other hand, the
stringinfo function has the following signature:

appendStringInfoChar(StringInfo str, char ch);

By default, plain "char" is equivalent to "signed char" on most
platforms, so addlitchar_xd_string passes a character held in an
"unsigned char" to a parameter declared as (signed) "char".

These two are incompatible types. Imagine the following codelet:

#include <stdio.h>

void hoge(signed char c)
{
    int ch = c;
    fprintf(stderr, "char = %d\n", ch);
}

int main(void)
{
    unsigned char u;

    u = 200;
    hoge(u);
    return 0;
}

The result is -56. So we should generally avoid this kind of
signedness mixture when there is no particular reason for it.

In this case the variable's domain is 0x20-0x7e, so the problem
never actually materializes, but there is still no reason for the
mixed signedness.

> > I personally don't like addlit*string() things for such simple
> > syntax but itself is acceptble enough for me. However it uses
> > StringInfo to hold double-quoted names, which pallocs 1024 bytes
> > of memory chunk for every double-quoted name. The chunks are
> > finally stacked up left uncollected until the current
> > memorycontext is deleted or reset (It is deleted just after
> > finishing config file processing). Addition to that, setting
> > s_s_names runs the parser twice. It seems to me too greedy and
> > seems that static char [NAMEDATALEN] is enough using the v12 way
> > without palloc/repalloc.
> 
> I though that length of group name could be more than NAMEDATALEN, so
> I use StringInfo.
> Is it not necessary?

Such long names don't seem necessary. Overly long identifiers no
longer act as identifiers for human eyeballs. We limit the length
of identifiers across the whole database system to NAMEDATALEN-1,
which seems to have been enough, so I don't see any reason to
allow a group name longer than that.
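
To illustrate the static-buffer idea, here is a minimal sketch
(the buffer names and overflow policy are mine, not from any patch
version; note that a plain "char" parameter would also avoid the
signedness mixture discussed above):

static char xd_strbuf[NAMEDATALEN];
static int  xd_strlen = 0;

/* Append one character of a double-quoted name, without palloc. */
static void
addlitchar_xd_string(char ychar)
{
    if (xd_strlen < NAMEDATALEN - 1)
        xd_strbuf[xd_strlen++] = ychar;
    xd_strbuf[xd_strlen] = '\0';
}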

> > I found that the name SyncGroupName.wait_num is not
> > instinctive. How about sync_num, sync_member_num or
> > sync_standby_num? If the last is preferable, .members also should
> > be .standbys .
> 
> Thanks, sync_num is preferable to me.
> 
> ===
> > I am quite uncomfortable with the existence of
> > WanSnd.sync_standby_priority. It represented the pirority in the
> > old linear s_s_names format but nested groups or even
> > single-level quarum list obviously doesn't fit it. Can we get rid
> > of sync_standby_priority, even though we realize atmost
> > n-priority for now?
> 
> We could get rid of sync_standby_priority.
> But if so, we will not be able to see the next sync standby in
> pg_stat_replication system view.
> Regarding each node priority, I was thinking that standbys in quorum
> list have same priority, and in nested group each standbys are given
> the priority starting from 1.

As far as I can see, the variable is referred to only as a boolean
indicating whether a walsender is connected to a candidate
synchronous standby. So the value itself is totally useless, at
least for now. However, SyncRepReleaseWaiters uses the value to
check whether the synced LSNs can be advanced by a walsender, so
the variable is useful as a boolean.

In the previous versions, the reason WalSnd had the priority value
is that a pair of synchronized LSNs was determined by only one
walsender, the one with the highest priority among the active
walsenders. So even if a walsender received a response from its
walreceiver, it didn't need to do anything unless it was at the
highest priority. It was a simple world.

In the quorum commit world, in contrast, what
SyncRepGetSyncStandbysFn should do is return certain private
information to be used to calculate a pair of safe/synced LSNs in
SyncRepGetSyncedLsnsFn, looking into the WalSndCtl->walsnds list.
The latter passes the pair of safe/synced LSNs up to the
upper-level list, or to SyncRepSyncedLsnAdvancedTo as the topmost
caller. There's no room for sync_standby_priority to serve its
original purpose.

Even if we assigned the value in the way explained above, the
values would always be 1 for the quorum method and duplicated for
the multiple-priority method. What do you want the value to show
to users?


> ===
> > The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to
> > have specific code for every prioritizing method (which are
> > priority, quorum, nested and so). Is there any reson to use it as
> > a callback of SyncGroupNode?
> 
> The reason why the current code is so is that current code is for only
> priority method supporting.
> At first version of this feature, I'd like to implement it more simple.
> 
> Aside from this, of course I'm planning to have specific code for nested design.
> - The group can have some name nodes or group nodes.
> - The group can use either 2 types of method: priority or quorum.
> - The group has SyncRepGetSyncedLsnFn() and SyncRepGetStandbysFn()
>   - SyncRepGetSyncedLsnsFn() function recursively determine synced LSN
> at that moment using group's method.
>   - SyncRepGetStandbysFn() function returns standbys of its group,
> which are considered as sync using group's method.
> 
> For example, s_s_name  = '3(a, b, 2[c,d]::group1)', SyncRepStandbys
> memory structure will be,
> 
> "main(quorum)" --- "a"
>                         |
>                         -- "b"
>                         |
>                         -- "group1(priority)" --- "c"
>                                                      |
>                                                      -- "d"
> 
> When determine synced LSNs, we need to consider group1's LSN using by
> priority method at first, and then we can determine main's LSN using
> by quorum method with "a" LSNs, "b" LSNs and "group1" LSNs.
> So SyncRepGetSyncedLsnsUsingPriority() function would be,

Thank you for the explanation. I *recalled* that.

> > SyncRepClearStandbyGroupList is defined in syncrep.c but the
> > other related functions are defined in syncgroup_gram.y. It would
> > be better to place them together.
> 
> SyncRepClearStandbyGroupList() is used by
> check_synchronous_standby_names(), so I put this function syncrep.c.

Thanks.

> > SyncRepStandbys are to be in multilevel and the struct is
> > naturally allowed to be so but SyncRepClearStandbyGroupList
> > assumes it in single level.
> 
> Because I think that we don't need to implement to fully support
> nested style at first version.
> We have to carefully design this feature while considering
> expandability, but overkill implementation could be cause of crash.
> Consider remaining time for 9.6, I feel we could implement quorum
> method at best.

Yes, so I proposed adding an Assert() to the function.

> > This is a comment from the aspect of abstractness of objects.
> > The callers of SyncRepGetSyncStandbysUsingPriority() need to care
> > the inside of SyncGroupNode but what the function should just
> > return seems to be the list of wansnds element. Element number is
> > useless when the SyncGroupNode nests.
> > > int
> > > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)
> > This might need to expose 'volatile WalSnd*' (only pointer type)
> > outside of walsender.
> > Or it should return the list of index number of
> > *WalSndCtl->walsnds*.
> 
> SyncRepGetSyncStandbysUsingPriority() already returns the list of
> index number of "WalSndCtl->walsnd" as sync_list, no?

Yes, I myself don't understand what I was trying to say there :(
Maybe I mistook sync_list for an index list of SyncGroupNode.
Anyway, sorry for the noise.

> As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need care the
> inside of SyncGroupNode in my design.
> Selecting sync nodes from its group doesn't depend on the type of node.
> What SyncRepGetSyncStandbyFn() should do is to select sync node from
> *its* group.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
Replying to multiple hackers.
Thank you for reviewing this patch.

> +    used.  Priority is given to servers in the order that the appear
> in the list.
>
> s/the appear/they appear/
>
> -    The minimum wait time is the roundtrip time between primary to standby.
> +    The minimum wait time is the roundtrip time between the primary and the
> +    almost synchronous standby.
>
> s/almost/slowest/

Will fix this typo. Thanks!

On Fri, Mar 4, 2016 at 5:22 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> Sorry for long, hard-to-read writings in advance..
>
> At Thu, 3 Mar 2016 23:30:49 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
> <CAD21AoD3XGZtuvgc5uKJdvcoJP5S0rvGQQCJLRL4rLsruRch5Q@mail.gmail.com>
>> Hi,
>>
>> Thank you so much for reviewing this patch!
>>
>> All review comments regarding document and comment are fixed.
>> Attached latest v14 patch.
>>
>> > This accepts 'abc^Id' as a name, which is wrong behavior (but
>> > such appliction names are not allowed anyway. If you assume so,
>> > I'd like to see a comment for that.).
>>
>> 'abc^Id' is accepted as application_name, no?
>> postgres(1)=# set application_name to 'abc^Id';
>> SET
>> postgres(1)=# show application_name ;
>>  application_name
>> ------------------
>>  abc^Id
>> (1 row)
>
> Sorry, I implicitly used "^" in the meaning of "ctrl key". So
> "^I" is so-called Ctrl-I, that is horizontal tab or 0x09. So the
> following in psql shows that.
>
> =# set application_name to E'abc\td';
> =# show application_name ;
>  application_name
> ------------------
>  ab?d
> (1 row)
>
> The <tab> is replaced with '?' (literally) at the time of
> guc assinment.

Oh, I see.
I will add a comment about that.

>> > addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned
>> > char ychar) requires differnt character types. Is there any reason
>> > for that?
>>
>> Because addlit_xd_string() is for adding string(char *) to xd_string,
>> OTOH addlit_xd_char() is for adding just one character to xd_string.
>
> Umm. My qustion might have been a bit out of the point.
>
> The addlitchar_xd_string(str,unsigned char c) does
> appendStringInfoChar(, c). On the other hand, the signature of
> the function of stringinfo is the following.
>
> AppendStringInfoChar(StringInfo str, char ch);
>
> Of course "char" is equivalent of "signed char" as
> default. addlitchar_xd_string assigns the given character in
> "unsigned char" to the parameter of AppendStringInfoChar of
> "signed char".
>
> These two are incompatible types. Imagine the
> following codelet,
>
> #include <stdio.h>
>
> void hoge(signed char c){
>   int ch = c;
>   fprintf(stderr, "char = %d\n", ch);
> }
>
> int main(void)
> {
>   unsigned char u;
>
>   u = 200;
>   hoge(u);
>   return 0;
> }
>
> The result is -56. So we generally should get rid of such type of
> mixture of signedness for no particular reason.
>
> In this case, the domain of the variable is 0x20-0x7e so no
> problem won't be actualized but also there's no reason for the
> signedness mixture.

Thank you for the explanation.
I will fix this.

>> > I personally don't like addlit*string() things for such simple
>> > syntax but itself is acceptble enough for me. However it uses
>> > StringInfo to hold double-quoted names, which pallocs 1024 bytes
>> > of memory chunk for every double-quoted name. The chunks are
>> > finally stacked up left uncollected until the current
>> > memorycontext is deleted or reset (It is deleted just after
>> > finishing config file processing). Addition to that, setting
>> > s_s_names runs the parser twice. It seems to me too greedy and
>> > seems that static char [NAMEDATALEN] is enough using the v12 way
>> > without palloc/repalloc.
>>
>> I though that length of group name could be more than NAMEDATALEN, so
>> I use StringInfo.
>> Is it not necessary?
>
> Such long names doesn't seem to necessary. Too long identifiers
> no longer act as identifier for human eyeballs. We are limiting
> the length of identifiers of the whole database system to
> NAMEDATALEN-1, which seems to have been enough so I don't see any
> reason to have a group name longer than that.
>

I see. I will fix this.

>> > I found that the name SyncGroupName.wait_num is not
>> > instinctive. How about sync_num, sync_member_num or
>> > sync_standby_num? If the last is preferable, .members also should
>> > be .standbys .
>>
>> Thanks, sync_num is preferable to me.
>>
>> ===
>> > I am quite uncomfortable with the existence of
>> > WanSnd.sync_standby_priority. It represented the pirority in the
>> > old linear s_s_names format but nested groups or even
>> > single-level quarum list obviously doesn't fit it. Can we get rid
>> > of sync_standby_priority, even though we realize atmost
>> > n-priority for now?
>>
>> We could get rid of sync_standby_priority.
>> But if so, we will not be able to see the next sync standby in
>> pg_stat_replication system view.
>> Regarding each node priority, I was thinking that standbys in quorum
>> list have same priority, and in nested group each standbys are given
>> the priority starting from 1.
>
> As far as I can see the varialbe is referred to as a boolean to
> indicate whether a walsernder is connected to a candidate
> synchronous standby. So the value is totally useless, at least
> for now. However, SyncRepRelaseWaiters uses the value to check if
> the synced LSNs can be advaned by a walsender so the variable is
> useful as a boolean.
>
> In the previous versions, the reason why WanSnd had the priority
> value is that a pair of synchronized LSNs is determined only by
> one wansender, which has the highest priority among active
> wansenders. So even if a walsender receives a response from
> walreceiver, it doesn't need to do nothing if it is not at the
> highest priority. It's a simple world.
>
> In the quorum commit word, in contrast, what
> SyncRepGetSyncStandbysFn shoud do is returning certain private
> information to be used to calculate a pair of safe/synched LSNs
> in SyncRepGetSYncedLsnsFn looking into WalSndCtl->wansnds
> list. The latter passes a pair of safe/synced LSNs to the upper
> level list or SyncRepSyncedLsnAdvancedTo as the topmost
> caller. There's no room for sync_standby_priority to work as the
> original objective.
>
> Even if we assign the value in the explained way, the values are
> always 1 for quorum method and duplicate values for multiple
> priority method. What do you want to show by the value to users?

I agree with you.
When we implement the nested style of multiple synchronous
replication, it would be tough to present that to users via
sync_standby_priority.
But for our current first goal (implementing the 1-nest style),
we don't seem to need to remove sync_standby_priority from WalSnd
yet, no?
Towards the multiple nested style, I'm roughly planning to define
a new system view as follows.

- The new system view shows information for all groups and nodes.
- Move sync_state from pg_stat_replication to the new system view.
- Remove sync_priority from pg_stat_replication.
- Add a new sync_state 'quorum' that indicates candidate sync
standbys of a group using the quorum method.
- If the parent group's state is potential, a 'potential:' prefix
is added to the child standby's sync_state.

* s_s_names = '2[a, 1(b,c):group1, 1[d,e]:group2]'

  name  | sync_method |      member       | sync_num |     sync_state      | parent_group
--------+-------------+-------------------+----------+---------------------+--------------
 main   | priority    | {a,group1,group2} |        2 |                     |
 a      |             |                   |          | sync                | main
 group1 | quorum      | {b,c}             |        1 | sync                | main
 b      |             |                   |          | sync                | group1
 c      |             |                   |          | potential           | group1
 group2 | priority    | {d,e}             |        1 | potential           | main
 d      |             |                   |          | potential:sync      | group2
 e      |             |                   |          | potential:potential | group2
(8 rows)

* s_s_names = '2(a, 1[b,c]:group1, 1(d,e):group2)'

  name  | sync_method |      member       | sync_num |     sync_state      | parent_group
--------+-------------+-------------------+----------+---------------------+--------------
 main   | quorum      | {a,group1,group2} |        2 |                     |
 a      |             |                   |          | quorum              | main
 group1 | priority    | {b,c}             |        1 | quorum              | main
 b      |             |                   |          | sync                | group1
 c      |             |                   |          | potential           | group1
 group2 | quorum      | {d,e}             |        1 | quorum              | main
 d      |             |                   |          | quorum              | group2
 e      |             |                   |          | quorum              | group2
(8 rows)
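
For example, the view could then be queried like this (the view
name pg_stat_replication_groups is only a placeholder for this
sketch; nothing is decided yet):

SELECT name, sync_method, member, sync_num, sync_state, parent_group
FROM pg_stat_replication_groups
ORDER BY parent_group NULLS FIRST, name;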

>> > SyncRepStandbys are to be in multilevel and the struct is
>> > naturally allowed to be so but SyncRepClearStandbyGroupList
>> > assumes it in single level.
>>
>> Because I think that we don't need to implement to fully support
>> nested style at first version.
>> We have to carefully design this feature while considering
>> expandability, but overkill implementation could be cause of crash.
>> Consider remaining time for 9.6, I feel we could implement quorum
>> method at best.
>
> Yes, so I proposed to ass Aseert() in the function.

Will add it.


Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Fri, Mar 4, 2016 at 3:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Mar 3, 2016 at 11:30 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Hi,
>>
>> Thank you so much for reviewing this patch!
>>
>> All review comments regarding document and comment are fixed.
>> Attached latest v14 patch.
>>
>>> This accepts 'abc^Id' as a name, which is wrong behavior (but
>>> such appliction names are not allowed anyway. If you assume so,
>>> I'd like to see a comment for that.).
>>
>> 'abc^Id' is accepted as application_name, no?
>> postgres(1)=# set application_name to 'abc^Id';
>> SET
>> postgres(1)=# show application_name ;
>>  application_name
>> ------------------
>>  abc^Id
>> (1 row)
>>
>>> addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned
>>> char ychar) requires differnt character types. Is there any reason
>>> for that?
>>
>> Because addlit_xd_string() is for adding string(char *) to xd_string,
>> OTOH addlit_xd_char() is for adding just one character to xd_string.
>>
>>> I personally don't like addlit*string() things for such simple
>>> syntax but itself is acceptble enough for me. However it uses
>>> StringInfo to hold double-quoted names, which pallocs 1024 bytes
>>> of memory chunk for every double-quoted name. The chunks are
>>> finally stacked up left uncollected until the current
>>> memorycontext is deleted or reset (It is deleted just after
>>> finishing config file processing). Addition to that, setting
>>> s_s_names runs the parser twice. It seems to me too greedy and
>>> seems that static char [NAMEDATALEN] is enough using the v12 way
>>> without palloc/repalloc.
>>
>> I though that length of group name could be more than NAMEDATALEN, so
>> I use StringInfo.
>> Is it not necessary?
>>
>>> I found that the name SyncGroupName.wait_num is not
>>> instinctive. How about sync_num, sync_member_num or
>>> sync_standby_num? If the last is preferable, .members also should
>>> be .standbys .
>>
>> Thanks, sync_num is preferable to me.
>>
>> ===
>>> I am quite uncomfortable with the existence of
>>> WanSnd.sync_standby_priority. It represented the pirority in the
>>> old linear s_s_names format but nested groups or even
>>> single-level quarum list obviously doesn't fit it. Can we get rid
>>> of sync_standby_priority, even though we realize atmost
>>> n-priority for now?
>>
>> We could get rid of sync_standby_priority.
>> But if so, we will not be able to see the next sync standby in
>> pg_stat_replication system view.
>> Regarding each node priority, I was thinking that standbys in quorum
>> list have same priority, and in nested group each standbys are given
>> the priority starting from 1.
>>
>> ===
>>> The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to
>>> have specific code for every prioritizing method (which are
>>> priority, quorum, nested and so). Is there any reson to use it as
>>> a callback of SyncGroupNode?
>>
>> The reason why the current code is so is that current code is for only
>> priority method supporting.
>> At first version of this feature, I'd like to implement it more simple.
>>
>> Aside from this, of course I'm planning to have specific code for nested design.
>> - The group can have some name nodes or group nodes.
>> - The group can use either 2 types of method: priority or quorum.
>> - The group has SyncRepGetSyncedLsnFn() and SyncRepGetStandbysFn()
>>   - SyncRepGetSyncedLsnsFn() function recursively determine synced LSN
>> at that moment using group's method.
>>   - SyncRepGetStandbysFn() function returns standbys of its group,
>> which are considered as sync using group's method.
>>
>> For example, s_s_name  = '3(a, b, 2[c,d]::group1)', SyncRepStandbys
>> memory structure will be,
>>
>> "main(quorum)" --- "a"
>>                         |
>>                         -- "b"
>>                         |
>>                         -- "group1(priority)" --- "c"
>>                                                      |
>>                                                      -- "d"
>>
>> When determine synced LSNs, we need to consider group1's LSN using by
>> priority method at first, and then we can determine main's LSN using
>> by quorum method with "a" LSNs, "b" LSNs and "group1" LSNs.
>> So SyncRepGetSyncedLsnsUsingPriority() function would be,
>>
>> bool
>> SyncRepGetSyncedLsnsUsingPriority(*group, *write_lsn, *flush_lsn)
>> {
>>     sync_num = group->SynRepGetSyncstandbysFn(group, sync_list);
>>
>>     if (sync_num < group->sync_num)
>>         return false;
>>
>>     for (each member of sync_list)
>>     {
>>         if (member->type == group node)
>>             call SyncRepGetSyncedLsnsFn(member, w, f) and store w and
>> f into lsn_list.
>>         else
>>             Store name node LSNs into lsn_list.
>>     }
>>
>>     Determine synced LSNs of this group using lsn_list and priority method.
>>     Store synced LSNs into write_lsn and flush_lsn.
>>     return true;
>> }
>>
>>> SyncRepClearStandbyGroupList is defined in syncrep.c but the
>>> other related functions are defined in syncgroup_gram.y. It would
>>> be better to place them together.
>>
>> SyncRepClearStandbyGroupList() is used by
>> check_synchronous_standby_names(), so I put this function syncrep.c.
>>
>>> SyncRepStandbys are to be in multilevel and the struct is
>>> naturally allowed to be so but SyncRepClearStandbyGroupList
>>> assumes it in single level.
>>
>> Because I think that we don't need to implement to fully support
>> nested style at first version.
>> We have to carefully design this feature while considering
>> expandability, but overkill implementation could be cause of crash.
>> Consider remaining time for 9.6, I feel we could implement quorum
>> method at best.
>>
>>> This is a comment from the aspect of abstractness of objects.
>>> The callers of SyncRepGetSyncStandbysUsingPriority() need to care
>>> the inside of SyncGroupNode but what the function should just
>>> return seems to be the list of wansnds element. Element number is
>>> useless when the SyncGroupNode nests.
>>> > int
>>> > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)
>>> This might need to expose 'volatile WalSnd*' (only pointer type)
>>> outside of walsender.
>>> Or it should return the list of index number of
>>> *WalSndCtl->walsnds*.
>>
>> SyncRepGetSyncStandbysUsingPriority() already returns the list of
>> index number of "WalSndCtl->walsnd" as sync_list, no?
>> As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need care the
>> inside of SyncGroupNode in my design.
>> Selecting sync nodes from its group doesn't depend on the type of node.
>> What SyncRepGetSyncStandbyFn() should do is to select sync node from
>> *its* group.
>>
>
> Previous patch has bug around GUC parameter handling.
> Attached updated version.

Thanks for updating the patch!

Now I'm fixing some problems (e.g., the current patch doesn't work
in an EXEC_BACKEND environment) and revising the patch.
I will post the revised version this weekend or the first half
of next week.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Thomas Munro
Date:
    <para>
     Synchronous replication offers the ability to confirm that all changes
-    made by a transaction have been transferred to one synchronous standby
-    server. This extends the standard level of durability
+    made by a transaction have been transferred to one or more synchronous standby
+    server. This extends that standard level of durability
     offered by a transaction commit. This level of protection is referred
-    to as 2-safe replication in computer science theory.
+    to as group-safe replication in computer science theory.
    </para>

A message on the -general list today pointed me to some earlier
discussion[1] which quoted and referenced definitions of these
academic terms[2].  I think the above documentation should say:

"This level of protection is referred to as 2-safe replication in
computer science literature when <variable>synchronous_commit</> is
set to <literal>on</>, and group-1-safe (group-safe and 1-safe) when
<variable>synchronous_commit</> is set to <literal>remote_write</>."

By my reading, the situation doesn't actually change with this patch.
It doesn't matter whether you need 1 or 42 synchronous standbys to
make a quorum: 2-safe means durable (fsync) on all of them,
group-1-safe means durable on one server and received (implied by
remote_write) by all of them.
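
To make that concrete with the syntax from the patch under
discussion (the standby names are made up for this example):

# 2-safe: commit waits until the change is durable (fsynced) on
# the master and on the two chosen synchronous standbys
synchronous_commit = on
synchronous_standby_names = '2[s1, s2, s3]'

# group-1-safe: durable on the master, merely received by the two
# chosen synchronous standbys
synchronous_commit = remote_write
synchronous_standby_names = '2[s1, s2, s3]'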

I think we should be using those definitions because Gray's earlier
definition of 2-safe from Transaction Processing 12.6.3 doesn't really
fit:  It can optionally mean remote receipt or remote durable storage,
but it doesn't wait if the 'backup' is down, so it's not the same type
of guarantee.  (He also has 'very safe' which might describe our
syncrep, I'm not sure.)

[1] http://www.postgresql.org/message-id/603c8f070812132142n5408e7ddk899e83cddd4cb0b2@mail.gmail.com
[2] http://infoscience.epfl.ch/record/33053/files/EPFL_TH2577.pdf page 76

On Thu, Mar 10, 2016 at 11:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 3:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Thu, Mar 3, 2016 at 11:30 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Hi,
>>>
>>> Thank you so much for reviewing this patch!
>>>
>>> All review comments regarding document and comment are fixed.
>>> Attached latest v14 patch.
>>>
>>>> This accepts 'abc^Id' as a name, which is wrong behavior (but
>>>> such appliction names are not allowed anyway. If you assume so,
>>>> I'd like to see a comment for that.).
>>>
>>> 'abc^Id' is accepted as application_name, no?
>>> postgres(1)=# set application_name to 'abc^Id';
>>> SET
>>> postgres(1)=# show application_name ;
>>>  application_name
>>> ------------------
>>>  abc^Id
>>> (1 row)
>>>
>>>> addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned
>>>> char ychar) requires differnt character types. Is there any reason
>>>> for that?
>>>
>>> Because addlit_xd_string() is for adding string(char *) to xd_string,
>>> OTOH addlit_xd_char() is for adding just one character to xd_string.
>>>
>>>> I personally don't like addlit*string() things for such simple
>>>> syntax but itself is acceptble enough for me. However it uses
>>>> StringInfo to hold double-quoted names, which pallocs 1024 bytes
>>>> of memory chunk for every double-quoted name. The chunks are
>>>> finally stacked up left uncollected until the current
>>>> memorycontext is deleted or reset (It is deleted just after
>>>> finishing config file processing). Addition to that, setting
>>>> s_s_names runs the parser twice. It seems to me too greedy and
>>>> seems that static char [NAMEDATALEN] is enough using the v12 way
>>>> without palloc/repalloc.
>>>
>>> I though that length of group name could be more than NAMEDATALEN, so
>>> I use StringInfo.
>>> Is it not necessary?
>>>
>>>> I found that the name SyncGroupName.wait_num is not
>>>> instinctive. How about sync_num, sync_member_num or
>>>> sync_standby_num? If the last is preferable, .members also should
>>>> be .standbys .
>>>
>>> Thanks, sync_num is preferable to me.
>>>
>>> ===
>>>> I am quite uncomfortable with the existence of
>>>> WanSnd.sync_standby_priority. It represented the pirority in the
>>>> old linear s_s_names format but nested groups or even
>>>> single-level quarum list obviously doesn't fit it. Can we get rid
>>>> of sync_standby_priority, even though we realize atmost
>>>> n-priority for now?
>>>
>>> We could get rid of sync_standby_priority.
>>> But if so, we will not be able to see the next sync standby in
>>> pg_stat_replication system view.
>>> Regarding each node priority, I was thinking that standbys in quorum
>>> list have same priority, and in nested group each standbys are given
>>> the priority starting from 1.
>>>
>>> ===
>>>> The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to
>>>> have specific code for every prioritizing method (which are
>>>> priority, quorum, nested and so). Is there any reson to use it as
>>>> a callback of SyncGroupNode?
>>>
>>> The reason why the current code is so is that current code is for only
>>> priority method supporting.
>>> At first version of this feature, I'd like to implement it more simple.
>>>
>>> Aside from this, of course I'm planning to have specific code for nested design.
>>> - The group can have some name nodes or group nodes.
>>> - The group can use either 2 types of method: priority or quorum.
>>> - The group has SyncRepGetSyncedLsnFn() and SyncRepGetStandbysFn()
>>>   - SyncRepGetSyncedLsnsFn() function recursively determine synced LSN
>>> at that moment using group's method.
>>>   - SyncRepGetStandbysFn() function returns standbys of its group,
>>> which are considered as sync using group's method.
>>>
>>> For example, s_s_name  = '3(a, b, 2[c,d]::group1)', SyncRepStandbys
>>> memory structure will be,
>>>
>>> "main(quorum)" --- "a"
>>>                         |
>>>                         -- "b"
>>>                         |
>>>                         -- "group1(priority)" --- "c"
>>>                                                      |
>>>                                                      -- "d"
>>>
>>> When determine synced LSNs, we need to consider group1's LSN using by
>>> priority method at first, and then we can determine main's LSN using
>>> by quorum method with "a" LSNs, "b" LSNs and "group1" LSNs.
>>> So SyncRepGetSyncedLsnsUsingPriority() function would be,
>>>
>>> bool
>>> SyncRepGetSyncedLsnsUsingPriority(*group, *write_lsn, *flush_lsn)
>>> {
>>>     sync_num = group->SynRepGetSyncstandbysFn(group, sync_list);
>>>
>>>     if (sync_num < group->sync_num)
>>>         return false;
>>>
>>>     for (each member of sync_list)
>>>     {
>>>         if (member->type == group node)
>>>             call SyncRepGetSyncedLsnsFn(member, w, f) and store w and
>>> f into lsn_list.
>>>         else
>>>             Store name node LSNs into lsn_list.
>>>     }
>>>
>>>     Determine synced LSNs of this group using lsn_list and priority method.
>>>     Store synced LSNs into write_lsn and flush_lsn.
>>>     return true;
>>> }
>>>
>>>> SyncRepClearStandbyGroupList is defined in syncrep.c but the
>>>> other related functions are defined in syncgroup_gram.y. It would
>>>> be better to place them together.
>>>
>>> SyncRepClearStandbyGroupList() is used by
>>> check_synchronous_standby_names(), so I put this function syncrep.c.
>>>
>>>> SyncRepStandbys are to be in multilevel and the struct is
>>>> naturally allowed to be so but SyncRepClearStandbyGroupList
>>>> assumes it in single level.
>>>
>>> Because I think that we don't need to implement to fully support
>>> nested style at first version.
>>> We have to carefully design this feature while considering
>>> expandability, but overkill implementation could be cause of crash.
>>> Consider remaining time for 9.6, I feel we could implement quorum
>>> method at best.
>>>
>>>> This is a comment from the aspect of abstractness of objects.
>>>> The callers of SyncRepGetSyncStandbysUsingPriority() need to care
>>>> the inside of SyncGroupNode but what the function should just
>>>> return seems to be the list of wansnds element. Element number is
>>>> useless when the SyncGroupNode nests.
>>>> > int
>>>> > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)
>>>> This might need to expose 'volatile WalSnd*' (only pointer type)
>>>> outside of walsender.
>>>> Or it should return the list of index number of
>>>> *WalSndCtl->walsnds*.
>>>
>>> SyncRepGetSyncStandbysUsingPriority() already returns the list of
>>> index number of "WalSndCtl->walsnd" as sync_list, no?
>>> As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need care the
>>> inside of SyncGroupNode in my design.
>>> Selecting sync nodes from its group doesn't depend on the type of node.
>>> What SyncRepGetSyncStandbyFn() should do is to select sync node from
>>> *its* group.
>>>
>>
>> Previous patch has bug around GUC parameter handling.
>> Attached updated version.
>
> Thanks for updating the patch!
>
> Now I'm fixing some problems (e.g., current patch doesn't work
> with EXEC_BACKEND environment) and revising the patch.
> I will post the revised version this weekend or the first half
> of next week.
>
> Regards,
>
> --
> Fujii Masao



-- 
Thomas Munro
http://www.enterprisedb.com



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
This seems to me to be a matter of how "available replicas" is defined.

At Wed, 16 Mar 2016 14:13:48 +1300, Thomas Munro <thomas.munro@enterprisedb.com> wrote in
<CAEepm=3Ye+Ax_5=MZeHMkx9DFn25QoRzs362sQGNvGcVWx+18w@mail.gmail.com>
>     <para>
>      Synchronous replication offers the ability to confirm that all changes
> -    made by a transaction have been transferred to one synchronous standby
> -    server. This extends the standard level of durability
> +    made by a transaction have been transferred to one or more
> synchronous standby
> +    server. This extends that standard level of durability
>      offered by a transaction commit. This level of protection is referred
> -    to as 2-safe replication in computer science theory.
> +    to as group-safe replication in computer science theory.
>     </para>
> 
> A message on the -general list today pointed me to some earlier
> discussion[1] which quoted and referenced definitions of these
> academic terms[2].  I think the above documentation should say:
> 
> "This level of protection is referred to as 2-safe replication in
> computer science literature when <variable>synchronous_commit</> is
> set to <literal>on</>, and group-1-safe (group-safe and 1-safe) when
> <variable>synchronous_commit</> is set to <literal>remote_write</>."

I suppose that an "available replica" in the paper is equivalent
to the one chosen synchronous server at the top of the queue of
living standbys specified by s_s_names. The original description
is true under this interpretation.

> By my reading, the situation doesn't actually change with this patch.
> It doesn't matter whether you need 1 or 42 synchronous standbys to
> make a quorum: 2-safe means durable (fsync) on all of them,
> group-1-safe means durable on one server and received (implied by
> remote_write) by all of them.

Likewise, "the first two of the living standbys" (2[r01, ..r42])
and the master is translated to "three replicas". So it keeps
2-safe for the case.

> I think we should be using those definitions because Gray's earlier
> definition of 2-safe from Transaction Processing 12.6.3 doesn't really
> fit:  It can optionally mean remote receipt or remote durable storage,
> but it doesn't wait if the 'backup' is down, so it's not the same type
> of guarantee.  (He also has 'very safe' which might describe our
> syncrep, I'm not sure.)

If the reasoning above holds, the description doesn't seem to need
amending from the viewpoint of the safety criteria.

>     <para>
>      Synchronous replication offers the ability to confirm that all changes
> -    made by a transaction have been transferred to one synchronous standby
> -    server. This extends the standard level of durability
> +    made by a transaction have been transferred to one or more synchronous standby
> +    server. This extends that standard level of durability
>      offered by a transaction commit. This level of protection is referred
>      to as 2-safe replication in computer science theory.
>     </para>

But some additional explanation might be needed.

For true quorum commit, a client will be notified when the master
and any n of all the standbys have committed. This doesn't fit the
criteria in the paper exactly.

With regard to Gray's definition, "2-safe" looks like PG's syncrep
with an automatic release mechanism, such as what pgsql-RA offers.
And "high availability" doesn't seem to fit PostgreSQL's behavior,
because the master effectively commits a transaction before
reaching an agreement to commit among all of the replicas.

# I'm reading it in Japanese so some words may be incorrect.


Thoughts?

> [1] http://www.postgresql.org/message-id/603c8f070812132142n5408e7ddk899e83cddd4cb0b2@mail.gmail.com
> [2] http://infoscience.epfl.ch/record/33053/files/EPFL_TH2577.pdf page 76
> 
> On Thu, Mar 10, 2016 at 11:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Fri, Mar 4, 2016 at 3:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >> On Thu, Mar 3, 2016 at 11:30 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >>> Hi,
> >>>
> >>> Thank you so much for reviewing this patch!
> >>>
> >>> All review comments regarding document and comment are fixed.
> >>> Attached latest v14 patch.
> >>>
> >>>> This accepts 'abc^Id' as a name, which is wrong behavior (but
> >>>> such appliction names are not allowed anyway. If you assume so,
> >>>> I'd like to see a comment for that.).
> >>>
> >>> 'abc^Id' is accepted as application_name, no?
> >>> postgres(1)=# set application_name to 'abc^Id';
> >>> SET
> >>> postgres(1)=# show application_name ;
> >>>  application_name
> >>> ------------------
> >>>  abc^Id
> >>> (1 row)
> >>>
> >>>> addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned
> >>>> char ychar) requires differnt character types. Is there any reason
> >>>> for that?
> >>>
> >>> Because addlit_xd_string() is for adding string(char *) to xd_string,
> >>> OTOH addlit_xd_char() is for adding just one character to xd_string.
> >>>
> >>>> I personally don't like addlit*string() things for such simple
> >>>> syntax but itself is acceptble enough for me. However it uses
> >>>> StringInfo to hold double-quoted names, which pallocs 1024 bytes
> >>>> of memory chunk for every double-quoted name. The chunks are
> >>>> finally stacked up left uncollected until the current
> >>>> memorycontext is deleted or reset (It is deleted just after
> >>>> finishing config file processing). Addition to that, setting
> >>>> s_s_names runs the parser twice. It seems to me too greedy and
> >>>> seems that static char [NAMEDATALEN] is enough using the v12 way
> >>>> without palloc/repalloc.
> >>>
> >>> I though that length of group name could be more than NAMEDATALEN, so
> >>> I use StringInfo.
> >>> Is it not necessary?
> >>>
> >>>> I found that the name SyncGroupName.wait_num is not
> >>>> instinctive. How about sync_num, sync_member_num or
> >>>> sync_standby_num? If the last is preferable, .members also should
> >>>> be .standbys .
> >>>
> >>> Thanks, sync_num is preferable to me.
> >>>
> >>> ===
> >>>> I am quite uncomfortable with the existence of
> >>>> WanSnd.sync_standby_priority. It represented the pirority in the
> >>>> old linear s_s_names format but nested groups or even
> >>>> single-level quarum list obviously doesn't fit it. Can we get rid
> >>>> of sync_standby_priority, even though we realize atmost
> >>>> n-priority for now?
> >>>
> >>> We could get rid of sync_standby_priority.
> >>> But if so, we will not be able to see the next sync standby in
> >>> pg_stat_replication system view.
> >>> Regarding each node priority, I was thinking that standbys in quorum
> >>> list have same priority, and in nested group each standbys are given
> >>> the priority starting from 1.
> >>>
> >>> ===
> >>>> The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to
> >>>> have specific code for every prioritizing method (which are
> >>>> priority, quorum, nested and so). Is there any reson to use it as
> >>>> a callback of SyncGroupNode?
> >>>
> >>> The reason why the current code is so is that current code is for only
> >>> priority method supporting.
> >>> At first version of this feature, I'd like to implement it more simple.
> >>>
> >>> Aside from this, of course I'm planning to have specific code for nested design.
> >>> - The group can have some name nodes or group nodes.
> >>> - The group can use either 2 types of method: priority or quorum.
> >>> - The group has SyncRepGetSyncedLsnFn() and SyncRepGetStandbysFn()
> >>>   - SyncRepGetSyncedLsnsFn() function recursively determine synced LSN
> >>> at that moment using group's method.
> >>>   - SyncRepGetStandbysFn() function returns standbys of its group,
> >>> which are considered as sync using group's method.
> >>>
> >>> For example, s_s_name  = '3(a, b, 2[c,d]::group1)', SyncRepStandbys
> >>> memory structure will be,
> >>>
> >>> "main(quorum)" --- "a"
> >>>                         |
> >>>                         -- "b"
> >>>                         |
> >>>                         -- "group1(priority)" --- "c"
> >>>                                                      |
> >>>                                                      -- "d"
> >>>
> >>> When determine synced LSNs, we need to consider group1's LSN using by
> >>> priority method at first, and then we can determine main's LSN using
> >>> by quorum method with "a" LSNs, "b" LSNs and "group1" LSNs.
> >>> So SyncRepGetSyncedLsnsUsingPriority() function would be,
> >>>
> >>> bool
> >>> SyncRepGetSyncedLsnsUsingPriority(*group, *write_lsn, *flush_lsn)
> >>> {
> >>>     sync_num = group->SynRepGetSyncstandbysFn(group, sync_list);
> >>>
> >>>     if (sync_num < group->sync_num)
> >>>         return false;
> >>>
> >>>     for (each member of sync_list)
> >>>     {
> >>>         if (member->type == group node)
> >>>             call SyncRepGetSyncedLsnsFn(member, w, f) and store w and
> >>> f into lsn_list.
> >>>         else
> >>>             Store name node LSNs into lsn_list.
> >>>     }
> >>>
> >>>     Determine synced LSNs of this group using lsn_list and priority method.
> >>>     Store synced LSNs into write_lsn and flush_lsn.
> >>>     return true;
> >>> }
> >>>
> >>>> SyncRepClearStandbyGroupList is defined in syncrep.c but the
> >>>> other related functions are defined in syncgroup_gram.y. It would
> >>>> be better to place them together.
> >>>
> >>> SyncRepClearStandbyGroupList() is used by
> >>> check_synchronous_standby_names(), so I put this function syncrep.c.
> >>>
> >>>> SyncRepStandbys are to be in multilevel and the struct is
> >>>> naturally allowed to be so but SyncRepClearStandbyGroupList
> >>>> assumes it in single level.
> >>>
> >>> Because I think that we don't need to implement to fully support
> >>> nested style at first version.
> >>> We have to carefully design this feature while considering
> >>> expandability, but overkill implementation could be cause of crash.
> >>> Consider remaining time for 9.6, I feel we could implement quorum
> >>> method at best.
> >>>
> >>>> This is a comment from the aspect of abstractness of objects.
> >>>> The callers of SyncRepGetSyncStandbysUsingPriority() need to care
> >>>> the inside of SyncGroupNode but what the function should just
> >>>> return seems to be the list of wansnds element. Element number is
> >>>> useless when the SyncGroupNode nests.
> >>>> > int
> >>>> > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)
> >>>> This might need to expose 'volatile WalSnd*' (only pointer type)
> >>>> outside of walsender.
> >>>> Or it should return the list of index number of
> >>>> *WalSndCtl->walsnds*.
> >>>
> >>> SyncRepGetSyncStandbysUsingPriority() already returns the list of
> >>> index number of "WalSndCtl->walsnd" as sync_list, no?
> >>> As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need care the
> >>> inside of SyncGroupNode in my design.
> >>> Selecting sync nodes from its group doesn't depend on the type of node.
> >>> What SyncRepGetSyncStandbyFn() should do is to select sync node from
> >>> *its* group.
> >>>
> >>
> >> Previous patch has bug around GUC parameter handling.
> >> Attached updated version.
> >
> > Thanks for updating the patch!
> >
> > Now I'm fixing some problems (e.g., current patch doesn't work
> > with EXEC_BACKEND environment) and revising the patch.
> > I will post the revised version this weekend or the first half
> > of next week.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Thu, Mar 10, 2016 at 7:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 3:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Thu, Mar 3, 2016 at 11:30 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Hi,
>>>
>>> Thank you so much for reviewing this patch!
>>>
>>> All review comments regarding document and comment are fixed.
>>> Attached latest v14 patch.
>>>
>>>> This accepts 'abc^Id' as a name, which is wrong behavior (but
>>>> such appliction names are not allowed anyway. If you assume so,
>>>> I'd like to see a comment for that.).
>>>
>>> 'abc^Id' is accepted as application_name, no?
>>> postgres(1)=# set application_name to 'abc^Id';
>>> SET
>>> postgres(1)=# show application_name ;
>>>  application_name
>>> ------------------
>>>  abc^Id
>>> (1 row)
>>>
>>>> addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned
>>>> char ychar) requires differnt character types. Is there any reason
>>>> for that?
>>>
>>> Because addlit_xd_string() is for adding string(char *) to xd_string,
>>> OTOH addlit_xd_char() is for adding just one character to xd_string.
>>>
>>>> I personally don't like addlit*string() things for such simple
>>>> syntax but itself is acceptble enough for me. However it uses
>>>> StringInfo to hold double-quoted names, which pallocs 1024 bytes
>>>> of memory chunk for every double-quoted name. The chunks are
>>>> finally stacked up left uncollected until the current
>>>> memorycontext is deleted or reset (It is deleted just after
>>>> finishing config file processing). Addition to that, setting
>>>> s_s_names runs the parser twice. It seems to me too greedy and
>>>> seems that static char [NAMEDATALEN] is enough using the v12 way
>>>> without palloc/repalloc.
>>>
>>> I though that length of group name could be more than NAMEDATALEN, so
>>> I use StringInfo.
>>> Is it not necessary?
>>>
>>>> I found that the name SyncGroupName.wait_num is not
>>>> instinctive. How about sync_num, sync_member_num or
>>>> sync_standby_num? If the last is preferable, .members also should
>>>> be .standbys .
>>>
>>> Thanks, sync_num is preferable to me.
>>>
>>> ===
>>>> I am quite uncomfortable with the existence of
>>>> WanSnd.sync_standby_priority. It represented the pirority in the
>>>> old linear s_s_names format but nested groups or even
>>>> single-level quarum list obviously doesn't fit it. Can we get rid
>>>> of sync_standby_priority, even though we realize atmost
>>>> n-priority for now?
>>>
>>> We could get rid of sync_standby_priority.
>>> But if so, we will not be able to see the next sync standby in
>>> pg_stat_replication system view.
>>> Regarding each node priority, I was thinking that standbys in quorum
>>> list have same priority, and in nested group each standbys are given
>>> the priority starting from 1.
>>>
>>> ===
>>>> The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to
>>>> have specific code for every prioritizing method (which are
>>>> priority, quorum, nested and so). Is there any reson to use it as
>>>> a callback of SyncGroupNode?
>>>
>>> The reason why the current code is so is that current code is for only
>>> priority method supporting.
>>> At first version of this feature, I'd like to implement it more simple.
>>>
>>> Aside from this, of course I'm planning to have specific code for nested design.
>>> - The group can have some name nodes or group nodes.
>>> - The group can use either 2 types of method: priority or quorum.
>>> - The group has SyncRepGetSyncedLsnFn() and SyncRepGetStandbysFn()
>>>   - SyncRepGetSyncedLsnsFn() function recursively determine synced LSN
>>> at that moment using group's method.
>>>   - SyncRepGetStandbysFn() function returns standbys of its group,
>>> which are considered as sync using group's method.
>>>
>>> For example, s_s_name  = '3(a, b, 2[c,d]::group1)', SyncRepStandbys
>>> memory structure will be,
>>>
>>> "main(quorum)" --- "a"
>>>                         |
>>>                         -- "b"
>>>                         |
>>>                         -- "group1(priority)" --- "c"
>>>                                                      |
>>>                                                      -- "d"
>>>
>>> When determine synced LSNs, we need to consider group1's LSN using by
>>> priority method at first, and then we can determine main's LSN using
>>> by quorum method with "a" LSNs, "b" LSNs and "group1" LSNs.
>>> So SyncRepGetSyncedLsnsUsingPriority() function would be,
>>>
>>> bool
>>> SyncRepGetSyncedLsnsUsingPriority(*group, *write_lsn, *flush_lsn)
>>> {
>>>     sync_num = group->SynRepGetSyncstandbysFn(group, sync_list);
>>>
>>>     if (sync_num < group->sync_num)
>>>         return false;
>>>
>>>     for (each member of sync_list)
>>>     {
>>>         if (member->type == group node)
>>>             call SyncRepGetSyncedLsnsFn(member, w, f) and store w and
>>> f into lsn_list.
>>>         else
>>>             Store name node LSNs into lsn_list.
>>>     }
>>>
>>>     Determine synced LSNs of this group using lsn_list and priority method.
>>>     Store synced LSNs into write_lsn and flush_lsn.
>>>     return true;
>>> }
>>>
>>>> SyncRepClearStandbyGroupList is defined in syncrep.c but the
>>>> other related functions are defined in syncgroup_gram.y. It would
>>>> be better to place them together.
>>>
>>> SyncRepClearStandbyGroupList() is used by
>>> check_synchronous_standby_names(), so I put this function syncrep.c.
>>>
>>>> SyncRepStandbys are to be in multilevel and the struct is
>>>> naturally allowed to be so but SyncRepClearStandbyGroupList
>>>> assumes it in single level.
>>>
>>> Because I think that we don't need to implement to fully support
>>> nested style at first version.
>>> We have to carefully design this feature while considering
>>> expandability, but overkill implementation could be cause of crash.
>>> Consider remaining time for 9.6, I feel we could implement quorum
>>> method at best.
>>>
>>>> This is a comment from the aspect of abstractness of objects.
>>>> The callers of SyncRepGetSyncStandbysUsingPriority() need to care
>>>> the inside of SyncGroupNode but what the function should just
>>>> return seems to be the list of wansnds element. Element number is
>>>> useless when the SyncGroupNode nests.
>>>> > int
>>>> > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)
>>>> This might need to expose 'volatile WalSnd*' (only pointer type)
>>>> outside of walsender.
>>>> Or it should return the list of index number of
>>>> *WalSndCtl->walsnds*.
>>>
>>> SyncRepGetSyncStandbysUsingPriority() already returns the list of
>>> index number of "WalSndCtl->walsnd" as sync_list, no?
>>> As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need care the
>>> inside of SyncGroupNode in my design.
>>> Selecting sync nodes from its group doesn't depend on the type of node.
>>> What SyncRepGetSyncStandbyFn() should do is to select sync node from
>>> *its* group.
>>>
>>
>> Previous patch has bug around GUC parameter handling.
>> Attached updated version.
>
> Thanks for updating the patch!
>
> Now I'm fixing some problems (e.g., current patch doesn't work
> with EXEC_BACKEND environment) and revising the patch.

Sorry for the delay... Here is the revised version of the patch.
Please review and test this version!
BTW, I've not revised the documentation and regression tests yet.
I will do that during the review and testing of the patch.

Regards,

--
Fujii Masao

Attachment

Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Thank you for the revised patch.

At Tue, 22 Mar 2016 16:02:39 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwGnvuX8wR-FYH+TrNi_TWunZzU=nJFMdXkO6O8M4GbNvQ@mail.gmail.com>
> On Thu, Mar 10, 2016 at 7:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Sorry for the delay... Here is the revised version of the patch.
> Please review and test this version!
> BTW, I've not revised the documentation and regression test yet.
> I will do that during the review and test of the patch.

This version looks focused on the n-priority method. The stuff for
the other methods, such as n-quorum, has been removed. That is
okay with me.

So using WalSnd->sync_standby_priority is reasonable.

SyncRepGetSyncStandbys seems to work as expected, that is, it
collects n standbys in priority order. When multiple standbys are
at the same priority, they are picked in (pseudo) random order
among themselves rather than in LSN order; this is the difference
from a true quorum method.
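
Roughly, the candidate collection behaves like the following
sketch (this is my reading of the behavior, not the patch's actual
code; add_candidate() stands in for however the patch accumulates
the list):

int i;

/*
 * Collect every connected, streaming walsender that is a sync
 * candidate; the caller then takes the first sync_num of them in
 * ascending priority value.  Ties keep walsnds array order, which
 * is effectively arbitrary rather than LSN order.
 */
for (i = 0; i < max_wal_senders; i++)
{
    volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];

    if (walsnd->pid != 0 &&
        walsnd->state == WALSNDSTATE_STREAMING &&
        walsnd->sync_standby_priority > 0)
        add_candidate(i, walsnd->sync_standby_priority);
}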

About the announcement of takeover:

>   if (announce_next_takeover && am_sync)
>   {
>     announce_next_takeover = false;
>     ereport(LOG,
>         (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
>             application_name, MyWalSnd->sync_standby_priority)));

This can announce the seemingly same standby repeatedly if
standbys with the same application_name keep coming in and going
out. But this is the same as the current behavior.

Otherwise, as far as I can see, SyncRepReleaseWaiters seems to
work correctly.


SyncRepInitConfig parses s_s_names then prioritizes all walsenders
based on the result. This is run at the start of a walsender and
at reloading of the config. Ended walsenders are excluded when
collecting sync standbys. All of this seems to work
properly (as before).

The parser became far simpler by getting rid of the code for
future expansion. It accepts only '<n>[name, ...]' and the
old s_s_names format.
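As a rough illustration of what that grammar accepts, here is a toy
sketch in plain C (not the patch's actual flex/bison code):

/*
 * Toy sketch, not the patch's grammar: accept either the old
 * comma-separated form "s1, s2" (equivalent to num_sync = 1) or the new
 * "N[s1, s2, ...]" form, and print num_sync plus the member names.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

static void
parse_s_s_names(const char *raw)
{
    char        buf[256];
    char       *list = buf;
    char       *name;
    char       *p;
    int         num_sync = 1;   /* old format: wait for the first one */

    strncpy(buf, raw, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    if ((p = strchr(buf, '[')) != NULL)     /* new "N[...]" format */
    {
        num_sync = atoi(buf);
        list = p + 1;
        if ((p = strchr(list, ']')) != NULL)
            *p = '\0';
    }

    printf("num_sync = %d, members:", num_sync);
    for (name = strtok(list, ","); name; name = strtok(NULL, ","))
    {
        while (isspace((unsigned char) *name))
            name++;
        printf(" \"%s\"", name);
    }
    printf("\n");
}

int
main(void)
{
    parse_s_s_names("standby1, standby2");              /* old format */
    parse_s_s_names("2[standby1, standby2, standby3]"); /* new format */
    return 0;
}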

StringInfo for double-quoted names seems to me to be overkill,
since it allocates a 1024-byte block for every such name. A static
buffer seems enough for this usage, as I said.

The parser is called not only on SIGHUP, but also at the
start of every walsender. The latter is not necessary, but it
is a matter of trade-off between simplicity and
effectiveness. The same can be said for
check_synchronous_standby_names().

regards,


-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Mar 22, 2016 at 9:58 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Thank you for the revised patch.

Thanks for reviewing the patch!

> This version looks to focus on n-priority method. Stuffs for the
> other methods like n-quorum has been removed. It is okay for me.

I don't think it's so difficult to extend this version so that
it also supports quorum commit.

> StringInfo for double-quoted names seems to me to be overkill,
> since it allocates 1024 byte block for every such name. A static
> buffer seems enough for the usage as I said.

So, what about changing the scanner code as follows?

<xd>{xdstop} {
                yylval.str = pstrdup(xdbuf.data);
                pfree(xdbuf.data);
                BEGIN(INITIAL);
                return NAME;

> The parser is called for not only for SIGHUP, but also for
> starting of every walsender. The latter is not necessary but it
> is the matter of trade-off between simplisity and
> effectiveness.

Could you elaborate why you think that's not necessary?

BTW, in the previous patch, s_s_names is parsed by the postmaster during server
startup. A child process takes over the internal data struct for the parsed
s_s_names when it's forked by the postmaster. This is what the previous
patch was expecting. However, this doesn't work in an EXEC_BACKEND environment.
In that environment, the data struct would have to be passed to a child process via
the special file (like write_nondefault_variables() does), or it would have to
be constructed during walsender startup (like the latest version of the patch
does). IMO the latter is simpler.
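A minimal sketch of the latter approach, for illustration only
(syncrep_parse_string() and the variable names are stand-ins, not
necessarily the patch's identifiers):

extern char *SyncRepStandbyNames;           /* the raw GUC string */
static SyncRepConfigData *SyncRepConfig;    /* this process's parsed copy */

void
SyncRepInitConfig(void)
{
    /* forget any previous parse (e.g., before a SIGHUP re-parse) */
    if (SyncRepConfig != NULL)
        SyncRepFreeConfig(SyncRepConfig);

    /*
     * Build the parsed form locally from the GUC string.  Since every
     * walsender does this itself, nothing has to be inherited through
     * fork(), which EXEC_BACKEND builds cannot rely on anyway.
     */
    SyncRepConfig = syncrep_parse_string(SyncRepStandbyNames);
}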

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Tue, Mar 22, 2016 at 11:08 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Mar 22, 2016 at 9:58 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> Thank you for the revised patch.
>
> Thanks for reviewing the patch!
>
>> This version looks to focus on n-priority method. Stuffs for the
>> other methods like n-quorum has been removed. It is okay for me.
>
> I don't think it's so difficult to extend this version so that
> it supports also quorum commit.

Yeah, a 1-nest-level implementation would not be so difficult.

>> StringInfo for double-quoted names seems to me to be overkill,
>> since it allocates 1024 byte block for every such name. A static
>> buffer seems enough for the usage as I said.
>
> So, what about changing the scanner code as follows?
>
> <xd>{xdstop} {
>                 yylval.str = pstrdup(xdbuf.data);
>                 pfree(xdbuf.data);
>                 BEGIN(INITIAL);
>                 return NAME;
>> The parser is called for not only for SIGHUP, but also for
>> starting of every walsender. The latter is not necessary but it
>> is the matter of trade-off between simplisity and
>> effectiveness.
>
> Could you elaborate why you think that's not necessary?
>
> BTW, in previous patch, s_s_names is parsed by postmaster during the server
> startup. A child process takes over the internal data struct for the parsed
> s_s_names when it's forked by the postmaster. This is what the previous
> patch was expecting. However, this doesn't work in EXEC_BACKEND environment.
> In that environment, the data struct should be passed to a child process via
> the special file (like write_nondefault_variables() does), or it should
> be constructed during walsender startup (like latest version of the patch
> does). IMO the latter is simpler.

Thank you for updating the patch.

The following are random review comments.

==
+               for (cell = list_head(pending); cell; cell = next)

Can we use foreach() instead?
==
+                               pending = list_delete_cell(pending, cell, prev);
+
+                               if (list_length(pending) == 0)
+                               {
+                                       list_free(pending);
+                                       return result;          /*
Exit if pending list is empty */
+                               }

If the pending list becomes empty after deleting the element, we can return.
It's a small optimisation.
==
If num_sync is greater than the number of members of the sync standby
list, we'd rather return an error message immediately.
Thoughts?
==
I got an assertion error when the master server is set up with empty s_s_names,
because the current patch always tries to parse s_s_names and use it
regardless of the parameter's value.

The attached patch incorporates the above comments.
Please find it attached.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Mar 23, 2016 at 2:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Mar 22, 2016 at 11:08 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Tue, Mar 22, 2016 at 9:58 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> Thank you for the revised patch.
>>
>> Thanks for reviewing the patch!
>>
>>> This version looks to focus on n-priority method. Stuffs for the
>>> other methods like n-quorum has been removed. It is okay for me.
>>
>> I don't think it's so difficult to extend this version so that
>> it supports also quorum commit.
>
> Yeah, 1-nest level implementation would not so difficult.
>
>>> StringInfo for double-quoted names seems to me to be overkill,
>>> since it allocates 1024 byte block for every such name. A static
>>> buffer seems enough for the usage as I said.
>>
>> So, what about changing the scanner code as follows?
>>
>> <xd>{xdstop} {
>>                 yylval.str = pstrdup(xdbuf.data);
>>                 pfree(xdbuf.data);
>>                 BEGIN(INITIAL);
>>                 return NAME;

I applied this change to the latest version of the patch.
Please check that.

Also I changed syncrep.c so that it uses list_free_deep() to free the list
of the parsed s_s_names, because the data in the list is palloc'd by
syncrep_scanner.l.

>>> The parser is called for not only for SIGHUP, but also for
>>> starting of every walsender. The latter is not necessary but it
>>> is the matter of trade-off between simplisity and
>>> effectiveness.
>>
>> Could you elaborate why you think that's not necessary?
>>
>> BTW, in previous patch, s_s_names is parsed by postmaster during the server
>> startup. A child process takes over the internal data struct for the parsed
>> s_s_names when it's forked by the postmaster. This is what the previous
>> patch was expecting. However, this doesn't work in EXEC_BACKEND environment.
>> In that environment, the data struct should be passed to a child process via
>> the special file (like write_nondefault_variables() does), or it should
>> be constructed during walsender startup (like latest version of the patch
>> does). IMO the latter is simpler.
>
> Thank you for updating patch.
>
> Followings are random review comments.
>
> ==
> +               for (cell = list_head(pending); cell; cell = next)
>
> Can we use foreach() instead?

Yes.

> ==
> +                               pending = list_delete_cell(pending, cell, prev);
> +
> +                               if (list_length(pending) == 0)
> +                               {
> +                                       list_free(pending);
> +                                       return result;          /*
> Exit if pending list is empty */
> +                               }
>
> If pending list become empty after deleting element, we can return.
> It's a small optimisation.

I don't think this is necessary because currently we can get out of the loop
immediately after that deletion.

But I found a bug in the calculation of the next highest priority.
This could cause extra unnecessary loop iterations. I fixed that in the latest
version of the patch.

> ==
> If num_sync is greater than the number of members of sync standby
> list, we'd rather return error message immediately.
> Thoughts?

No. For example, please imagine the case where s_s_names is set to '*'
and more than one sync standby is connecting to the master.
That's a valid setting.
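For instance, assuming the patch's N[...] syntax, a value like the
following lists only one member yet can legitimately require two sync
standbys, since any number of standbys can match '*':

synchronous_standby_names = '2[*]'    # wait for any two connected standbys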

> ==
> I got assertion error when master server is set up with empty s_s_names.
> Because current patch always tries to parse s_s_names and use it
> regardless value of parameter.

Yeah, you're right.

>
> Attached patch incorporates above comments.
> Please find it.

Attached is the latest version of the patch based on your patch.

Regards,

--
Fujii Masao

Attachment

Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Mar 23, 2016 at 1:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Mar 23, 2016 at 2:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Attached patch incorporates above comments.
>> Please find it.
>
> Attached is the latest version of the patch based on your patch.

Not really having a look at the core patch yet...

+ my $result = $node_master->psql('postgres', "SELECT
application_name, sync_priority, sync_state FROM
pg_stat_replication;");
+ print "$result \n";
Having ORDER BY application_name would be good for those queries, and
the result outputs would become more consistent.

+ # Change the s_s_names = '2[standby1,standby2,standby3]' and check sync state
+ $node_master->psql('postgres', "ALTER SYSTEM SET
synchronous_standby_names = '2[standby1,standby2,standby3]';");
+ $node_master->psql('postgres', "SELECT pg_reload_conf();");
Let's add a reload routine in PostgresNode.pm; this patch is not the
only one that would use it.

--- b/src/test/recovery/t/006_multisync_rep.pl
***************
*** 0 ****
--- 1,106 ----
+ use strict;
+ use warnings;
You may want to add a small description for this test as a header.
      $postgres->AddFiles('src/backend/replication', 'repl_scanner.l',
          'repl_gram.y');
+     $postgres->AddFiles('src/backend/replication', 'syncrep_scanner.l',
+         'syncrep_gram.y');
There is no need for a new routine call here; you can just append the
new files to the existing call.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Tue, 22 Mar 2016 23:08:36 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwFYG829=2r4mxV0ULeBNaUuG0ek_10yymx8Cu-gLYcLng@mail.gmail.com>
> On Tue, Mar 22, 2016 at 9:58 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Thank you for the revised patch.
> 
> Thanks for reviewing the patch!
> 
> > This version looks to focus on n-priority method. Stuffs for the
> > other methods like n-quorum has been removed. It is okay for me.
> 
> I don't think it's so difficult to extend this version so that
> it supports also quorum commit.

Mmm, I think I understand this just now. As Sawada-san said
before, since all standbys in a single-level quorum set have the same
sync_standby_priority, the current algorithm works as it is. The
same is true for the case where some quorum sets are inside a priority
set.

What about some priority sets in a quorum set?

> > StringInfo for double-quoted names seems to me to be overkill,
> > since it allocates 1024 byte block for every such name. A static
> > buffer seems enough for the usage as I said.
> 
> So, what about changing the scanner code as follows?
> 
> <xd>{xdstop} {
>                 yylval.str = pstrdup(xdbuf.data);
>                 pfree(xdbuf.data);
>                 BEGIN(INITIAL);
>                 return NAME;
> 
> > The parser is called for not only for SIGHUP, but also for
> > starting of every walsender. The latter is not necessary but it
> > is the matter of trade-off between simplisity and
> > effectiveness.
> 
> Could you elaborate why you think that's not necessary?

Sorry, the start of a walsender is not such a large problem; 1024 bytes
of memory is abandoned just once. SIGHUP is rather the problem.

The code is called under two kinds of memory context, "config
file processing" and then "Replication command context". The former
is deleted just after reading the config file, so no harm, but the
latter is a quite long-lasting context, and every reload bloats
the context with abandoned memory blocks. The memory needs to be
pfree'd, or a memory context with a shorter lifetime should be used,
or static storage of 64-byte length, even though the bloat becomes
visible only after very many config reloads.
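For illustration, one way to avoid that bloat would be a sketch like the
following, using the backend's existing MemoryContext API (this is not
code from the patch):

/*
 * Sketch only: run the syncrep parser in a throwaway child context and
 * delete the whole context afterwards, so that nothing parsed under the
 * long-lived "Replication command context" is left abandoned.
 */
MemoryContext parse_cxt;
MemoryContext old_cxt;

parse_cxt = AllocSetContextCreate(CurrentMemoryContext,
                                  "syncrep parse context",
                                  ALLOCSET_DEFAULT_MINSIZE,
                                  ALLOCSET_DEFAULT_INITSIZE,
                                  ALLOCSET_DEFAULT_MAXSIZE);
old_cxt = MemoryContextSwitchTo(parse_cxt);

/* ... parse s_s_names here; all palloc'd memory lands in parse_cxt ... */

MemoryContextSwitchTo(old_cxt);
MemoryContextDelete(parse_cxt);     /* releases every parser allocation */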


> BTW, in previous patch, s_s_names is parsed by postmaster during the server
> startup. A child process takes over the internal data struct for the parsed
> s_s_names when it's forked by the postmaster. This is what the previous
> patch was expecting. However, this doesn't work in EXEC_BACKEND environment.
> In that environment, the data struct should be passed to a child process via
> the special file (like write_nondefault_variables() does), or it should
> be constructed during walsender startup (like latest version of the patch
> does). IMO the latter is simpler.

Ah, I hadn't noticed that, but I agree with it.


As per my previous comment, syncrep_scanner.l doesn't reject some
(nonprintable and multibyte) characters in a name, which would be
silently replaced with '?' in application_name. It would not be
a problem for almost all of us, but it might need to be
documented if we won't change the behavior to match that of
application_name.

By the way, the following documentation fix mentioned by Thomas,

-    to as 2-safe replication in computer science theory.
+    to as group-safe replication in computer science theory.

should be restored if the discussion in the following message is
true. And some supplemental description would be needed.

http://www.postgresql.org/message-id/20160316.164833.188624159.horiguchi.kyotaro@lab.ntt.co.jp


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Mar 23, 2016 at 5:32 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Tue, 22 Mar 2016 23:08:36 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwFYG829=2r4mxV0ULeBNaUuG0ek_10yymx8Cu-gLYcLng@mail.gmail.com>
>> On Tue, Mar 22, 2016 at 9:58 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > Thank you for the revised patch.
>>
>> Thanks for reviewing the patch!
>>
>> > This version looks to focus on n-priority method. Stuffs for the
>> > other methods like n-quorum has been removed. It is okay for me.
>>
>> I don't think it's so difficult to extend this version so that
>> it supports also quorum commit.
>
> Mmm. I think I understand this just now. As Sawada-san said
> before, all standbys in a single-level quorum set having the same
> sync_standby_prioirity, the current algorithm works as it is. It
> also true for the case that some quorum sets are in a priority
> set.
>
> What about some priority sets in a quorum set?
>
>> > StringInfo for double-quoted names seems to me to be overkill,
>> > since it allocates 1024 byte block for every such name. A static
>> > buffer seems enough for the usage as I said.
>>
>> So, what about changing the scanner code as follows?
>>
>> <xd>{xdstop} {
>>                 yylval.str = pstrdup(xdbuf.data);
>>                 pfree(xdbuf.data);
>>                 BEGIN(INITIAL);
>>                 return NAME;
>>
>> > The parser is called for not only for SIGHUP, but also for
>> > starting of every walsender. The latter is not necessary but it
>> > is the matter of trade-off between simplisity and
>> > effectiveness.
>>
>> Could you elaborate why you think that's not necessary?
>
> Sorry, starting of walsender is not so large problem, 1024 bytes
> memory is just abandoned once. SIGHUP is rather a problem.
>
> The part is called under two kinds of memory context, "config
> file processing" then "Replication command context". The former
> is deleted just after reading the config file so no harm but the
> latter is a quite long-lasting context and every reloading bloats
> the context with abandoned memory blocks. It is needed to be
> pfreed or to use a memory context with shorter lifetime, or use
> static storage of 64 byte-length, even though the bloat become
> visible after very many times of conf reloads.

SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that
in the patch. Or am I missing something?

>> BTW, in previous patch, s_s_names is parsed by postmaster during the server
>> startup. A child process takes over the internal data struct for the parsed
>> s_s_names when it's forked by the postmaster. This is what the previous
>> patch was expecting. However, this doesn't work in EXEC_BACKEND environment.
>> In that environment, the data struct should be passed to a child process via
>> the special file (like write_nondefault_variables() does), or it should
>> be constructed during walsender startup (like latest version of the patch
>> does). IMO the latter is simpler.
>
> Ah, I haven't notice that but I agree with it.
>
>
> As per my previous comment, syncrep_scanner.l doesn't reject some
> (nonprintable and multibyte) characters in a name, which is to be
> silently replaced with '?' for application_name. It would not be
> a problem for almost all of us but might be needed to be
> documented if we won't change the behavior to be the same as
> application_name.

There are three options:

1. Replace nonprintable and non-ASCII characters in s_s_names with ?
2. Emit an error if s_s_names contains nonprintable and non-ASCII characters
3. Do nothing (9.5 or before behave in this way)

You implied that we should choose #1 or #2?

> By the way, the following documentation fix mentioned by Thomas,
>
> -    to as 2-safe replication in computer science theory.
> +    to as group-safe replication in computer science theory.
>
> should be restored if the discussion in the following message is
> true. And some supplemental description would be needed.
>
> http://www.postgresql.org/message-id/20160316.164833.188624159.horiguchi.kyotaro@lab.ntt.co.jp

Yeah, the document needs to be updated.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Thu, Mar 24, 2016 at 11:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Mar 23, 2016 at 5:32 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> Hello,
>>
>> At Tue, 22 Mar 2016 23:08:36 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwFYG829=2r4mxV0ULeBNaUuG0ek_10yymx8Cu-gLYcLng@mail.gmail.com>
>>> On Tue, Mar 22, 2016 at 9:58 PM, Kyotaro HORIGUCHI
>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> > Thank you for the revised patch.
>>>
>>> Thanks for reviewing the patch!
>>>
>>> > This version looks to focus on n-priority method. Stuffs for the
>>> > other methods like n-quorum has been removed. It is okay for me.
>>>
>>> I don't think it's so difficult to extend this version so that
>>> it supports also quorum commit.
>>
>> Mmm. I think I understand this just now. As Sawada-san said
>> before, all standbys in a single-level quorum set having the same
>> sync_standby_prioirity, the current algorithm works as it is. It
>> also true for the case that some quorum sets are in a priority
>> set.
>>
>> What about some priority sets in a quorum set?

We should surely consider that when we support more than one nest
level of configuration.
IMO, we can have another piece of information which indicates the current
sync standbys instead of sync_priority.
For now, we aren't trying to support even the quorum method, so we could
consider it after we can support both the priority method and the quorum
method without incident.

>>> > StringInfo for double-quoted names seems to me to be overkill,
>>> > since it allocates 1024 byte block for every such name. A static
>>> > buffer seems enough for the usage as I said.
>>>
>>> So, what about changing the scanner code as follows?
>>>
>>> <xd>{xdstop} {
>>>                 yylval.str = pstrdup(xdbuf.data);
>>>                 pfree(xdbuf.data);
>>>                 BEGIN(INITIAL);
>>>                 return NAME;
>>>
>>> > The parser is called for not only for SIGHUP, but also for
>>> > starting of every walsender. The latter is not necessary but it
>>> > is the matter of trade-off between simplisity and
>>> > effectiveness.
>>>
>>> Could you elaborate why you think that's not necessary?
>>
>> Sorry, starting of walsender is not so large problem, 1024 bytes
>> memory is just abandoned once. SIGHUP is rather a problem.
>>
>> The part is called under two kinds of memory context, "config
>> file processing" then "Replication command context". The former
>> is deleted just after reading the config file so no harm but the
>> latter is a quite long-lasting context and every reloading bloats
>> the context with abandoned memory blocks. It is needed to be
>> pfreed or to use a memory context with shorter lifetime, or use
>> static storage of 64 byte-length, even though the bloat become
>> visible after very many times of conf reloads.
>
> SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that
> in the patch. Or am I missing something?
>
>>> BTW, in previous patch, s_s_names is parsed by postmaster during the server
>>> startup. A child process takes over the internal data struct for the parsed
>>> s_s_names when it's forked by the postmaster. This is what the previous
>>> patch was expecting. However, this doesn't work in EXEC_BACKEND environment.
>>> In that environment, the data struct should be passed to a child process via
>>> the special file (like write_nondefault_variables() does), or it should
>>> be constructed during walsender startup (like latest version of the patch
>>> does). IMO the latter is simpler.
>>
>> Ah, I haven't notice that but I agree with it.
>>
>>
>> As per my previous comment, syncrep_scanner.l doesn't reject some
>> (nonprintable and multibyte) characters in a name, which is to be
>> silently replaced with '?' for application_name. It would not be
>> a problem for almost all of us but might be needed to be
>> documented if we won't change the behavior to be the same as
>> application_name.
>
> There are three options:
>
> 1. Replace nonprintable and non-ASCII characters in s_s_names with ?
> 2. Emit an error if s_s_names contains nonprintable and non-ASCII characters
> 3. Do nothing (9.5 or before behave in this way)
>
> You implied that we should choose #1 or #2?

The previous (9.5 or before) s_s_names also accepts non-ASCII and
non-printable characters, and can show them without replacing these
characters with '?'.
From a backward-compatibility perspective, we should not choose #1 or #2.
The difference in behaviour between the previous and current s_s_names is that
the previous s_s_names doesn't accept a node name containing the sort of
white-space character that isspace() returns true for,
while the current s_s_names allows us to specify such a node name.
I guess that changing that behaviour is enough to fix this issue.
Thoughts?

>
>> By the way, the following documentation fix mentioned by Thomas,
>>
>> -    to as 2-safe replication in computer science theory.
>> +    to as group-safe replication in computer science theory.
>>
>> should be restored if the discussion in the following message is
>> true. And some supplemental description would be needed.
>>
>> http://www.postgresql.org/message-id/20160316.164833.188624159.horiguchi.kyotaro@lab.ntt.co.jp
>
> Yeah, the document needs to be updated.

I will do that.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Thu, 24 Mar 2016 13:04:49 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoBVn3_5qC_CKeKSXTu963mM=n9-GxzF7KCPreTTMS+JGQ@mail.gmail.com>
> On Thu, Mar 24, 2016 at 11:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Wed, Mar 23, 2016 at 5:32 PM, Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >>> I don't think it's so difficult to extend this version so that
> >>> it supports also quorum commit.
> >>
> >> Mmm. I think I understand this just now. As Sawada-san said
> >> before, all standbys in a single-level quorum set having the same
> >> sync_standby_prioirity, the current algorithm works as it is. It
> >> also true for the case that some quorum sets are in a priority
> >> set.
> >>
> >> What about some priority sets in a quorum set?
> 
> We should surely consider it that when we support more than 1 nest
> level configuration.
> IMO, we can have another information which indicates current sync
> standbys instead of sync_priority.
> For now, we are'nt trying to support even quorum method, so we could
> consider it after we can support both priority method and quorum
> method without incident.

Fine with me.

> >>> > StringInfo for double-quoted names seems to me to be overkill,
> >>> > since it allocates 1024 byte block for every such name. A static
> >>> > buffer seems enough for the usage as I said.
> >>>
> >>> So, what about changing the scanner code as follows?
> >>>
> >>> <xd>{xdstop} {
> >>>                 yylval.str = pstrdup(xdbuf.data);
> >>>                 pfree(xdbuf.data);
> >>>                 BEGIN(INITIAL);
> >>>                 return NAME;
> >>>
> >>> > The parser is called for not only for SIGHUP, but also for
> >>> > starting of every walsender. The latter is not necessary but it
> >>> > is the matter of trade-off between simplisity and
> >>> > effectiveness.
> >>>
> >>> Could you elaborate why you think that's not necessary?
> >>
> >> Sorry, starting of walsender is not so large problem, 1024 bytes
> >> memory is just abandoned once. SIGHUP is rather a problem.
> >>
> >> The part is called under two kinds of memory context, "config
> >> file processing" then "Replication command context". The former
> >> is deleted just after reading the config file so no harm but the
> >> latter is a quite long-lasting context and every reloading bloats
> >> the context with abandoned memory blocks. It is needed to be
> >> pfreed or to use a memory context with shorter lifetime, or use
> >> static storage of 64 byte-length, even though the bloat become
> >> visible after very many times of conf reloads.
> >
> > SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that
> > in the patch. Or am I missing something?

Sorry, instead, the memory from strdup() will be abandoned at an
upper level. (Thinking for some time..) Ah, I found that the
problem should be here.
 > SyncRepFreeConfig(SyncRepConfigData *config)
 > {
...
!>      list_free(config->members);
 >      pfree(config);
 > }

The list_free *doesn't* free the memory blocks pointed to by
lfirst(cell), which have been pstrdup'ed. It should be
list_free_deep(config->members) instead to free everything completely.
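In other words, the fix would be a one-line change (a sketch of the
corrected cleanup; list_free() would release only the list cells,
leaking the pstrdup'ed name strings they point to):

void
SyncRepFreeConfig(SyncRepConfigData *config)
{
    list_free_deep(config->members);    /* frees the cells and the names */
    pfree(config);
}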

> >>> BTW, in previous patch, s_s_names is parsed by postmaster during the server
> >>> startup. A child process takes over the internal data struct for the parsed
> >>> s_s_names when it's forked by the postmaster. This is what the previous
> >>> patch was expecting. However, this doesn't work in EXEC_BACKEND environment.
> >>> In that environment, the data struct should be passed to a child process via
> >>> the special file (like write_nondefault_variables() does), or it should
> >>> be constructed during walsender startup (like latest version of the patch
> >>> does). IMO the latter is simpler.
> >>
> >> Ah, I haven't notice that but I agree with it.
> >>
> >>
> >> As per my previous comment, syncrep_scanner.l doesn't reject some
> >> (nonprintable and multibyte) characters in a name, which is to be
> >> silently replaced with '?' for application_name. It would not be
> >> a problem for almost all of us but might be needed to be
> >> documented if we won't change the behavior to be the same as
> >> application_name.
> >
> > There are three options:
> >
> > 1. Replace nonprintable and non-ASCII characters in s_s_names with ?
> > 2. Emit an error if s_s_names contains nonprintable and non-ASCII characters
> > 3. Do nothing (9.5 or before behave in this way)
> >
> > You implied that we should choose #1 or #2?
> 
> Previous(9.5 or before) s_s_names also accepts non-ASCII character and
> non-printable character, and can show it without replacing these
> character to '?'.

Thank you for pointing it out (it was completely out of my
mind..). I have no objection to keeping the previous behavior.

> From backward compatibility perspective, we should not choose #1 or #2.
> Different behaviour between previous and current s_s_names is that
> previous s_s_names doesn't accept the node name having the sort of
> white-space character that isspace() returns true with.
> But current s_s_names allows us to specify such a node name.
> I guess that changing such behaviour is enough for fixing this issue.
> Thoughts?


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Thu, Mar 24, 2016 at 2:26 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Thu, 24 Mar 2016 13:04:49 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoBVn3_5qC_CKeKSXTu963mM=n9-GxzF7KCPreTTMS+JGQ@mail.gmail.com>
>> On Thu, Mar 24, 2016 at 11:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> > On Wed, Mar 23, 2016 at 5:32 PM, Kyotaro HORIGUCHI
>> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> >>> I don't think it's so difficult to extend this version so that
>> >>> it supports also quorum commit.
>> >>
>> >> Mmm. I think I understand this just now. As Sawada-san said
>> >> before, all standbys in a single-level quorum set having the same
>> >> sync_standby_prioirity, the current algorithm works as it is. It
>> >> also true for the case that some quorum sets are in a priority
>> >> set.
>> >>
>> >> What about some priority sets in a quorum set?
>>
>> We should surely consider it that when we support more than 1 nest
>> level configuration.
>> IMO, we can have another information which indicates current sync
>> standbys instead of sync_priority.
>> For now, we are'nt trying to support even quorum method, so we could
>> consider it after we can support both priority method and quorum
>> method without incident.
>
> Fine with me.
>
>> >>> > StringInfo for double-quoted names seems to me to be overkill,
>> >>> > since it allocates 1024 byte block for every such name. A static
>> >>> > buffer seems enough for the usage as I said.
>> >>>
>> >>> So, what about changing the scanner code as follows?
>> >>>
>> >>> <xd>{xdstop} {
>> >>>                 yylval.str = pstrdup(xdbuf.data);
>> >>>                 pfree(xdbuf.data);
>> >>>                 BEGIN(INITIAL);
>> >>>                 return NAME;
>> >>>
>> >>> > The parser is called for not only for SIGHUP, but also for
>> >>> > starting of every walsender. The latter is not necessary but it
>> >>> > is the matter of trade-off between simplisity and
>> >>> > effectiveness.
>> >>>
>> >>> Could you elaborate why you think that's not necessary?
>> >>
>> >> Sorry, starting of walsender is not so large problem, 1024 bytes
>> >> memory is just abandoned once. SIGHUP is rather a problem.
>> >>
>> >> The part is called under two kinds of memory context, "config
>> >> file processing" then "Replication command context". The former
>> >> is deleted just after reading the config file so no harm but the
>> >> latter is a quite long-lasting context and every reloading bloats
>> >> the context with abandoned memory blocks. It is needed to be
>> >> pfreed or to use a memory context with shorter lifetime, or use
>> >> static storage of 64 byte-length, even though the bloat become
>> >> visible after very many times of conf reloads.
>> >
>> > SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that
>> > in the patch. Or am I missing something?
>
> Sorry, instead, the memory from strdup() will be abandoned in
> upper level. (Thinking for some time..) Ah, I found that the
> problem should be here.
>
>  > SyncRepFreeConfig(SyncRepConfigData *config)
>  > {
> ...
> !>      list_free(config->members);
>  >      pfree(config);
>  > }
>
> The list_free *doesn't* free the memory blocks pointed by
> lfirst(cell), which has been pstrdup'ed. It should be
> list_free_deep(config->members) instead to free it completely.

Yep, but SyncRepFreeConfig() already uses list_free_deep in the latest patch.
Could you read the latest version that I posted upthread?

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Thu, Mar 24, 2016 at 2:26 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Thu, 24 Mar 2016 13:04:49 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoBVn3_5qC_CKeKSXTu963mM=n9-GxzF7KCPreTTMS+JGQ@mail.gmail.com>
>> On Thu, Mar 24, 2016 at 11:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> > On Wed, Mar 23, 2016 at 5:32 PM, Kyotaro HORIGUCHI
>> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> >>> I don't think it's so difficult to extend this version so that
>> >>> it supports also quorum commit.
>> >>
>> >> Mmm. I think I understand this just now. As Sawada-san said
>> >> before, all standbys in a single-level quorum set having the same
>> >> sync_standby_prioirity, the current algorithm works as it is. It
>> >> also true for the case that some quorum sets are in a priority
>> >> set.
>> >>
>> >> What about some priority sets in a quorum set?
>>
>> We should surely consider it that when we support more than 1 nest
>> level configuration.
>> IMO, we can have another information which indicates current sync
>> standbys instead of sync_priority.
>> For now, we are'nt trying to support even quorum method, so we could
>> consider it after we can support both priority method and quorum
>> method without incident.
>
> Fine with me.
>
>> >>> > StringInfo for double-quoted names seems to me to be overkill,
>> >>> > since it allocates 1024 byte block for every such name. A static
>> >>> > buffer seems enough for the usage as I said.
>> >>>
>> >>> So, what about changing the scanner code as follows?
>> >>>
>> >>> <xd>{xdstop} {
>> >>>                 yylval.str = pstrdup(xdbuf.data);
>> >>>                 pfree(xdbuf.data);
>> >>>                 BEGIN(INITIAL);
>> >>>                 return NAME;
>> >>>
>> >>> > The parser is called for not only for SIGHUP, but also for
>> >>> > starting of every walsender. The latter is not necessary but it
>> >>> > is the matter of trade-off between simplisity and
>> >>> > effectiveness.
>> >>>
>> >>> Could you elaborate why you think that's not necessary?
>> >>
>> >> Sorry, starting of walsender is not so large problem, 1024 bytes
>> >> memory is just abandoned once. SIGHUP is rather a problem.
>> >>
>> >> The part is called under two kinds of memory context, "config
>> >> file processing" then "Replication command context". The former
>> >> is deleted just after reading the config file so no harm but the
>> >> latter is a quite long-lasting context and every reloading bloats
>> >> the context with abandoned memory blocks. It is needed to be
>> >> pfreed or to use a memory context with shorter lifetime, or use
>> >> static storage of 64 byte-length, even though the bloat become
>> >> visible after very many times of conf reloads.
>> >
>> > SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that
>> > in the patch. Or am I missing something?
>
> Sorry, instead, the memory from strdup() will be abandoned in
> upper level. (Thinking for some time..) Ah, I found that the
> problem should be here.
>
>  > SyncRepFreeConfig(SyncRepConfigData *config)
>  > {
> ...
> !>      list_free(config->members);
>  >      pfree(config);
>  > }
>
> The list_free *doesn't* free the memory blocks pointed by
> lfirst(cell), which has been pstrdup'ed. It should be
> list_free_deep(config->members) instead to free it completely.
>> >>> BTW, in previous patch, s_s_names is parsed by postmaster during the server
>> >>> startup. A child process takes over the internal data struct for the parsed
>> >>> s_s_names when it's forked by the postmaster. This is what the previous
>> >>> patch was expecting. However, this doesn't work in EXEC_BACKEND environment.
>> >>> In that environment, the data struct should be passed to a child process via
>> >>> the special file (like write_nondefault_variables() does), or it should
>> >>> be constructed during walsender startup (like latest version of the patch
>> >>> does). IMO the latter is simpler.
>> >>
>> >> Ah, I haven't notice that but I agree with it.
>> >>
>> >>
>> >> As per my previous comment, syncrep_scanner.l doesn't reject some
>> >> (nonprintable and multibyte) characters in a name, which is to be
>> >> silently replaced with '?' for application_name. It would not be
>> >> a problem for almost all of us but might be needed to be
>> >> documented if we won't change the behavior to be the same as
>> >> application_name.
>> >
>> > There are three options:
>> >
>> > 1. Replace nonprintable and non-ASCII characters in s_s_names with ?
>> > 2. Emit an error if s_s_names contains nonprintable and non-ASCII characters
>> > 3. Do nothing (9.5 or before behave in this way)
>> >
>> > You implied that we should choose #1 or #2?
>>
>> Previous(9.5 or before) s_s_names also accepts non-ASCII character and
>> non-printable character, and can show it without replacing these
>> character to '?'.
>
> Thank you for pointint it out (it was completely out of my
> mind..). I have no objection to keep the previous behavior.
>
>> From backward compatibility perspective, we should not choose #1 or #2.
>> Different behaviour between previous and current s_s_names is that
>> previous s_s_names doesn't accept the node name having the sort of
>> white-space character that isspace() returns true with.
>> But current s_s_names allows us to specify such a node name.
>> I guess that changing such behaviour is enough for fixing this issue.
>> Thoughts?
>

Attached is the latest patch incorporating all review comments so far.

Aside from the review comments, I made the following changes:
- Add logic to avoid fatal exit in yy_fatal_error().
- Improve regression test cases.

Also I felt a sense of discomfort about using [ and ] as the special
characters for the priority method.
Because ( ) and [ ] look a little similar to each other, it would
easily lead to many syntax errors once the nested style is supported.
And the synopsis of that in the documentation is odd:
    synchronous_standby_names = 'N [ node_name [, ...] ]'

This topic has already been discussed before, but we might want to
change them to other characters such as < and >?

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Thu, Mar 24, 2016 at 9:29 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Also I felt a sense of discomfort regarding using [ and ] as a special
> character for priority method.
> Because (, ) and [, ] are a little similar each other, so it would
> easily make many syntax errors when nested style is supported.
> And the synopsis of that in documentation is odd;
>     synchronous_standby_names = 'N [ node_name [, ...] ]'
>
> This topic has been already discussed before but, we might want to
> change it to other characters such as < and >?

I personally would recommend against <>.  Those should mean less-than
and greater-than, not grouping.  I think you could use parentheses,
().  There's nothing saying that has to mean any particular thing, so
you may as well use it for the first thing implemented, perhaps.  Or
you could use [] or {}.  It *is* important that you don't create
confusing syntax summaries, but I don't think that's a reason to pick
a nonstandard syntax for grouping.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Fri, Mar 25, 2016 at 9:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 24, 2016 at 9:29 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Also I felt a sense of discomfort regarding using [ and ] as a special
>> character for priority method.
>> Because (, ) and [, ] are a little similar each other, so it would
>> easily make many syntax errors when nested style is supported.
>> And the synopsis of that in documentation is odd;
>>     synchronous_standby_names = 'N [ node_name [, ...] ]'
>>
>> This topic has been already discussed before but, we might want to
>> change it to other characters such as < and >?
>
> I personally would recommend against <>.  Those should mean less-than
> and greater-than, not grouping.  I think you could use parentheses,
> ().  There's nothing saying that has to mean any particular thing, so
> you may as well use it for the first thing implemented, perhaps.  Or
> you could use [] or {}.  It *is* important that you don't create
> confusing syntax summaries, but I don't think that's a reason to pick
> a nonstandard syntax for grouping.
>

I agree with you.
I've changed it to use parentheses.
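With that change, the grouped form of the setting would look like, for
example:

synchronous_standby_names = '2(standby1, standby2, standby3)'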

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Thank you for the new patch. Sorry to have overlooked some
versions. I'm looking at the v19 patch now.

make complains about an unused variable.

| syncrep.c: In function ‘SyncRepGetSyncStandbys’:
| syncrep.c:601:13: warning: variable ‘next’ set but not used [-Wunused-but-set-variable]
|    ListCell *next;


At Thu, 24 Mar 2016 22:29:01 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCxwezOTf9kLQRhuf2y=1c_fGjCormqJfqHOmQW8EgaDg@mail.gmail.com>
> >> > SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that
> >> > in the patch. Or am I missing something?
> >
> > Sorry, instead, the memory from strdup() will be abandoned in
> > upper level. (Thinking for some time..) Ah, I found that the
> > problem should be here.
> >
> >  > SyncRepFreeConfig(SyncRepConfigData *config)
> >  > {
> > ...
> > !>      list_free(config->members);
> >  >      pfree(config);
> >  > }
> >
> > The list_free *doesn't* free the memory blocks pointed by
> > lfirst(cell), which has been pstrdup'ed. It should be
> > list_free_deep(config->members) instead to free it completely.

Fujii> Yep, but SyncRepFreeConfig() already uses list_free_deep
Fujii> in the latest patch.  Could you read the latest version
Fujii> that I posted upthread.

Sorry for having overlooked that version. Every pair of parse (or
SyncRepUpdateConfig) and SyncRepFreeConfig calls runs in the same memory
context, so it seems safe (but it might be fragile since it relies on
the caller doing so).

> >> Previous(9.5 or before) s_s_names also accepts non-ASCII character and
> >> non-printable character, and can show it without replacing these
> >> character to '?'.
> >
> > Thank you for pointint it out (it was completely out of my
> > mind..). I have no objection to keep the previous behavior.
> >
> >> From backward compatibility perspective, we should not choose #1 or #2.
> >> Different behaviour between previous and current s_s_names is that
> >> previous s_s_names doesn't accept the node name having the sort of
> >> white-space character that isspace() returns true with.
> >> But current s_s_names allows us to specify such a node name.
> >> I guess that changing such behaviour is enough for fixing this issue.
> >> Thoughts?
> >
> 
> Attached latest patch incorporating all review comments so far.
> 
> Aside from the review comments, I did following changes;
> - Add logic to avoid fatal exit in yy_fatal_error().

Maybe a good catch, but..

> syncrep_scanstr(const char *str)
..
>   * Regain control after a fatal, internal flex error.  It may have
>   * corrupted parser state.  Consequently, abandon the file, but trust
                                             ~~~~~~~~~~~~~~~~
>   * that the state remains sane enough for syncrep_yy_delete_buffer().
                                             ~~~~~~~~~~~~~~~~~~~~~~~~

guc-file.l actually abandons the config file, but syncrep_scanner
reads only the value of a single item in it. And the latter statement is
eventually true, but a bit hard to understand.

The patch will emit a mysterious error message like this.

> invalid value for parameter "synchronous_standby_names": "2[a,b,c]"
> configuration file ".../postgresql.conf" contains errors

This is utterly wrong. Somewhat related to that, it seems to me that
syncrep_scan.l doesn't need the same mechanism as
guc-file.l. The nature of the modification would be making
call_*_check_hook tri-state instead of boolean. So just
catching errors in call_*_check_hook and ereport()'ing as the SQL
parser does seems enough, but either will do for me.


> - Improve regression test cases.

I forgot to mention that, but an additional ORDER BY makes the test
robust.

I doubt the validity of the behavior in the following test.

> # Change the synchronous_standby_names = '2[standby1,*,standby2]' and check sync_state

Is this regarded as a correct value for it?


> Also I felt a sense of discomfort regarding using [ and ] as a special
> character for priority method.
> Because (, ) and [, ] are a little similar each other, so it would
> easily make many syntax errors when nested style is supported.
> And the synopsis of that in documentation is odd;
>     synchronous_standby_names = 'N [ node_name [, ...] ]'
> 
> This topic has been already discussed before but, we might want to
> change it to other characters such as < and >?

I don't mind either, but as Robert said, it is true that
characters conventionally used to enclose something should be
preferred over other characters. Distinguishability of glyphs has
less significance, perhaps.

# LISPers don't hesitate to dive into Sea of Parens.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Amit Langote
Date:
On 2016/03/28 17:50, Kyotaro HORIGUCHI wrote:
> 
> # LISPers don't hesitate to dive into Sea of Parens.

Sorry in advance to be off-topic: https://xkcd.com/297 :)

Thanks,
Amit





Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Thank you for the new patch. Sorry to have overlooked some
> versions. I'm looking the  v19 patch now.
>
> make complains for an unused variable.
>
> | syncrep.c: In function ‘SyncRepGetSyncStandbys’:
> | syncrep.c:601:13: warning: variable ‘next’ set but not used [-Wunused-but-set-variable]
> |    ListCell *next;
>
>
> At Thu, 24 Mar 2016 22:29:01 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCxwezOTf9kLQRhuf2y=1c_fGjCormqJfqHOmQW8EgaDg@mail.gmail.com>
>> >> > SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that
>> >> > in the patch. Or am I missing something?
>> >
>> > Sorry, instead, the memory from strdup() will be abandoned in
>> > upper level. (Thinking for some time..) Ah, I found that the
>> > problem should be here.
>> >
>> >  > SyncRepFreeConfig(SyncRepConfigData *config)
>> >  > {
>> > ...
>> > !>      list_free(config->members);
>> >  >      pfree(config);
>> >  > }
>> >
>> > The list_free *doesn't* free the memory blocks pointed by
>> > lfirst(cell), which has been pstrdup'ed. It should be
>> > list_free_deep(config->members) instead to free it completely.
>
> Fujii> Yep, but SyncRepFreeConfig() already uses list_free_deep
> Fujii> in the latest patch.  Could you read the latest version
> Fujii> that I posted upthread.
>
> Sorry for overlooked the version. Every pair of parse(or
> SyncRepUpdateConfig) and SyncRepFreeConfig is on the same memory
> context so it seems safe (but might be fragile since it relies on
> that the caller does so.).
>
>> >> Previous(9.5 or before) s_s_names also accepts non-ASCII character and
>> >> non-printable character, and can show it without replacing these
>> >> character to '?'.
>> >
>> > Thank you for pointint it out (it was completely out of my
>> > mind..). I have no objection to keep the previous behavior.
>> >
>> >> From backward compatibility perspective, we should not choose #1 or #2.
>> >> Different behaviour between previous and current s_s_names is that
>> >> previous s_s_names doesn't accept the node name having the sort of
>> >> white-space character that isspace() returns true with.
>> >> But current s_s_names allows us to specify such a node name.
>> >> I guess that changing such behaviour is enough for fixing this issue.
>> >> Thoughts?
>> >
>>
>> Attached latest patch incorporating all review comments so far.
>>
>> Aside from the review comments, I did following changes;
>> - Add logic to avoid fatal exit in yy_fatal_error().
>
> Maybe good catch, but..
>
>> syncrep_scanstr(const char *str)
> ..
>>   * Regain control after a fatal, internal flex error.  It may have
>>   * corrupted parser state.  Consequently, abandon the file, but trust
>                                              ~~~~~~~~~~~~~~~~
>>   * that the state remains sane enough for syncrep_yy_delete_buffer().
>                                              ~~~~~~~~~~~~~~~~~~~~~~~~
>
> guc-file.l actually abandones the config file but syncrep_scanner
> reads only a value of an item in it. And, the latter is
> eventually true but a bit hard to understand.
>
> The patch will emit a mysterious error message like this.
>
>> invalid value for parameter "synchronous_standby_names": "2[a,b,c]"
>> configuration file ".../postgresql.conf" contains errors
>
> This is utterly wrong. A bit related to that, it seems to me that
> syncrep_scan.l doesn't need the same mechanism with
> guc-file.l. The nature of the modification would be making
> call_*_check_hook to be tri-state instead of boolean. So just
> cathing errors in call_*_check_hook and ereport()'ing as SQL
> parser does seems enough, but either will do for me.

Well, I think that call_*_check_hook cannot catch such a fatal error,
because if yy_fatal_error() is called without the preventing logic when
reloading the configuration file, the postmaster process will exit
abnormally and immediately, as will the walsender process.

>
>> - Improve regression test cases.
>
> I forgot to mention that, but additionalORDER BY makes the test
> robust.
>
> I doubt the validity of the behavior in the following test.
>
>> # Change the synchronous_standby_names = '2[standby1,*,standby2]' and check sync_state
>
> Is this regarded as a correct as a value for it?

Since the previous s_s_names (9.5 or before) can accept this value, I
didn't change the behaviour.
And I added this test case to check backward compatibility more finely.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com>
sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Thank you for the new patch. Sorry to have overlooked some
> > versions. I'm looking the  v19 patch now.
> >
> > make complains for an unused variable.

Thank you. I'll have a closer look at it a bit later.

> >> Attached latest patch incorporating all review comments so far.
> >>
> >> Aside from the review comments, I did following changes;
> >> - Add logic to avoid fatal exit in yy_fatal_error().
> >
> > Maybe good catch, but..
> >
> >> syncrep_scanstr(const char *str)
> > ..
> >>   * Regain control after a fatal, internal flex error.  It may have
> >>   * corrupted parser state.  Consequently, abandon the file, but trust
> >                                              ~~~~~~~~~~~~~~~~
> >>   * that the state remains sane enough for syncrep_yy_delete_buffer().
> >                                              ~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > guc-file.l actually abandones the config file but syncrep_scanner
> > reads only a value of an item in it. And, the latter is
> > eventually true but a bit hard to understand.
> >
> > The patch will emit a mysterious error message like this.
> >
> >> invalid value for parameter "synchronous_standby_names": "2[a,b,c]"
> >> configuration file ".../postgresql.conf" contains errors
> >
> > This is utterly wrong. A bit related to that, it seems to me that
> > syncrep_scan.l doesn't need the same mechanism with
> > guc-file.l. The nature of the modification would be making
> > call_*_check_hook to be tri-state instead of boolean. So just
> > cathing errors in call_*_check_hook and ereport()'ing as SQL
> > parser does seems enough, but either will do for me.
> 
> Well, I think that call_*_check_hook can not catch such a fatal error.

As mentioned in my comment, the SQL parser converts yy_fatal_error
into ereport(ERROR), which can be caught by the upper PG_TRY (by
#define'ing fprintf). So it is doable if you mind exit().
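For reference, this is the trick as the core SQL scanner (scan.l)
implements it:

/*
 * How scan.l catches flex "fatal" errors: flex reports them via
 * fprintf(stderr, ...), so redefining fprintf inside the scanner turns
 * them into a catchable ereport(ERROR) instead of a hard exit().
 */
#define fprintf(file, fmt, msg)  fprintf_to_ereport(fmt, msg)

static void
fprintf_to_ereport(const char *fmt, const char *msg)
{
    ereport(ERROR, (errmsg_internal("%s", msg)));
}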

> Because if yy_fatal_error() is called without preventing logic when
> reloading configuration file, postmaster process will abnormal exit
> immediately as well as wal sender process.


> >> - Improve regression test cases.
> >
> > I forgot to mention that, but additionalORDER BY makes the test
> > robust.
> >
> > I doubt the validity of the behavior in the following test.
> >
> >> # Change the synchronous_standby_names = '2[standby1,*,standby2]' and check sync_state
> >
> > Is this regarded as a correct as a value for it?
> 
> Since previous s_s_names (9.5 or before) can accept this value, I
> didn't change behaviour.
> And I added this test case for checking backward compatibility more finely.

I understand that and it's fine. But we need an explanation of
the reason above in the test case or somewhere else.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com>
> sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > Thank you for the new patch. Sorry to have overlooked some
>> > versions. I'm looking the  v19 patch now.
>> >
>> > make complains for an unused variable.
>
> Thank you. I'll have a closer look on it a bit later.
>
>> >> Attached latest patch incorporating all review comments so far.
>> >>
>> >> Aside from the review comments, I did following changes;
>> >> - Add logic to avoid fatal exit in yy_fatal_error().
>> >
>> > Maybe good catch, but..
>> >
>> >> syncrep_scanstr(const char *str)
>> > ..
>> >>   * Regain control after a fatal, internal flex error.  It may have
>> >>   * corrupted parser state.  Consequently, abandon the file, but trust
>> >                                              ~~~~~~~~~~~~~~~~
>> >>   * that the state remains sane enough for syncrep_yy_delete_buffer().
>> >                                              ~~~~~~~~~~~~~~~~~~~~~~~~
>> >
>> > guc-file.l actually abandones the config file but syncrep_scanner
>> > reads only a value of an item in it. And, the latter is
>> > eventually true but a bit hard to understand.
>> >
>> > The patch will emit a mysterious error message like this.
>> >
>> >> invalid value for parameter "synchronous_standby_names": "2[a,b,c]"
>> >> configuration file ".../postgresql.conf" contains errors
>> >
>> > This is utterly wrong. A bit related to that, it seems to me that
>> > syncrep_scan.l doesn't need the same mechanism with
>> > guc-file.l. The nature of the modification would be making
>> > call_*_check_hook to be tri-state instead of boolean. So just
>> > cathing errors in call_*_check_hook and ereport()'ing as SQL
>> > parser does seems enough, but either will do for me.
>>
>> Well, I think that call_*_check_hook can not catch such a fatal error.
>
> As mentioned in my comment, SQL parser converts yy_fatal_error
> into ereport(ERROR), which can be caught by the upper PG_TRY (by
> #define'ing fprintf). So it is doable if you mind exit().

I'm afraid that your idea doesn't work in the postmaster, because ereport(ERROR)
is implicitly promoted to ereport(FATAL) there. IOW, when an internal
flex fatal error occurs, the postmaster just exits instead of jumping out of the parser.


ISTM that, when an internal flex fatal error occurs, it's better to elog(FATAL)
and terminate the problematic process. This might lead to a server crash
(e.g., if the postmaster emits a FATAL error, it and all its child processes will
exit soon). But we can probably live with this because such a fatal error
rarely happens.

OTOH, if we make the process keep running even after it gets an internal
fatal error (as Sawada's patch or your idea does), this might cause a more
serious problem. Imagine the case where one walsender gets the fatal
error (e.g., because of OOM), abandons the new setting value of
synchronous_standby_names, and keeps running with the previous value,
while the other walsender processes successfully parse the setting and
keep running with the new value. Each walsender would then be operating
on a different setting, which would completely mess up synchronous
replication.

Therefore, I think that it's better to make the problematic process exit
with a FATAL error rather than ignore the error and keep it running.
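
Concretely, I'm thinking of something like this in the syncrep scanner
(only a sketch; the function name follows the patch, the rest is
illustrative):

#undef fprintf
#define fprintf(file, fmt, msg)  syncrep_flex_fatal(fmt, msg)

static void
syncrep_flex_fatal(const char *fmt, const char *msg)
{
    /* Terminate this process rather than let it keep running with a
     * possibly inconsistent parser state. */
    ereport(FATAL,
            (errmsg_internal("%s", msg)));
}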

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
I personally don't think it needs such a survival measure. The
syntax is very small and the parser reads very short text. If the
parser fails in such a mode, something more serious must have
occurred.

At Tue, 29 Mar 2016 16:51:02 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwFth8pnYhaLBx0nF8o4qmwctdzEOcWRqEu7HOwgdJGa3g@mail.gmail.com>
> On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Hello,
> >
> > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com>
> > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI
> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > As mentioned in my comment, SQL parser converts yy_fatal_error
> > into ereport(ERROR), which can be caught by the upper PG_TRY (by
> > #define'ing fprintf). So it is doable if you mind exit().
> 
> I'm afraid that your idea doesn't work in postmaster. Because ereport(ERROR) is
> implicitly promoted to ereport(FATAL) in postmaster. IOW, when an internal
> flex fatal error occurs, postmaster just exits instead of jumping out of parser.

The ERROR could be LOG or DEBUG2 instead, if we think the parser's
fatal errors are recoverable. guc-file.l is doing so.

> ISTM that, when an internal flex fatal error occurs, it's
> better to elog(FATAL) and terminate the problematic
> process. This might lead to the server crash (e.g., if
> postmaster emits a FATAL error, it and its all child processes
> will exit soon). But probably we can live with this because the
> fatal error basically rarely happens.

I agree with this.

> OTOH, if we make the process keep running even after it gets an internal
> fatal error (like Sawada's patch or your idea do), this might cause more
> serious problem. Please imagine the case where one walsender gets the fatal
> error (e.g., because of OOM), abandon new setting value of
> synchronous_standby_names, and keep running with the previous setting value.
> OTOH, the other walsender processes successfully parse the setting and
> keep running with new setting. In this case, the inconsistency of the setting
> which each walsender is based on happens. This completely will mess up the
> synchronous replication.

On the other hand, guc-file.l seems to ignore parser errors under
normal operation, even though that may cause a similar inconsistency,
if any...

| LOG:  received SIGHUP, reloading configuration files
| LOG:  input in flex scanner failed at file "/home/horiguti/data/data_work/postgresql.conf" line 1
| LOG:  configuration file "/home/horiguti/data/data_work/postgresql.conf" contains errors; no changes were applied

> Therefore, I think that it's better to make the problematic process exit
> with FATAL error rather than ignore the error and keep it running.

+1. Restarting a walsender would be far less harmful than keeping
it running in a doubtful state.

Should I wait for the next version or have a look at the latest?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Tue, Mar 29, 2016 at 5:36 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I personally don't think it needs such a survive measure. It is
> very small syntax and the parser reads very short text. If the
> parser failes in such mode, something more serious should have
> occurred.
>
> At Tue, 29 Mar 2016 16:51:02 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwFth8pnYhaLBx0nF8o4qmwctdzEOcWRqEu7HOwgdJGa3g@mail.gmail.com>
>> On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > Hello,
>> >
>> > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com>
>> > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI
>> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > As mentioned in my comment, SQL parser converts yy_fatal_error
>> > into ereport(ERROR), which can be caught by the upper PG_TRY (by
>> > #define'ing fprintf). So it is doable if you mind exit().
>>
>> I'm afraid that your idea doesn't work in postmaster. Because ereport(ERROR) is
>> implicitly promoted to ereport(FATAL) in postmaster. IOW, when an internal
>> flex fatal error occurs, postmaster just exits instead of jumping out of parser.
>
> If The ERROR may be LOG or DEBUG2 either, if we think the parser
> fatal erros are recoverable. guc-file.l is doing so.
>
>> ISTM that, when an internal flex fatal error occurs, it's
>> better to elog(FATAL) and terminate the problematic
>> process. This might lead to the server crash (e.g., if
>> postmaster emits a FATAL error, it and its all child processes
>> will exit soon). But probably we can live with this because the
>> fatal error basically rarely happens.
>
> I agree to this
>
>> OTOH, if we make the process keep running even after it gets an internal
>> fatal error (like Sawada's patch or your idea do), this might cause more
>> serious problem. Please imagine the case where one walsender gets the fatal
>> error (e.g., because of OOM), abandon new setting value of
>> synchronous_standby_names, and keep running with the previous setting value.
>> OTOH, the other walsender processes successfully parse the setting and
>> keep running with new setting. In this case, the inconsistency of the setting
>> which each walsender is based on happens. This completely will mess up the
>> synchronous replication.
>
> On the other hand, guc-file.l seems ignoring parser errors under
> normal operation, even though it may cause similar inconsistency,
> if any..
>
> | LOG:  received SIGHUP, reloading configuration files
> | LOG:  input in flex scanner failed at file "/home/horiguti/data/data_work/postgresql.conf" line 1
> | LOG:  configuration file "/home/horiguti/data/data_work/postgresql.conf" contains errors; no changes were applied
>
>> Therefore, I think that it's better to make the problematic process exit
>> with FATAL error rather than ignore the error and keep it running.
>
> +1. Restarting walsender would be far less harmful than keeping
> it running in doubtful state.
>
> Sould I wait for the next version or have a look on the latest?
>

Attached is the latest patch, which incorporates the review comments so
far and is rebased against current HEAD.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Wed, Mar 30, 2016 at 11:43 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Mar 29, 2016 at 5:36 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> I personally don't think it needs such a survive measure. It is
>> very small syntax and the parser reads very short text. If the
>> parser failes in such mode, something more serious should have
>> occurred.
>>
>> At Tue, 29 Mar 2016 16:51:02 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwFth8pnYhaLBx0nF8o4qmwctdzEOcWRqEu7HOwgdJGa3g@mail.gmail.com>
>>> On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI
>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> > Hello,
>>> >
>>> > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com>
>>> > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI
>>> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> > As mentioned in my comment, SQL parser converts yy_fatal_error
>>> > into ereport(ERROR), which can be caught by the upper PG_TRY (by
>>> > #define'ing fprintf). So it is doable if you mind exit().
>>>
>>> I'm afraid that your idea doesn't work in postmaster. Because ereport(ERROR) is
>>> implicitly promoted to ereport(FATAL) in postmaster. IOW, when an internal
>>> flex fatal error occurs, postmaster just exits instead of jumping out of parser.
>>
>> If The ERROR may be LOG or DEBUG2 either, if we think the parser
>> fatal erros are recoverable. guc-file.l is doing so.
>>
>>> ISTM that, when an internal flex fatal error occurs, it's
>>> better to elog(FATAL) and terminate the problematic
>>> process. This might lead to the server crash (e.g., if
>>> postmaster emits a FATAL error, it and its all child processes
>>> will exit soon). But probably we can live with this because the
>>> fatal error basically rarely happens.
>>
>> I agree to this
>>
>>> OTOH, if we make the process keep running even after it gets an internal
>>> fatal error (like Sawada's patch or your idea do), this might cause more
>>> serious problem. Please imagine the case where one walsender gets the fatal
>>> error (e.g., because of OOM), abandon new setting value of
>>> synchronous_standby_names, and keep running with the previous setting value.
>>> OTOH, the other walsender processes successfully parse the setting and
>>> keep running with new setting. In this case, the inconsistency of the setting
>>> which each walsender is based on happens. This completely will mess up the
>>> synchronous replication.
>>
>> On the other hand, guc-file.l seems ignoring parser errors under
>> normal operation, even though it may cause similar inconsistency,
>> if any..
>>
>> | LOG:  received SIGHUP, reloading configuration files
>> | LOG:  input in flex scanner failed at file "/home/horiguti/data/data_work/postgresql.conf" line 1
>> | LOG:  configuration file "/home/horiguti/data/data_work/postgresql.conf" contains errors; no changes were applied
>>
>>> Therefore, I think that it's better to make the problematic process exit
>>> with FATAL error rather than ignore the error and keep it running.
>>
>> +1. Restarting walsender would be far less harmful than keeping
>> it running in doubtful state.
>>
>> Sould I wait for the next version or have a look on the latest?
>>
>
> Attached latest patch incorporate some review comments so far, and is
> rebased against current HEAD.
>

Sorry, I attached the wrong patch.
The attached patch is the correct one.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Thomas Munro
Date:
On Thu, Mar 31, 2016 at 3:55 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Mar 30, 2016 at 11:43 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Tue, Mar 29, 2016 at 5:36 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> I personally don't think it needs such a survive measure. It is
>>> very small syntax and the parser reads very short text. If the
>>> parser failes in such mode, something more serious should have
>>> occurred.
>>>
>>> At Tue, 29 Mar 2016 16:51:02 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwFth8pnYhaLBx0nF8o4qmwctdzEOcWRqEu7HOwgdJGa3g@mail.gmail.com>
>>>> On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI
>>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>> > Hello,
>>>> >
>>>> > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com>
>>>> > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI
>>>> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>> > As mentioned in my comment, SQL parser converts yy_fatal_error
>>>> > into ereport(ERROR), which can be caught by the upper PG_TRY (by
>>>> > #define'ing fprintf). So it is doable if you mind exit().
>>>>
>>>> I'm afraid that your idea doesn't work in postmaster. Because ereport(ERROR) is
>>>> implicitly promoted to ereport(FATAL) in postmaster. IOW, when an internal
>>>> flex fatal error occurs, postmaster just exits instead of jumping out of parser.
>>>
>>> If The ERROR may be LOG or DEBUG2 either, if we think the parser
>>> fatal erros are recoverable. guc-file.l is doing so.
>>>
>>>> ISTM that, when an internal flex fatal error occurs, it's
>>>> better to elog(FATAL) and terminate the problematic
>>>> process. This might lead to the server crash (e.g., if
>>>> postmaster emits a FATAL error, it and its all child processes
>>>> will exit soon). But probably we can live with this because the
>>>> fatal error basically rarely happens.
>>>
>>> I agree to this
>>>
>>>> OTOH, if we make the process keep running even after it gets an internal
>>>> fatal error (like Sawada's patch or your idea do), this might cause more
>>>> serious problem. Please imagine the case where one walsender gets the fatal
>>>> error (e.g., because of OOM), abandon new setting value of
>>>> synchronous_standby_names, and keep running with the previous setting value.
>>>> OTOH, the other walsender processes successfully parse the setting and
>>>> keep running with new setting. In this case, the inconsistency of the setting
>>>> which each walsender is based on happens. This completely will mess up the
>>>> synchronous replication.
>>>
>>> On the other hand, guc-file.l seems ignoring parser errors under
>>> normal operation, even though it may cause similar inconsistency,
>>> if any..
>>>
>>> | LOG:  received SIGHUP, reloading configuration files
>>> | LOG:  input in flex scanner failed at file "/home/horiguti/data/data_work/postgresql.conf" line 1
>>> | LOG:  configuration file "/home/horiguti/data/data_work/postgresql.conf" contains errors; no changes were applied
>>>
>>>> Therefore, I think that it's better to make the problematic process exit
>>>> with FATAL error rather than ignore the error and keep it running.
>>>
>>> +1. Restarting walsender would be far less harmful than keeping
>>> it running in doubtful state.
>>>
>>> Sould I wait for the next version or have a look on the latest?
>>>
>>
>> Attached latest patch incorporate some review comments so far, and is
>> rebased against current HEAD.
>>
>
> Sorry I attached wrong patch.
> Attached patch is correct patch.
>
> [mulit_sync_replication_v21.patch]

Here are some TPS numbers from some quick tests I ran on a set of
Amazon EC2 m3.large instances ("2 vCPU" virtual machines) configured
as primary + 3 standbys, to try out different combinations of
synchronous_commit levels and synchronous_standby_names numbers.  They
were run for a short time only and these are of course systems with
limited and perhaps uneven IO and CPU, but they still give some idea
of the trends.  And reassuringly, the trends are travelling in the
expected directions.

All default settings except shared_buffers = 1GB, and the GUCs
required for replication.

pgbench postgres -j2 -c2 -N bench2 -T 600
              1(*) 2(*) 3(*)
              ==== ==== ====
off          = 4056 4096 4092
local        = 1323 1299 1312
remote_write = 1130 1046  958
on           =  860  744  701
remote_apply =  785  725  604

pgbench postgres -j16 -c16 -N bench2 -T 600
              1(*) 2(*) 3(*)
              ==== ==== ====
off          = 3952 3943 3933
local        = 2964 2984 3026
remote_write = 2790 2724 2675
on           = 2731 2627 2523
remote_apply = 2627 2501 2432

One thing I noticed is that there are LOG messages telling me when a
standby becomes a synchronous standby, but nothing to tell me if a
standby stops being a standby (ie because a higher priority one has
taken its place in the quorum).  Would that be interesting?

Also, I spotted some tiny mistakes:

+  <indexterm zone="high-availability">
+   <primary>Dedicated language for multiple synchornous replication</primary>
+  </indexterm>

s/synchornous/synchronous/

+ /*
+ * If we are managing the sync standby, though we weren't
+ * prior to this, then announce we are now the sync standby.
+ */

s/ the / a / (two occurrences)

+ ereport(LOG,
+ (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
+ application_name, MyWalSnd->sync_standby_priority)));

s/ the / a /

    offered by a transaction commit. This level of protection is referred
-    to as 2-safe replication in computer science theory.
+    to as 2-safe replication in computer science theory, and group-1-safe
+    (group-safe and 1-safe) when <varname>synchronous_commit</> is set to
+    more than <literal>remote_write</>.

Why "more than"?  I think those two words should be changed to "at
least", or removed.

+   <para>
+    This syntax allows us to define a synchronous group that will wait for at
+    least N standbys of them, and a comma-separated list of group
members that are surrounded by
+    parantheses.  The special value <literal>*</> for server name
matches any standby.
+    By surrounding list of group members using parantheses,
synchronous standbys are chosen from
+    that group using priority method.
+   </para>

s/parantheses/parentheses/ (two occurrences)

+  <sect2 id="dedicated-language-for-multi-sync-replication-priority">
+   <title>Prioirty Method</title>

s/Prioirty Method/Priority Method/

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Support for N synchronous standby servers - take 2

From
Thomas Munro
Date:
On Thu, Mar 31, 2016 at 5:11 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Mar 31, 2016 at 3:55 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Wed, Mar 30, 2016 at 11:43 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Tue, Mar 29, 2016 at 5:36 PM, Kyotaro HORIGUCHI
>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>> I personally don't think it needs such a survive measure. It is
>>>> very small syntax and the parser reads very short text. If the
>>>> parser failes in such mode, something more serious should have
>>>> occurred.
>>>>
>>>> At Tue, 29 Mar 2016 16:51:02 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwFth8pnYhaLBx0nF8o4qmwctdzEOcWRqEu7HOwgdJGa3g@mail.gmail.com>
>>>>> On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI
>>>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>>> > Hello,
>>>>> >
>>>>> > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com>
>>>>> > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI
>>>>> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>>> > As mentioned in my comment, SQL parser converts yy_fatal_error
>>>>> > into ereport(ERROR), which can be caught by the upper PG_TRY (by
>>>>> > #define'ing fprintf). So it is doable if you mind exit().
>>>>>
>>>>> I'm afraid that your idea doesn't work in postmaster. Because ereport(ERROR) is
>>>>> implicitly promoted to ereport(FATAL) in postmaster. IOW, when an internal
>>>>> flex fatal error occurs, postmaster just exits instead of jumping out of parser.
>>>>
>>>> If The ERROR may be LOG or DEBUG2 either, if we think the parser
>>>> fatal erros are recoverable. guc-file.l is doing so.
>>>>
>>>>> ISTM that, when an internal flex fatal error occurs, it's
>>>>> better to elog(FATAL) and terminate the problematic
>>>>> process. This might lead to the server crash (e.g., if
>>>>> postmaster emits a FATAL error, it and its all child processes
>>>>> will exit soon). But probably we can live with this because the
>>>>> fatal error basically rarely happens.
>>>>
>>>> I agree to this
>>>>
>>>>> OTOH, if we make the process keep running even after it gets an internal
>>>>> fatal error (like Sawada's patch or your idea do), this might cause more
>>>>> serious problem. Please imagine the case where one walsender gets the fatal
>>>>> error (e.g., because of OOM), abandon new setting value of
>>>>> synchronous_standby_names, and keep running with the previous setting value.
>>>>> OTOH, the other walsender processes successfully parse the setting and
>>>>> keep running with new setting. In this case, the inconsistency of the setting
>>>>> which each walsender is based on happens. This completely will mess up the
>>>>> synchronous replication.
>>>>
>>>> On the other hand, guc-file.l seems ignoring parser errors under
>>>> normal operation, even though it may cause similar inconsistency,
>>>> if any..
>>>>
>>>> | LOG:  received SIGHUP, reloading configuration files
>>>> | LOG:  input in flex scanner failed at file "/home/horiguti/data/data_work/postgresql.conf" line 1
>>>> | LOG:  configuration file "/home/horiguti/data/data_work/postgresql.conf" contains errors; no changes were applied
>>>>
>>>>> Therefore, I think that it's better to make the problematic process exit
>>>>> with FATAL error rather than ignore the error and keep it running.
>>>>
>>>> +1. Restarting walsender would be far less harmful than keeping
>>>> it running in doubtful state.
>>>>
>>>> Sould I wait for the next version or have a look on the latest?
>>>>
>>>
>>> Attached latest patch incorporate some review comments so far, and is
>>> rebased against current HEAD.
>>>
>>
>> Sorry I attached wrong patch.
>> Attached patch is correct patch.
>>
>> [mulit_sync_replication_v21.patch]
>
> Here are some TPS numbers from some quick tests I ran on a set of
> Amazon EC2 m3.large instances ("2 vCPU" virtual machines) configured
> as primary + 3 standbys, to try out different combinations of
> synchronous_commit levels and synchronous_standby_names numbers.  They
> were run for a short time only and these are of course systems with
> limited and perhaps uneven IO and CPU, but they still give some idea
> of the trends.  And reassuringly, the trends are travelling in the
> expected directions.
>
> All default settings except shared_buffers = 1GB, and the GUCs
> required for replication.
>
> pgbench postgres -j2 -c2 -N bench2 -T 600
>
>                1(*) 2(*) 3(*)
>                ==== ==== ====
> off          = 4056 4096 4092
> local        = 1323 1299 1312
> remote_write = 1130 1046  958
> on           =  860  744  701
> remote_apply =  785  725  604
>
> pgbench postgres -j16 -c16 -N bench2 -T 600
>
>                1(*) 2(*) 3(*)
>                ==== ==== ====
> off          = 3952 3943 3933
> local        = 2964 2984 3026
> remote_write = 2790 2724 2675
> on           = 2731 2627 2523
> remote_apply = 2627 2501 2432
>
> One thing I noticed is that there are LOG messages telling me when a
> standby becomes a synchronous standby, but nothing to tell me if a
> standby stops being a standby (ie because a higher priority one has
> taken its place in the quorum).  Would that be interesting?
>
> Also, I spotted some tiny mistakes:
>
> +  <indexterm zone="high-availability">
> +   <primary>Dedicated language for multiple synchornous replication</primary>
> +  </indexterm>
>
> s/synchornous/synchronous/
>
> + /*
> + * If we are managing the sync standby, though we weren't
> + * prior to this, then announce we are now the sync standby.
> + */
>
> s/ the / a / (two occurrences)
>
> + ereport(LOG,
> + (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
> + application_name, MyWalSnd->sync_standby_priority)));
>
> s/ the / a /
>
>      offered by a transaction commit. This level of protection is referred
> -    to as 2-safe replication in computer science theory.
> +    to as 2-safe replication in computer science theory, and group-1-safe
> +    (group-safe and 1-safe) when <varname>synchronous_commit</> is set to
> +    more than <literal>remote_write</>.
>
> Why "more than"?  I think those two words should be changed to "at
> least", or removed.
>
> +   <para>
> +    This syntax allows us to define a synchronous group that will wait for at
> +    least N standbys of them, and a comma-separated list of group
> members that are surrounded by
> +    parantheses.  The special value <literal>*</> for server name
> matches any standby.
> +    By surrounding list of group members using parantheses,
> synchronous standbys are chosen from
> +    that group using priority method.
> +   </para>
>
> s/parantheses/parentheses/ (two occurrences)
>
> +  <sect2 id="dedicated-language-for-multi-sync-replication-priority">
> +   <title>Prioirty Method</title>
>
> s/Prioirty Method/Priority Method/

A couple more comments:
 /*
- * If we aren't managing the highest priority standby then just leave.
+ * If the number of sync standbys is less than requested or we aren't
+ * managing the sync standby then just leave.
  */
- if (syncWalSnd != MyWalSnd)
+ if (!got_oldest || !am_sync)

s/ the sync / a sync /

+ /*
+ * Consider all pending standbys as sync if the number of them plus
+ * already-found sync ones is lower than the configuration requests.
+ */
+ if (list_length(result) + list_length(pending) <= SyncRepConfig->num_sync)
+ return list_concat(result, pending);

The cells from 'pending' will be attached to 'result', and 'result'
will be freed by the caller.  But won't the List header object from
'pending' be leaked?

+ result = lappend_int(result, i);
+ if (list_length(result) == SyncRepConfig->num_sync)
+ {
+ list_free(pending);
+ return result; /* Exit if got enough sync standbys */
+ }

If we didn't take the early return in the list-not-long-enough case
mentioned above, we should *always* exit via this return statement,
right?  Since we know that the pending list had enough elements to
reach num_sync.  I think that is worth a comment, and also a "not
reached" comment at the bottom of the function, if it is true.

As a future improvement, I wonder if we could avoid recomputing the
current set of sync standbys in every walsender every time we call
SyncRepReleaseWaiters, perhaps by maintaining that set incrementally
in shmem when walsender states change etc.

I don't have any other comments, other than to say: thank you to all
the people who have contributed to this feature so far and I really
really hope it goes into 9.6!

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Sat, Apr 2, 2016 at 10:20 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Mar 31, 2016 at 5:11 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Thu, Mar 31, 2016 at 3:55 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Wed, Mar 30, 2016 at 11:43 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> On Tue, Mar 29, 2016 at 5:36 PM, Kyotaro HORIGUCHI
>>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>>> I personally don't think it needs such a survive measure. It is
>>>>> very small syntax and the parser reads very short text. If the
>>>>> parser failes in such mode, something more serious should have
>>>>> occurred.
>>>>>
>>>>> At Tue, 29 Mar 2016 16:51:02 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwFth8pnYhaLBx0nF8o4qmwctdzEOcWRqEu7HOwgdJGa3g@mail.gmail.com>
>>>>>> On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI
>>>>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>>>> > Hello,
>>>>>> >
>>>>>> > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com>
>>>>>> > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI
>>>>>> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>>>> > As mentioned in my comment, SQL parser converts yy_fatal_error
>>>>>> > into ereport(ERROR), which can be caught by the upper PG_TRY (by
>>>>>> > #define'ing fprintf). So it is doable if you mind exit().
>>>>>>
>>>>>> I'm afraid that your idea doesn't work in postmaster. Because ereport(ERROR) is
>>>>>> implicitly promoted to ereport(FATAL) in postmaster. IOW, when an internal
>>>>>> flex fatal error occurs, postmaster just exits instead of jumping out of parser.
>>>>>
>>>>> If The ERROR may be LOG or DEBUG2 either, if we think the parser
>>>>> fatal erros are recoverable. guc-file.l is doing so.
>>>>>
>>>>>> ISTM that, when an internal flex fatal error occurs, it's
>>>>>> better to elog(FATAL) and terminate the problematic
>>>>>> process. This might lead to the server crash (e.g., if
>>>>>> postmaster emits a FATAL error, it and its all child processes
>>>>>> will exit soon). But probably we can live with this because the
>>>>>> fatal error basically rarely happens.
>>>>>
>>>>> I agree to this
>>>>>
>>>>>> OTOH, if we make the process keep running even after it gets an internal
>>>>>> fatal error (like Sawada's patch or your idea do), this might cause more
>>>>>> serious problem. Please imagine the case where one walsender gets the fatal
>>>>>> error (e.g., because of OOM), abandon new setting value of
>>>>>> synchronous_standby_names, and keep running with the previous setting value.
>>>>>> OTOH, the other walsender processes successfully parse the setting and
>>>>>> keep running with new setting. In this case, the inconsistency of the setting
>>>>>> which each walsender is based on happens. This completely will mess up the
>>>>>> synchronous replication.
>>>>>
>>>>> On the other hand, guc-file.l seems ignoring parser errors under
>>>>> normal operation, even though it may cause similar inconsistency,
>>>>> if any..
>>>>>
>>>>> | LOG:  received SIGHUP, reloading configuration files
>>>>> | LOG:  input in flex scanner failed at file "/home/horiguti/data/data_work/postgresql.conf" line 1
>>>>> | LOG:  configuration file "/home/horiguti/data/data_work/postgresql.conf" contains errors; no changes were applied
>>>>>
>>>>>> Therefore, I think that it's better to make the problematic process exit
>>>>>> with FATAL error rather than ignore the error and keep it running.
>>>>>
>>>>> +1. Restarting walsender would be far less harmful than keeping
>>>>> it running in doubtful state.
>>>>>
>>>>> Sould I wait for the next version or have a look on the latest?
>>>>>
>>>>
>>>> Attached latest patch incorporate some review comments so far, and is
>>>> rebased against current HEAD.
>>>>
>>>
>>> Sorry I attached wrong patch.
>>> Attached patch is correct patch.

Thanks for updating the patch!

I applied the following changes to the patch.
Attached is the revised version of the patch.

- Changed syncrep_flex_fatal() so that it just calls ereport(FATAL), based on
  the recent discussion with Horiguchi-san.
- Improved the documentation.
- Fixed some bugs.
- Removed the changes for recovery testing framework. I'd like to commit
   those changes later separately from the main patch of multiple sync rep.

Barring any objections, I'll commit this patch.

>> One thing I noticed is that there are LOG messages telling me when a
>> standby becomes a synchronous standby, but nothing to tell me if a
>> standby stops being a standby (ie because a higher priority one has
>> taken its place in the quorum).  Would that be interesting?

+1

>> Also, I spotted some tiny mistakes:
>>
>> +  <indexterm zone="high-availability">
>> +   <primary>Dedicated language for multiple synchornous replication</primary>
>> +  </indexterm>
>>
>> s/synchornous/synchronous/

Confirmed that there is no typo "synchornous" in the latest patch.

>> + /*
>> + * If we are managing the sync standby, though we weren't
>> + * prior to this, then announce we are now the sync standby.
>> + */
>>
>> s/ the / a / (two occurrences)

Fixed.

>> + ereport(LOG,
>> + (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
>> + application_name, MyWalSnd->sync_standby_priority)));
>>
>> s/ the / a /

I have no objection to this change itself. But we have used this message
in 9.5 and before, so if we apply this change, we would probably need
to back-patch it.

>>
>>      offered by a transaction commit. This level of protection is referred
>> -    to as 2-safe replication in computer science theory.
>> +    to as 2-safe replication in computer science theory, and group-1-safe
>> +    (group-safe and 1-safe) when <varname>synchronous_commit</> is set to
>> +    more than <literal>remote_write</>.
>>
>> Why "more than"?  I think those two words should be changed to "at
>> least", or removed.

Removed.

>> +   <para>
>> +    This syntax allows us to define a synchronous group that will wait for at
>> +    least N standbys of them, and a comma-separated list of group
>> members that are surrounded by
>> +    parantheses.  The special value <literal>*</> for server name
>> matches any standby.
>> +    By surrounding list of group members using parantheses,
>> synchronous standbys are chosen from
>> +    that group using priority method.
>> +   </para>
>>
>> s/parantheses/parentheses/ (two occurrences)

Confirmed that this typo doesn't exist in the latest patch.

>>
>> +  <sect2 id="dedicated-language-for-multi-sync-replication-priority">
>> +   <title>Prioirty Method</title>
>>
>> s/Prioirty Method/Priority Method/

Confirmed that this typo doesn't exist in the latest patch.

> A couple more comments:
>
>   /*
> - * If we aren't managing the highest priority standby then just leave.
> + * If the number of sync standbys is less than requested or we aren't
> + * managing the sync standby then just leave.
>   */
> - if (syncWalSnd != MyWalSnd)
> + if (!got_oldest || !am_sync)
>
> s/ the sync / a sync /

Fixed.

> + /*
> + * Consider all pending standbys as sync if the number of them plus
> + * already-found sync ones is lower than the configuration requests.
> + */
> + if (list_length(result) + list_length(pending) <= SyncRepConfig->num_sync)
> + return list_concat(result, pending);
>
> The cells from 'pending' will be attached to 'result', and 'result'
> will be freed by the caller.  But won't the List header object from
> 'pending' be leaked?

Yes if 'result' is not NIL. I added pfree(pending) for that case.
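
For clarity, the fixed block now looks roughly like this (a sketch;
'needfree' guards the case where list_concat() simply returns 'pending'
itself because 'result' is NIL):

    if (list_length(result) + list_length(pending) <= SyncRepConfig->num_sync)
    {
        bool    needfree = (result != NIL && pending != NIL);

        result = list_concat(result, pending);
        if (needfree)
            pfree(pending);

        return result;
    }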

> + result = lappend_int(result, i);
> + if (list_length(result) == SyncRepConfig->num_sync)
> + {
> + list_free(pending);
> + return result; /* Exit if got enough sync standbys */
> + }
>
> If we didn't take the early return in the list-not-long-enough case
> mentioned above, we should *always* exit via this return statement,
> right?  Since we know that the pending list had enough elements to
> reach num_sync.  I think that is worth a comment, and also a "not
> reached" comment at the bottom of the function, if it is true.

Good catch! I added the comments. Also added Assert(false) at
the bottom of the function.
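
So the function tail now looks roughly like this (a sketch):

    /*
     * Never reached: either the early return above fired, or the loop
     * collected num_sync standbys and returned from inside it.
     */
    Assert(false);

    return result;      /* keep compiler quiet */
}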

> As a future improvement, I wonder if we could avoid recomputing the
> current set of sync standbys in every walsender every time we call
> SyncRepReleaseWaiters, perhaps by maintaining that set incrementally
> in shmem when walsender states change etc.

+1

> I don't have any other comments, other than to say: thank you to all
> the people who have contributed to this feature so far and I really
> really hope it goes into 9.6!

+1000

Regards,

--
Fujii Masao

Attachment

Re: Support for N synchronous standby servers - take 2

From
Abhijit Menon-Sen
Date:
At 2016-04-04 17:28:07 +0900, masao.fujii@gmail.com wrote:
>
> Barring any objections, I'll commit this patch.

No objections, just a minor wording tweak:

doc/src/sgml/config.sgml:

"The synchronous standbys will be the standbys that their names appear
early in this list" should be "The synchronous standbys will be those
whose names appear earlier in this list".

doc/src/sgml/high-availability.sgml:

"The standbys that their names appear early in this list are given
higher priority and will be considered as synchronous" should be "The
standbys whose names appear earlier in the list are given higher
priority and will be considered as synchronous".

"The standbys that their names appear early in the list will be used as
the synchronous standby" should be "The standbys whose names appear
earlier in the list will be used as synchronous standbys".

You may prefer to reword this in some other way, but the current "that
their names appear" wording should be changed.

-- Abhijit



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello, thank you for testing.

At Sat, 2 Apr 2016 14:20:55 +1300, Thomas Munro <thomas.munro@enterprisedb.com> wrote in
<CAEepm=2sdDL2hs3XbWb5FORegNHBObLJ-8C2=aaeG-riZTd2Rw@mail.gmail.com>
> >>> Attached latest patch incorporate some review comments so far, and is
> >>> rebased against current HEAD.
> >>>
> >>
> >> Sorry I attached wrong patch.
> >> Attached patch is correct patch.
> >>
> >> [mulit_sync_replication_v21.patch]
> >
> > Here are some TPS numbers from some quick tests I ran on a set of
> > Amazon EC2 m3.large instances ("2 vCPU" virtual machines) configured
> > as primary + 3 standbys, to try out different combinations of
> > synchronous_commit levels and synchronous_standby_names numbers.  They
> > were run for a short time only and these are of course systems with
> > limited and perhaps uneven IO and CPU, but they still give some idea
> > of the trends.  And reassuringly, the trends are travelling in the
> > expected directions.
> >
> > All default settings except shared_buffers = 1GB, and the GUCs
> > required for replication.
> >
> > pgbench postgres -j2 -c2 -N bench2 -T 600
> >
> >                1(*) 2(*) 3(*)
> >                ==== ==== ====
> > off          = 4056 4096 4092
> > local        = 1323 1299 1312
> > remote_write = 1130 1046  958
> > on           =  860  744  701
> > remote_apply =  785  725  604
> >
> > pgbench postgres -j16 -c16 -N bench2 -T 600
> >
> >                1(*) 2(*) 3(*)
> >                ==== ==== ====
> > off          = 3952 3943 3933
> > local        = 2964 2984 3026
> > remote_write = 2790 2724 2675
> > on           = 2731 2627 2523
> > remote_apply = 2627 2501 2432
> >
> > One thing I noticed is that there are LOG messages telling me when a
> > standby becomes a synchronous standby, but nothing to tell me if a
> > standby stops being a standby (ie because a higher priority one has
> > taken its place in the quorum).  Would that be interesting?

A walsender exits via proc_exit() on any operational
termination, so wrapping proc_exit() should work. (Attached file 1)

For the setting "2(Sby1, Sby2, Sby3)", the master says that all
of the standbys are sync standbys, and no message is emitted on
failure of Sby1, which should cause a promotion of Sby3.

>  standby "Sby3" is now the synchronous standby with priority 3
>  standby "Sby2" is now the synchronous standby with priority 2
>  standby "Sby1" is now the synchronous standby with priority 1
..<Sby 1 failure>
>  standby "Sby3" is now the synchronous standby with priority 3

Sby3 becomes a sync standby twice :p

This is a behavior taken over from the single-sync-rep era, but
it would be confusing with the new sync-rep selection mechanism.
The second attached diff changes this to the following.


> 17:48:21.969 LOG:  standby "Sby3" is now a synchronous standby with priority 3
> 17:48:23.087 LOG:  standby "Sby2" is now a synchronous standby with priority 2
> 17:48:25.617 LOG:  standby "Sby1" is now a synchronous standby with priority 1
> 17:48:31.990 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
> 17:48:43.905 LOG:  standby "Sby3" is now a synchronous standby with priority 3
> 17:49:10.262 LOG:  standby "Sby1" is now a synchronous standby with priority 1
> 17:49:13.865 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3

Since this status check takes place for every reply from the
standbys, the message about downgrading to "potential" may be
deferred or even fail to occur, but that should be no problem.

With both of the above patches applied, the messages look like
the following.

> 17:54:08.367 LOG:  standby "Sby3" is now a synchronous standby with priority 3
> 17:54:08.564 LOG:  standby "Sby1" is now a synchronous standby with priority 1
> 17:54:08.565 LOG:  standby "Sby2" is now a synchronous standby with priority 2
> 17:54:18.387 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
> 17:54:28.887 LOG:  synchronous standby "Sby1" with priority 1 exited
> 17:54:31.359 LOG:  standby "Sby3" is now a synchronous standby with priority 3
> 17:54:39.008 LOG:  standby "Sby1" is now a synchronous standby with priority 1
> 17:54:41.382 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3

Does this make sense?

By the way, Sawada-san, you have changed the parentheses for the
priority method from '[]' to '()'. I mistakenly defined
s_s_names as '2[Sby1, Sby2, Sby3]' and got wrong behavior, that
is, only Sby2 is registered as a mandatory synchronous standby.

For this case, the three members of SyncRepConfig are '2[Sby1,',
'Sby2', 'Sby3]'. This syntax is valid under the current
specification but will surely acquire a different meaning through
future changes. We should refuse this known-to-be-wrong-in-future
syntax from now on.

And this error was very hard to notice: pg_settings only shows the
raw string itself.

=# select name, setting from pg_settings where name = 'synchronous_standby_names';
           name            |       setting
---------------------------+---------------------
 synchronous_standby_names | 2[Sby1, Sby2, Sby3]
(1 row)


Since the syntax is no longer so simple, we may need some means
to see the current standby-group setting clearly, but it won't be
needed if we refuse the known-to-be-wrong-in-future syntax now.


> > Also, I spotted some tiny mistakes:
> >
> > +  <indexterm zone="high-availability">
> > +   <primary>Dedicated language for multiple synchornous replication</primary>
> > +  </indexterm>
> >
> > s/synchornous/synchronous/
> >
> > + /*
> > + * If we are managing the sync standby, though we weren't
> > + * prior to this, then announce we are now the sync standby.
> > + */
> >
> > s/ the / a / (two occurrences)
> >
> > + ereport(LOG,
> > + (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
> > + application_name, MyWalSnd->sync_standby_priority)));
> >
> > s/ the / a /
> >
> >      offered by a transaction commit. This level of protection is referred
> > -    to as 2-safe replication in computer science theory.
> > +    to as 2-safe replication in computer science theory, and group-1-safe
> > +    (group-safe and 1-safe) when <varname>synchronous_commit</> is set to
> > +    more than <literal>remote_write</>.
> >
> > Why "more than"?  I think those two words should be changed to "at
> > least", or removed.
> >
> > +   <para>
> > +    This syntax allows us to define a synchronous group that will wait for at
> > +    least N standbys of them, and a comma-separated list of group
> > members that are surrounded by
> > +    parantheses.  The special value <literal>*</> for server name
> > matches any standby.
> > +    By surrounding list of group members using parantheses,
> > synchronous standbys are chosen from
> > +    that group using priority method.
> > +   </para>
> >
> > s/parantheses/parentheses/ (two occurrences)
> >
> > +  <sect2 id="dedicated-language-for-multi-sync-replication-priority">
> > +   <title>Prioirty Method</title>
> >
> > s/Prioirty Method/Priority Method/
> 
> A couple more comments:
> 
>   /*
> - * If we aren't managing the highest priority standby then just leave.
> + * If the number of sync standbys is less than requested or we aren't
> + * managing the sync standby then just leave.
>   */
> - if (syncWalSnd != MyWalSnd)
> + if (!got_oldest || !am_sync)
> 
> s/ the sync / a sync /
> 
> + /*
> + * Consider all pending standbys as sync if the number of them plus
> + * already-found sync ones is lower than the configuration requests.
> + */
> + if (list_length(result) + list_length(pending) <= SyncRepConfig->num_sync)
> + return list_concat(result, pending);
> 
> The cells from 'pending' will be attached to 'result', and 'result'
> will be freed by the caller.  But won't the List header object from
> 'pending' be leaked?
> 
> + result = lappend_int(result, i);
> + if (list_length(result) == SyncRepConfig->num_sync)
> + {
> + list_free(pending);
> + return result; /* Exit if got enough sync standbys */
> + }
> 
> If we didn't take the early return in the list-not-long-enough case
> mentioned above, we should *always* exit via this return statement,
> right?  Since we know that the pending list had enough elements to
> reach num_sync.  I think that is worth a comment, and also a "not
> reached" comment at the bottom of the function, if it is true.
> 
> As a future improvement, I wonder if we could avoid recomputing the
> current set of sync standbys in every walsender every time we call
> SyncRepReleaseWaiters, perhaps by maintaining that set incrementally
> in shmem when walsender states change etc.
> 
> I don't have any other comments, other than to say: thank you to all
> the people who have contributed to this feature so far and I really
> really hope it goes into 9.6!

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0867cc4..77d24f5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -184,6 +184,8 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+static void walsnd_proc_exit(int code);
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -242,6 +244,23 @@ InitWalSender(void)
     SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
 }
 
+static void
+walsnd_proc_exit(int code)
+{
+    WalSnd     *walsnd = MyWalSnd;
+    int         mypriority = 0;
+
+    SpinLockAcquire(&walsnd->mutex);
+    mypriority = walsnd->sync_standby_priority;
+    SpinLockRelease(&walsnd->mutex);
+
+    if (mypriority > 0)
+        ereport(LOG,
+                (errmsg("synchronous standby \"%s\" with priority %d exited",
+                        application_name, mypriority)));
+    proc_exit(code);
+}
+
 /*
  * Clean up after an error.
  *
@@ -266,7 +285,7 @@ WalSndErrorCleanup(void)
     replication_active = false;
 
     if (walsender_ready_to_stop)
-        proc_exit(0);
+        walsnd_proc_exit(0);
 
     /* Revert back to startup state */
     WalSndSetState(WALSNDSTATE_STARTUP);
@@ -285,7 +304,7 @@ WalSndShutdown(void)
     if (whereToSendOutput == DestRemote)
         whereToSendOutput = DestNone;
 
-    proc_exit(0);
+    walsnd_proc_exit(0);
     abort();                    /* keep the compiler quiet */
 }
@@ -673,7 +692,7 @@ StartReplication(StartReplicationCmd *cmd)
         replication_active = false;
 
         if (walsender_ready_to_stop)
-            proc_exit(0);
+            walsnd_proc_exit(0);
 
         WalSndSetState(WALSNDSTATE_STARTUP);
         Assert(streamingDoneSending && streamingDoneReceiving);
@@ -1027,7 +1046,7 @@ StartLogicalReplication(StartReplicationCmd *cmd)
     replication_active = false;
 
     if (walsender_ready_to_stop)
-        proc_exit(0);
+        walsnd_proc_exit(0);
 
     WalSndSetState(WALSNDSTATE_STARTUP);
 
     /* Get out of COPY mode (CommandComplete). */
@@ -1391,7 +1410,7 @@ ProcessRepliesIfAny(void)
             ereport(COMMERROR,
                     (errcode(ERRCODE_PROTOCOL_VIOLATION),
                      errmsg("unexpected EOF on standby connection")));
-            proc_exit(0);
+            walsnd_proc_exit(0);
         }
         if (r == 0)
         {
@@ -1407,7 +1426,7 @@ ProcessRepliesIfAny(void)
             ereport(COMMERROR,
                     (errcode(ERRCODE_PROTOCOL_VIOLATION),
                      errmsg("unexpected EOF on standby connection")));
-            proc_exit(0);
+            walsnd_proc_exit(0);
         }
 
         /*
@@ -1453,7 +1472,7 @@ ProcessRepliesIfAny(void)
                  * 'X' means that the standby is closing down the socket.
                  */
             case 'X':
-                proc_exit(0);
+                walsnd_proc_exit(0);
 
             default:
                 ereport(FATAL,
@@ -1500,7 +1519,7 @@ ProcessStandbyMessage(void)
             ereport(COMMERROR,
                     (errcode(ERRCODE_PROTOCOL_VIOLATION),
                      errmsg("unexpected message type \"%c\"", msgtype)));
-            proc_exit(0);
+            walsnd_proc_exit(0);
     }
 }
@@ -2501,7 +2520,7 @@ WalSndDone(WalSndSendDataCallback send_data)
         EndCommand("COPY 0", DestRemote);
         pq_flush();
 
-        proc_exit(0);
+        walsnd_proc_exit(0);
     }
 
     if (!waiting_for_ping_response)
         WalSndKeepalive(true);
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 6692027..6e120f3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -64,7 +64,13 @@ char       *SyncRepStandbyNames;
 #define SyncStandbysDefined() \
     (SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
-static bool announce_next_takeover = true;
+typedef enum syncrep_state {
+    SSTATE_NONE,
+    SSTATE_POTENTIAL,
+    SSTATE_SYNC,
+} sync_state;
+
+static sync_state syncrep_state = SSTATE_NONE;
 
 SyncRepConfigData *SyncRepConfig;
 static int    SyncRepWaitMode = SYNC_REP_NO_WAIT;
@@ -416,22 +422,26 @@ SyncRepReleaseWaiters(void)
      * If we are managing the sync standby, though we weren't
      * prior to this, then announce we are now the sync standby.
      */
-    if (announce_next_takeover && am_sync)
+    if ((syncrep_state != SSTATE_POTENTIAL &&
+         !am_sync && MyWalSnd->sync_standby_priority > 0) ||
+        (syncrep_state != SSTATE_SYNC && am_sync))
     {
-        announce_next_takeover = false;
         ereport(LOG,
-                (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
-                        application_name, MyWalSnd->sync_standby_priority)));
+                (errmsg("standby \"%s\" is now a %ssynchronous standby with priority %u",
+                        application_name,
+                        am_sync ? "" : "potential ",
+                        MyWalSnd->sync_standby_priority)));
+        syncrep_state = (am_sync ? SSTATE_SYNC : SSTATE_POTENTIAL);
     }
 
     /*
     * If the number of sync standbys is less than requested or we aren't
     * managing the sync standby then just leave.
     */
-    if (!got_oldest || !am_sync)
+    if (!got_oldest || MyWalSnd->sync_standby_priority == 0)
    {
        LWLockRelease(SyncRepLock);
-       announce_next_takeover = !am_sync;
+       syncrep_state = SSTATE_NONE;
        return;
    }

Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 4 April 2016 at 09:28, Fujii Masao <masao.fujii@gmail.com> wrote:
 
Barring any objections, I'll commit this patch.

That sounds good.

May I have one more day to review this? Actually more like 3-4 hours.

I have no comments on an initial read, so I'm hopeful of having nothing at all to say on it.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Andres Freund
Date:
On 2016-04-04 10:35:34 +0100, Simon Riggs wrote:
> On 4 April 2016 at 09:28, Fujii Masao <masao.fujii@gmail.com> wrote:
> > Barring any objections, I'll commit this patch.

No objection here either, just one question: Has anybody thought about
the ability to extend this to do per-database syncrep? Logical decoding
works on a database level, and that can cause some problems with global
configuration.

> That sounds good.
> 
> May I have one more day to review this? Actually more like 3-4 hours.

> I have no comments on an initial read, so I'm hopeful of having nothing at
> all to say on it.

Simon, perhaps you could hold the above question in your mind while
looking through this?

Thanks,

Andres



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Mon, Apr 4, 2016 at 6:03 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello, thank you for testing.
>
> At Sat, 2 Apr 2016 14:20:55 +1300, Thomas Munro <thomas.munro@enterprisedb.com> wrote in
<CAEepm=2sdDL2hs3XbWb5FORegNHBObLJ-8C2=aaeG-riZTd2Rw@mail.gmail.com>
>> >>> Attached latest patch incorporate some review comments so far, and is
>> >>> rebased against current HEAD.
>> >>>
>> >>
>> >> Sorry I attached wrong patch.
>> >> Attached patch is correct patch.
>> >>
>> >> [mulit_sync_replication_v21.patch]
>> >
>> > Here are some TPS numbers from some quick tests I ran on a set of
>> > Amazon EC2 m3.large instances ("2 vCPU" virtual machines) configured
>> > as primary + 3 standbys, to try out different combinations of
>> > synchronous_commit levels and synchronous_standby_names numbers.  They
>> > were run for a short time only and these are of course systems with
>> > limited and perhaps uneven IO and CPU, but they still give some idea
>> > of the trends.  And reassuringly, the trends are travelling in the
>> > expected directions.
>> >
>> > All default settings except shared_buffers = 1GB, and the GUCs
>> > required for replication.
>> >
>> > pgbench postgres -j2 -c2 -N bench2 -T 600
>> >
>> >                1(*) 2(*) 3(*)
>> >                ==== ==== ====
>> > off          = 4056 4096 4092
>> > local        = 1323 1299 1312
>> > remote_write = 1130 1046  958
>> > on           =  860  744  701
>> > remote_apply =  785  725  604
>> >
>> > pgbench postgres -j16 -c16 -N bench2 -T 600
>> >
>> >                1(*) 2(*) 3(*)
>> >                ==== ==== ====
>> > off          = 3952 3943 3933
>> > local        = 2964 2984 3026
>> > remote_write = 2790 2724 2675
>> > on           = 2731 2627 2523
>> > remote_apply = 2627 2501 2432
>> >
>> > One thing I noticed is that there are LOG messages telling me when a
>> > standby becomes a synchronous standby, but nothing to tell me if a
>> > standby stops being a standby (ie because a higher priority one has
>> > taken its place in the quorum).  Would that be interesting?
>
> A walsender exits by proc_exit() for any operational
> termination so wrapping proc_exit() should work. (Attached file 1)
>
> For the setting "2(Sby1, Sby2, Sby3)", the master says that all
> of the standbys are sync-standbys and no message is emited on
> failure of Sby1, which should cause a promotion of Sby3.
>
>>  standby "Sby3" is now the synchronous standby with priority 3
>>  standby "Sby2" is now the synchronous standby with priority 2
>>  standby "Sby1" is now the synchronous standby with priority 1
> ..<Sby 1 failure>
>>  standby "Sby3" is now the synchronous standby with priority 3
>
> Sby3 becomes sync standby twice:p
>
> This was a behavior taken over from the single-sync-rep era but
> it should be confusing for the new sync-rep selection mechanism.
> The second attached diff makes this as the following.
>
>
>> 17:48:21.969 LOG:  standby "Sby3" is now a synchronous standby with priority 3
>> 17:48:23.087 LOG:  standby "Sby2" is now a synchronous standby with priority 2
>> 17:48:25.617 LOG:  standby "Sby1" is now a synchronous standby with priority 1
>> 17:48:31.990 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
>> 17:48:43.905 LOG:  standby "Sby3" is now a synchronous standby with priority 3
>> 17:49:10.262 LOG:  standby "Sby1" is now a synchronous standby with priority 1
>> 17:49:13.865 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
>
> Since this status check is taken place for every reply from
> stanbys, the message of downgrading to "potential" may be
> diferred or even fail to occur but it should be no problem.
>
> Applying the both of the above patches, the message would be like
> the following.
>
>> 17:54:08.367 LOG:  standby "Sby3" is now a synchronous standby with priority 3
>> 17:54:08.564 LOG:  standby "Sby1" is now a synchronous standby with priority 1
>> 17:54:08.565 LOG:  standby "Sby2" is now a synchronous standby with priority 2
>> 17:54:18.387 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
>> 17:54:28.887 LOG:  synchronous standby "Sby1" with priority 1 exited
>> 17:54:31.359 LOG:  standby "Sby3" is now a synchronous standby with priority 3
>> 17:54:39.008 LOG:  standby "Sby1" is now a synchronous standby with priority 1
>> 17:54:41.382 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
>
> Does this make sense?
>
> By the way, Sawada-san, you have changed the parentheses for the
> priority method from '[]' to '()'. And I mistakenly defined
> s_s_names as '2[Sby1, Sby2, Sby3]' and got the wrong behavior, that
> is, only Sby2 is registered as a mandatory synchronous standby.
>
> For this case, the three members of SyncRepConfig are '2[Sby1,',
> 'Sby2', 'Sby3]'. This syntax is valid for the current
> specification but will surely get a different meaning from future
> changes. We should refuse this known-to-be-wrong-in-future syntax
> starting now.
>

I have no objection to the current version of the patch.
But one optimization idea I came up with is to return false, before
the calculation of the lowest LSN from the sync standbys, if MyWalSnd
is not listed in sync_standbys.
For example in SyncRepGetOldestSyncRecPtr(),

==
sync_standbys = SyncRepGetSyncStandbys();

if (list_length(sync_standbys) < SyncRepConfig->num_sync)
{
    (snip)
}

/* Here if MyWalSnd is not listed in sync_standbys, quick exit. */
if (!list_member_int(sync_standbys, MyWalSnd->slotno))
    return false;

foreach(cell, sync_standbys)
{
    (snip)
}
==

> For this case, the three members of SyncRepConfig are '2[Sby1,',
> 'Sby2', 'Sby3]'. This syntax is valid for the current
> specification but will surely get a different meaning from future
> changes. We should refuse this known-to-be-wrong-in-future syntax
> starting now.

I couldn't get your point, but why would the above syntax's meaning
differ from its current meaning after a future change?
I thought that another method would use another kind of parentheses.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 4 April 2016 at 10:45, Andres Freund <andres@anarazel.de> wrote:

Simon, perhaps you could hold the above question in your mind while
looking through this?

Sure, np. 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Mon, Apr 4, 2016 at 1:58 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
>
> Thanks for updating the patch!
>
> I applied the following changes to the patch.
> Attached is the revised version of the patch.
>

1.
       {
{"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
gettext_noop("List of names of potential synchronous standbys."),
NULL,
GUC_LIST_INPUT
},
&SyncRepStandbyNames,
"",
check_synchronous_standby_names, NULL, NULL
},

Isn't it better to modify the description of synchronous_standby_names in guc.c based on new usage?

2.
pg_stat_get_wal_senders()
{
..
/*
! * Allocate and update the config data of synchronous replication,
! * and then get the currently active synchronous standbys.
  */
+ SyncRepUpdateConfig();
  LWLockAcquire(SyncRepLock, LW_SHARED);
! sync_standbys = SyncRepGetSyncStandbys();
  LWLockRelease(SyncRepLock);
..
}

Why is it important to update the config with the patch?  Earlier, too, any update to the config between calls wouldn't have been visible.


3.
      <title>Planning for High Availability</title>
  
     <para>
!     <varname>synchronous_standby_names</> specifies the number of
!     synchronous standbys that transaction commits made when

Is it better to say like: <varname>synchronous_standby_names</> specifies the number and names of


4.
+ /*
+  * Return the list of sync standbys, or NIL if no sync standby is connected.
+  *
+  * If there are multiple standbys with the same priority,
+  * the first one found is considered as higher priority.

Here the indentation of the second line can be improved.

5.
! /*
! * syncrep_yyparse sets the global syncrep_parse_result as side effect.
! * But this function is required to just check, so frees it
! * once parsing parameter.
! */
! SyncRepFreeConfig(syncrep_parse_result);

How about below change in comment?
/so frees it once parsing parameter/so frees it after parsing the parameter


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 4 April 2016 at 10:35, Simon Riggs <simon@2ndquadrant.com> wrote:
On 4 April 2016 at 09:28, Fujii Masao <masao.fujii@gmail.com> wrote:
 
Barring any objections, I'll commit this patch.

That sounds good.

May I have one more day to review this? Actually more like 3-4 hours.

What we have here is useful and elegant. I love the simplicity and backwards compatibility of the design. Very nice, chef.

I am in favour of committing something for 9.6, though I do have some objective comments

1. Header comments in syncrep.c need changes, not just additions.

2. We need tests to ensure that k >=1 and k<=N

3. There should be a WARNING if k == N to say that we don't yet provide a level to give Apply consistency. (I mean if we specify 2 (n1, n2) or 3(n1, n2, n3) etc

4. How does it work?
It's pretty strange, but that isn't documented anywhere. It took me a while to figure it out even though I know that code. My thought is it's a lot slower than before, which is a concern when we know by definition that k >=2 for the new feature. I was going to mention the fact that this code only needs to be executed by standbys mentioned in s_s_n, so we can avoid overhead and contention for async standbys (But Masahiko just mentioned that also).

5. Timing – k > 1 will be slower by definition and more complex to configure, yet there is no timing facility to measure the effect of this, even though we have a new timing facility in 9.6. It would be useful to have a track_syncrep option to keep track of typical response times from nodes.

6. Meaning of k (n1, n2, n3) with N servers

It's clearly documented that this means k replies IN SEQUENCE. I believe the typical meaning would be “any k out of N”, which would be faster than what we have, e.g.
   3 (n1, n2, n3) would release as soon as (n1, n2) or (n2, n3) or (n1, n3) acknowledge.

The “any k” option is not currently possible, but could be done fairly easily. The syntax should also be easily extensible.

I would call what we have now “first” semantics, and we could have both of these...

* first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
* any k (n1, n2, n3) – would release waiters as soon as we have the responses from k out of N standbys. “any k” would be faster, so is desirable for performance and resilience

>>> So I am suggesting we put an extra keyword in front of the “k”, to explain how the k responses should be gathered as an extension to the syntax. I also think implementing “any k” is actually fairly trivial and could be done for 9.6 (rather than just "first k").
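
As a sketch only (the "first"/"any" keywords are proposed syntax here, not something the current patch accepts), the settings might look like:

   # current patch: wait for the 2 highest-priority standbys in list order
   synchronous_standby_names = '2 (n1, n2, n3)'
   # proposed spelling of the same "first" semantics
   synchronous_standby_names = 'first 2 (n1, n2, n3)'
   # proposed: release as soon as any 2 of the 3 have acknowledged
   synchronous_standby_names = 'any 2 (n1, n2, n3)'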



Future thoughts that relate to syntax choices now, not for 9.6

Eventually I would want to be able to specify this…
   2 ( any (london1, london2), any (nyc1, nyc2))
meaning I want a response from at least 1 London server and at least one NYC server, but whichever one responds first doesn't matter.

And I also want to be able to specify node groups in there. So elsewhere we would specify London node group as (London1, London2) and NYC node group as (NYC1, NYC2) and then specify

any 2 (London, NYC, Tokyo).
 

Good work

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Amit Langote
Date:
On 2016/04/05 16:35, Simon Riggs wrote:
> 6. Meaning of k (n1, n2, n3) with N servers
> 
> It's clearly documented that this means k replies IN SEQUENCE. I believe
> the typical meaning would be “any k out of N”, which would be faster
> than what we have, e.g.
>    3 (n1, n2, n3) would release as soon as (n1, n2) or (n2, n3) or (n1, n3)
> acknowledge.
> 
> The “any k” option is not currently possible, but could be done fairly easily.
> The syntax should also be easily extensible.
> 
> I would call what we have now “first” semantics, and we could have both of
> these...
> 
> * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
> * any k (n1, n2, n3) – would release waiters as soon as we have the
> responses from k out of N standbys. “any k” would be faster, so is
> desirable for performance and resilience
> 
>>>> So I am suggesting we put an extra keyword in front of the “k”, to
> explain how the k responses should be gathered as an extension to the
> syntax. I also think implementing “any k” is actually fairly trivial and
> could be done for 9.6 (rather than just "first k").

+1 for 'first/any k (...)', with possibly only 'first' supported for now,
if the 'any' case is more involved than we would like to spend time on,
given the time considerations. IMHO, the extra keyword adds to clarity of
the syntax.

Thanks,
Amit





Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Mon, Apr 4, 2016 at 5:59 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
> At 2016-04-04 17:28:07 +0900, masao.fujii@gmail.com wrote:
>>
>> Barring any objections, I'll commit this patch.
>
> No objections, just a minor wording tweak:
>
> doc/src/sgml/config.sgml:
>
> "The synchronous standbys will be the standbys that their names appear
> early in this list" should be "The synchronous standbys will be those
> whose names appear earlier in this list".
>
> doc/src/sgml/high-availability.sgml:
>
> "The standbys that their names appear early in this list are given
> higher priority and will be considered as synchronous" should be "The
> standbys whose names appear earlier in the list are given higher
> priority and will be considered as synchronous".
>
> "The standbys that their names appear early in the list will be used as
> the synchronous standby" should be "The standbys whose names appear
> earlier in the list will be used as synchronous standbys".
>
> You may prefer to reword this in some other way, but the current "that
> their names appear" wording should be changed.

Thanks for the review! Will apply these comments to new patch.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Mon, Apr 4, 2016 at 10:00 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Apr 4, 2016 at 6:03 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> Hello, thank you for testing.
>>
>> At Sat, 2 Apr 2016 14:20:55 +1300, Thomas Munro <thomas.munro@enterprisedb.com> wrote in
<CAEepm=2sdDL2hs3XbWb5FORegNHBObLJ-8C2=aaeG-riZTd2Rw@mail.gmail.com>
>>> >>> Attached latest patch incorporate some review comments so far, and is
>>> >>> rebased against current HEAD.
>>> >>>
>>> >>
>>> >> Sorry I attached wrong patch.
>>> >> Attached patch is correct patch.
>>> >>
>>> >> [mulit_sync_replication_v21.patch]
>>> >
>>> > Here are some TPS numbers from some quick tests I ran on a set of
>>> > Amazon EC2 m3.large instances ("2 vCPU" virtual machines) configured
>>> > as primary + 3 standbys, to try out different combinations of
>>> > synchronous_commit levels and synchronous_standby_names numbers.  They
>>> > were run for a short time only and these are of course systems with
>>> > limited and perhaps uneven IO and CPU, but they still give some idea
>>> > of the trends.  And reassuringly, the trends are travelling in the
>>> > expected directions.
>>> >
>>> > All default settings except shared_buffers = 1GB, and the GUCs
>>> > required for replication.
>>> >
>>> > pgbench postgres -j2 -c2 -N bench2 -T 600
>>> >
>>> >                1(*) 2(*) 3(*)
>>> >                ==== ==== ====
>>> > off          = 4056 4096 4092
>>> > local        = 1323 1299 1312
>>> > remote_write = 1130 1046  958
>>> > on           =  860  744  701
>>> > remote_apply =  785  725  604
>>> >
>>> > pgbench postgres -j16 -c16 -N bench2 -T 600
>>> >
>>> >                1(*) 2(*) 3(*)
>>> >                ==== ==== ====
>>> > off          = 3952 3943 3933
>>> > local        = 2964 2984 3026
>>> > remote_write = 2790 2724 2675
>>> > on           = 2731 2627 2523
>>> > remote_apply = 2627 2501 2432
>>> >
>>> > One thing I noticed is that there are LOG messages telling me when a
>>> > standby becomes a synchronous standby, but nothing to tell me if a
>>> > standby stops being a standby (ie because a higher priority one has
>>> > taken its place in the quorum).  Would that be interesting?
>>
>> A walsender exits by proc_exit() for any operational
>> termination so wrapping proc_exit() should work. (Attached file 1)
>>
>> For the setting "2(Sby1, Sby2, Sby3)", the master says that all
>> of the standbys are sync-standbys and no message is emitted on
>> failure of Sby1, which should cause a promotion of Sby3.
>>
>>>  standby "Sby3" is now the synchronous standby with priority 3
>>>  standby "Sby2" is now the synchronous standby with priority 2
>>>  standby "Sby1" is now the synchronous standby with priority 1
>> ..<Sby 1 failure>
>>>  standby "Sby3" is now the synchronous standby with priority 3
>>
>> Sby3 becomes sync standby twice:p
>>
>> This was a behavior taken over from the single-sync-rep era but
>> it should be confusing for the new sync-rep selection mechanism.
>> The second attached diff changes this to the following.
>>
>>
>>> 17:48:21.969 LOG:  standby "Sby3" is now a synchronous standby with priority 3
>>> 17:48:23.087 LOG:  standby "Sby2" is now a synchronous standby with priority 2
>>> 17:48:25.617 LOG:  standby "Sby1" is now a synchronous standby with priority 1
>>> 17:48:31.990 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
>>> 17:48:43.905 LOG:  standby "Sby3" is now a synchronous standby with priority 3
>>> 17:49:10.262 LOG:  standby "Sby1" is now a synchronous standby with priority 1
>>> 17:49:13.865 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
>>
>> Since this status check takes place for every reply from
>> standbys, the message about downgrading to "potential" may be
>> deferred or even fail to occur, but that should be no problem.
>>
>> Applying both of the above patches, the messages would look like
>> the following.
>>
>>> 17:54:08.367 LOG:  standby "Sby3" is now a synchronous standby with priority 3
>>> 17:54:08.564 LOG:  standby "Sby1" is now a synchronous standby with priority 1
>>> 17:54:08.565 LOG:  standby "Sby2" is now a synchronous standby with priority 2
>>> 17:54:18.387 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
>>> 17:54:28.887 LOG:  synchronous standby "Sby1" with priority 1 exited
>>> 17:54:31.359 LOG:  standby "Sby3" is now a synchronous standby with priority 3
>>> 17:54:39.008 LOG:  standby "Sby1" is now a synchronous standby with priority 1
>>> 17:54:41.382 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
>>
>> Does this make sense?
>>
>> By the way, Sawada-san, you have changed the parentheses for the
>> priority method from '[]' to '()'. And I mistakenly defined
>> s_s_names as '2[Sby1, Sby2, Sby3]' and got the wrong behavior, that
>> is, only Sby2 is registered as a mandatory synchronous standby.
>>
>> For this case, the three members of SyncRepConfig are '2[Sby1,',
>> 'Sby2', 'Sby3]'. This syntax is valid for the current
>> specification but will surely get a different meaning from future
>> changes. We should refuse this known-to-be-wrong-in-future syntax
>> starting now.
>>
>
> I have no objection to the current version of the patch.
> But one optimization idea I came up with is to return false, before
> the calculation of the lowest LSN from the sync standbys, if MyWalSnd
> is not listed in sync_standbys.
> For example in SyncRepGetOldestSyncRecPtr(),
>
> ==
> sync_standbys = SyncRepGetSyncStandbys();
>
> if (list_length(sync_standbys) < SyncRepConfig->num_sync)
> {
>   (snip)
> }
>
> /* Here if MyWalSnd is not listed in sync_standbys, quick exit. */
> if (!list_member_int(sync_standbys, MyWalSnd->slotno))
>     return false;

list_member_int() performs the loop internally. So I'm not sure how much
adding an extra list_member_int() here can optimize this processing.
Another idea is to make SyncRepGetSyncStandbys() check whether I'm a sync
standby or not. With this idea, without adding an extra loop, we can exit
earlier in the case where I'm not a sync standby. Does this make sense?

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 5 April 2016 at 08:58, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
 
>>>> So I am suggesting we put an extra keyword in front of the “k”, to
> explain how the k responses should be gathered as an extension to the
> syntax. I also think implementing “any k” is actually fairly trivial and
> could be done for 9.6 (rather than just "first k").

+1 for 'first/any k (...)', with possibly only 'first' supported for now,
if the 'any' case is more involved than we would like to spend time on,
given the time considerations. IMHO, the extra keyword adds to clarity of
the syntax.

Further thoughts:

I said "any k" was faster, though what I mean is both faster and more robust. If you have network peaks from any of the k sync standbys then the user will wait longer. With "any k", if a network peak occurs, then another standby response will work just as well. So the performance of "any k" will be both faster, more consistent and less prone to misconfiguration.

I also didn't explain why I think it is easy to implement "any k".

All we need to do is change SyncRepGetOldestSyncRecPtr() so that it returns the k'th oldest pointer of any named standby. Then use that to wake up user backends. So the change requires only slightly modified logic in a very isolated part of the code, almost all of which would be code inserts to cope with the new option. The syntax and doc changes would take a couple of hours.
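
A minimal sketch of that change (hypothetical helper name; per-walsender spinlocks and error handling omitted; not the actual patch code):

static XLogRecPtr
SyncRepNthNewestFlushPtr(List *sync_standbys, int num_sync)
{
    /* top[0] is the newest flush LSN seen; top[num_sync - 1] the k'th newest */
    XLogRecPtr *top = palloc(num_sync * sizeof(XLogRecPtr));
    int         nfilled = 0;
    XLogRecPtr  result = InvalidXLogRecPtr;
    ListCell   *cell;

    foreach(cell, sync_standbys)
    {
        WalSnd     *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
        XLogRecPtr  flush = walsnd->flush;  /* spinlock omitted for brevity */
        int         i;

        /* skip values that cannot be among the num_sync newest */
        if (nfilled == num_sync && flush <= top[num_sync - 1])
            continue;

        /* insertion into a small descending array; cheap for k ~ 2-5 */
        i = (nfilled < num_sync) ? nfilled : num_sync - 1;
        for (; i > 0 && top[i - 1] < flush; i--)
            top[i] = top[i - 1];
        top[i] = flush;
        if (nfilled < num_sync)
            nfilled++;
    }

    /* num_sync standbys have acknowledged everything up to this LSN */
    if (nfilled == num_sync)
        result = top[num_sync - 1];
    pfree(top);
    return result;
}

Releasing waiters would then use this value the same way the current "first k" logic uses the oldest pointer among the chosen standbys.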

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Mon, Apr 4, 2016 at 6:45 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-04-04 10:35:34 +0100, Simon Riggs wrote:
>> On 4 April 2016 at 09:28, Fujii Masao <masao.fujii@gmail.com> wrote:
>> > Barring any objections, I'll commit this patch.
>
> No objection here either, just one question: Has anybody thought about
> the ability to extend this to do per-database syncrep?

Nope at least for me... You'd like to extend synchronous_standby_names
so that users can specify that per-database?

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 5 April 2016 at 10:10, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Apr 4, 2016 at 6:45 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-04-04 10:35:34 +0100, Simon Riggs wrote:
>> On 4 April 2016 at 09:28, Fujii Masao <masao.fujii@gmail.com> wrote:
>> > Barring any objections, I'll commit this patch.
>
> No objection here either, just one question: Has anybody thought about
> the ability to extend this to do per-database syncrep?

Nope at least for me... You'd like to extend synchronous_standby_names
so that users can specify that per-database?

As requested, I did consider whether we could have syntax for per-database settings.

ISTM that it is already possible to have one database in async mode and another in sync mode, using settings of synchronous_commit.

The easiest way to have per-database settings if you want more is to use different instances. Adding a dbname into the syntax would complicate it significantly, and even if we agreed on that, I don't think it would happen for 9.6. The lack of per-database settings is not a blocker for me.
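
For example (plain existing PostgreSQL behaviour, nothing added by this patch; "reporting" is a made-up database name):

   ALTER DATABASE reporting SET synchronous_commit = 'local';
   -- sessions in "reporting" then wait only for the local WAL flush,
   -- while sessions in other databases keep waiting for the sync standbys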

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Andres Freund
Date:
On 2016-04-05 10:13:50 +0100, Simon Riggs wrote:
> The lack of per-database settings is not a blocker for me.

Just to clarify: Neither is it for me.



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Apr 5, 2016 at 4:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Apr 4, 2016 at 1:58 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>>
>> Thanks for updating the patch!
>>
>> I applied the following changes to the patch.
>> Attached is the revised version of the patch.
>>
>
> 1.
>        {
> {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
> gettext_noop("List of names of potential synchronous standbys."),
> NULL,
> GUC_LIST_INPUT
> },
> &SyncRepStandbyNames,
> "",
> check_synchronous_standby_names, NULL, NULL
> },
>
> Isn't it better to modify the description of synchronous_standby_names in
> guc.c based on new usage?

What about "Number of synchronous standbys and list of names of
potential synchronous ones"? Better idea?

> 2.
> pg_stat_get_wal_senders()
> {
> ..
> /*
> ! * Allocate and update the config data of synchronous replication,
> ! * and then get the currently active synchronous standbys.
>   */
> + SyncRepUpdateConfig();
>   LWLockAcquire(SyncRepLock, LW_SHARED);
> ! sync_standbys = SyncRepGetSyncStandbys();
>   LWLockRelease(SyncRepLock);
> ..
> }
>
> Why is it important to update the config with the patch?  Earlier, too, any
> update to the config between calls wouldn't have been visible.

Because a backend has no chance to call SyncRepUpdateConfig() and
parse the latest value of s_s_names if SyncRepUpdateConfig() is not
called here. This means that pg_stat_replication may return the information
based on the old value of s_s_names.

> 3.
>       <title>Planning for High Availability</title>
>
>      <para>
> !     <varname>synchronous_standby_names</> specifies the number of
> !     synchronous standbys that transaction commits made when
>
> Is it better to say like: <varname>synchronous_standby_names</> specifies
> the number and names of

Precisely speaking, s_s_names specifies a list of names of potential sync
standbys, not sync ones.

> 4.
> + /*
> +  * Return the list of sync standbys, or NIL if no sync standby is
> connected.
> +  *
> +  * If there are multiple standbys with the same priority,
> +  * the first one found is considered as higher priority.
>
> Here the indentation of the second line can be improved.

What about "the first one found is selected first"? Or better idea?

>
> ! /*
> ! * syncrep_yyparse sets the global syncrep_parse_result as side effect.
> ! * But this function is required to just check, so frees it
> ! * once parsing parameter.
> ! */
> ! SyncRepFreeConfig(syncrep_parse_result);
>
> How about below change in comment?
> /so frees it once parsing parameter/so frees it after parsing the parameter

Will apply this to the patch.

Thanks for the review!

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Tue, 5 Apr 2016 18:08:20 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwG+DM=LCctG6q_Uxkgk17CbLKrHBwtPfUN3+Hu9QbvNuQ@mail.gmail.com>
> On Mon, Apr 4, 2016 at 10:00 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Mon, Apr 4, 2016 at 6:03 PM, Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >> Hello, thank you for testing.
> >>
> >> At Sat, 2 Apr 2016 14:20:55 +1300, Thomas Munro <thomas.munro@enterprisedb.com> wrote in
<CAEepm=2sdDL2hs3XbWb5FORegNHBObLJ-8C2=aaeG-riZTd2Rw@mail.gmail.com>
> >>> > One thing I noticed is that there are LOG messages telling me when a
> >>> > standby becomes a synchronous standby, but nothing to tell me if a
> >>> > standby stops being a standby (ie because a higher priority one has
> >>> > taken its place in the quorum).  Would that be interesting?
> >>
> >> A walsender exits by proc_exit() for any operational
> >> termination so wrapping proc_exit() should work. (Attached file 1)
> >>
> >> For the setting "2(Sby1, Sby2, Sby3)", the master says that all
>> >> of the standbys are sync-standbys and no message is emitted on
> >> failure of Sby1, which should cause a promotion of Sby3.
> >>
> >>>  standby "Sby3" is now the synchronous standby with priority 3
> >>>  standby "Sby2" is now the synchronous standby with priority 2
> >>>  standby "Sby1" is now the synchronous standby with priority 1
> >> ..<Sby 1 failure>
> >>>  standby "Sby3" is now the synchronous standby with priority 3
> >>
> >> Sby3 becomes sync standby twice:p
> >>
> >> This was a behavior taken over from the single-sync-rep era but
> >> it should be confusing for the new sync-rep selection mechanism.
>> >> The second attached diff changes this to the following.
...
> >> Applying both of the above patches, the messages would look like
> >> the following.
> >>
> >>> 17:54:08.367 LOG:  standby "Sby3" is now a synchronous standby with priority 3
> >>> 17:54:08.564 LOG:  standby "Sby1" is now a synchronous standby with priority 1
> >>> 17:54:08.565 LOG:  standby "Sby2" is now a synchronous standby with priority 2
> >>> 17:54:18.387 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
> >>> 17:54:28.887 LOG:  synchronous standby "Sby1" with priority 1 exited
> >>> 17:54:31.359 LOG:  standby "Sby3" is now a synchronous standby with priority 3
> >>> 17:54:39.008 LOG:  standby "Sby1" is now a synchronous standby with priority 1
> >>> 17:54:41.382 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
> >>
> >> Does this make sense?
> >>
> >> By the way, Sawada-san, you have changed the parentheses for the
>> >> priority method from '[]' to '()'. And I mistakenly defined
>> >> s_s_names as '2[Sby1, Sby2, Sby3]' and got the wrong behavior, that
>> >> is, only Sby2 is registered as a mandatory synchronous standby.
>> >>
>> >> For this case, the three members of SyncRepConfig are '2[Sby1,',
>> >> 'Sby2', 'Sby3]'. This syntax is valid for the current
>> >> specification but will surely get a different meaning from future
>> >> changes. We should refuse this known-to-be-wrong-in-future syntax
>> >> starting now.
> >>
> >
> > I have no objection to the current version of the patch.
> > But one optimization idea I came up with is to return false, before
> > the calculation of the lowest LSN from the sync standbys, if MyWalSnd
> > is not listed in sync_standbys.
> > For example in SyncRepGetOldestSyncRecPtr(),
> >
> > ==
> > sync_standbys = SyncRepGetSyncStandbys();
> >
> > if (list_length(sync_standbys) < SyncRepConfig->num_sync)
> > {
> >   (snip)
> > }
> >
> > /* Here if MyWalSnd is not listed in sync_standbys, quick exit. */
> > if (!list_member_int(sync_standbys, MyWalSnd->slotno))
> >     return false;
> 
> list_member_int() performs the loop internally. So I'm not sure how much
> adding an extra list_member_int() here can optimize this processing.
> Another idea is to make SyncRepGetSyncStandbys() check whether I'm a sync
> standby or not. With this idea, without adding an extra loop, we can exit
> earlier in the case where I'm not a sync standby. Does this make sense?

The list_member_int() is also performed in the "(snip)" part. So
SyncRepGetSyncStandbys() returning am_sync seems to make sense.

sync_standbys = SyncRepGetSyncStandbys(am_sync);

/*
 *  Quick exit if I am not synchronous or there's not
 *  enough synchronous standbys
 */
if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync)
{
    list_free(sync_standbys);
    return false;
}


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Apr 5, 2016 at 4:35 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 4 April 2016 at 10:35, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> On 4 April 2016 at 09:28, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>>>
>>> Barring any objections, I'll commit this patch.
>>
>>
>> That sounds good.
>>
>> May I have one more day to review this? Actually more like 3-4 hours.
>
>
> What we have here is useful and elegant. I love the simplicity and backwards
> compatibility of the design. Very nice, chef.
>
> I am in favour of committing something for 9.6, though I do have some
> objective comments

Thanks for the review!

> 1. Header comments in syncrep.c need changes, not just additions.

Okay, will consider this later. And I'd appreciate it if you could
elaborate on what changes are necessary, specifically.

> 2. We need tests to ensure that k >=1 and k<=N

The changes to the replication test framework were included in the patch before,
but I excluded them from the patch because I'd like to commit the core part of
the patch first. Will review the test part later.

>
> 3. There should be a WARNING if k == N to say that we don't yet provide a
> level to give Apply consistency. (I mean if we specify 2 (n1, n2) or 3(n1,
> n2, n3) etc

Sorry, I failed to get your point. Could you tell me what Apply consistency
is, and why we cannot provide it when k = N?

> 4. How does it work?
> It's pretty strange, but that isn't documented anywhere. It took me a while
> to figure it out even though I know that code. My thought is its a lot
> slower than before, which is a concern when we know by definition that k >=2
> for the new feature. I was going to mention the fact that this code only
> needs to be executed by standbys mentioned in s_s_n, so we can avoid
> overhead and contention for async standbys (But Masahiko just mentioned that
> also).

Unless I'm missing something, the patch already avoids the overhead
of async standbys. Please see the top of SyncRepReleaseWaiters().
Since async standbys exit at the beginning of SyncRepReleaseWaiters(),
they don't need to perform any operations that the patch adds
(e.g., find out which standbys are synchronous).
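
Roughly, the early exit meant here is the guard at the top of SyncRepReleaseWaiters() (a condensed sketch, not the exact committed lines):

    /* async walsenders fall out here and do none of the new work */
    if (MyWalSnd->sync_standby_priority == 0 ||
        MyWalSnd->state < WALSNDSTATE_STREAMING ||
        XLogRecPtrIsInvalid(MyWalSnd->flush))
        return;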

> 5. Timing – k > 1 will be slower by definition and more complex to
> configure, yet there is no timing facility to measure the effect of this,
> even though we have a new timing facility in 9.6. It would be useful to have
> a track_syncrep option to keep track of typical response times from nodes.

Maybe it's useful. But it seems a completely new feature, so I'm not sure
if we have enough time to push it into 9.6. Probably it's for 9.7.

> 6. Meaning of k (n1, n2, n3) with N servers
>
> It's clearly documented that this means k replies IN SEQUENCE. I believe the
> typical meaning would be “any k out of N”, which would be faster than
> what we have, e.g.
>    3 (n1, n2, n3) would release as soon as (n1, n2) or (n2, n3) or (n1, n3)
> acknowledge.
>
> The “any k” option is not currently possible, but could be done fairly easily.
> The syntax should also be easily extensible.
>
> I would call what we have now “first” semantics, and we could have both of
> these...
>
> * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
> * any k (n1, n2, n3) – would release waiters as soon as we have the
> responses from k out of N standbys. “any k” would be faster, so is desirable
> for performance and resilience

We discussed the syntax for a very long time, so restarting the discussion
and keeping the patch uncommitted is not good. We might fail to commit
anything about N-sync rep in 9.6. So let's commit the current patch first
and restart the discussion later.

Regards,

--
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Mon, 4 Apr 2016 22:00:24 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoDoq1ubY4KkKhrA9jzaVXekwAT7gV5pQJbS+wj98b9-3A@mail.gmail.com>
> > For this case, the three members of SyncRepConfig are '2[Sby1,',
> > 'Sby2', 'Sby3]'. This syntax is valid for the current
> > specification but will surely get a different meaning from future
> > changes. We should refuse this known-to-be-wrong-in-future syntax
> > starting now.
> 
> I couldn't get your point, but why would the above syntax's meaning
> differ from its current meaning after a future change?
> I thought that another method would use another kind of parentheses.

If the 'another kind of parentheses' is a pair of brackets, an
application_name 'tokyo[A]', for example, is currently allowed to
occur unquoted in the list but will become disallowed by the
syntax change.
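
For illustration (a hedged sketch; the bracketed name is made up):

   # accepted unquoted today, because '[' has no meaning in the grammar
   synchronous_standby_names = '2 (tokyo[A], sby2)'
   # if '[]' later becomes syntax, such a name would need double quotes
   synchronous_standby_names = '2 ("tokyo[A]", sby2)'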


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Apr 5, 2016 at 6:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 5 April 2016 at 08:58, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
> wrote:
>
>>
>> >>>> So I am suggesting we put an extra keyword in front of the “k”, to
>> > explain how the k responses should be gathered as an extension to the
>> > syntax. I also think implementing “any k” is actually fairly trivial and
>> > could be done for 9.6 (rather than just "first k").
>>
>> +1 for 'first/any k (...)', with possibly only 'first' supported for now,
>> if the 'any' case is more involved than we would like to spend time on,
>> given the time considerations. IMHO, the extra keyword adds to clarity of
>> the syntax.
>
>
> Further thoughts:
>
> I said "any k" was faster, though what I mean is both faster and more
> robust. If you have network peaks from any of the k sync standbys then the
> user will wait longer. With "any k", if a network peak occurs, then another
> standby response will work just as well. So the performance of "any k" will
> be both faster, more consistent and less prone to misconfiguration.
>
> I also didn't explain why I think it is easy to implement "any k".
>
> All we need to do is change SyncRepGetOldestSyncRecPtr() so that it returns
> the k'th oldest pointer of any named standby.

s/oldest/newest ?

> Then use that to wake up user
> backends. So the change requires only slightly modified logic in a very
> isolated part of the code, almost all of which would be code inserts to cope
> with the new option.

Yes. Probably we need to spend some time finding which algorithm is best
for searching for the k'th newest pointer.

> The syntax and doc changes would take a couple of
> hours.

Yes, the updates of documentation would need more time.

Regards,

--
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 5 April 2016 at 11:18, Fujii Masao <masao.fujii@gmail.com> wrote:
 
> 1. Header comments in syncrep.c need changes, not just additions.

Okay, will consider this later. And I'd appreciate it if you could
elaborate on what changes are necessary, specifically.

Some of the old header comments are now wrong.
 
> 2. We need tests to ensure that k >=1 and k<=N

The changes to the replication test framework were included in the patch before,
but I excluded them from the patch because I'd like to commit the core part of
the patch first. Will review the test part later.

I meant tests of setting the parameters, not tests of the feature itself.
 
>
> 3. There should be a WARNING if k == N to say that we don't yet provide a
> level to give Apply consistency. (I mean if we specify 2 (n1, n2) or 3(n1,
> n2, n3) etc

Sorry, I failed to get your point. Could you tell me what Apply consistency
is, and why we cannot provide it when k = N?

> 4. How does it work?
> It's pretty strange, but that isn't documented anywhere. It took me a while
>> > to figure it out even though I know that code. My thought is it's a lot
> slower than before, which is a concern when we know by definition that k >=2
> for the new feature. I was going to mention the fact that this code only
> needs to be executed by standbys mentioned in s_s_n, so we can avoid
> overhead and contention for async standbys (But Masahiko just mentioned that
> also).

Unless I'm missing something, the patch already avoids the overhead
of async standbys. Please see the top of SyncRepReleaseWaiters().
Since async standbys exit at the beginning of SyncRepReleaseWaiters(),
they don't need to perform any operations that the patch adds
(e.g., find out which standbys are synchronous).

I was thinking about the overhead of scanning through the full list of WALSenders for each message, when it is a sync standby. 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Apr 5, 2016 at 7:17 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Tue, 5 Apr 2016 18:08:20 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwG+DM=LCctG6q_Uxkgk17CbLKrHBwtPfUN3+Hu9QbvNuQ@mail.gmail.com>
>> On Mon, Apr 4, 2016 at 10:00 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> > On Mon, Apr 4, 2016 at 6:03 PM, Kyotaro HORIGUCHI
>> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> >> Hello, thank you for testing.
>> >>
>> >> At Sat, 2 Apr 2016 14:20:55 +1300, Thomas Munro <thomas.munro@enterprisedb.com> wrote in
<CAEepm=2sdDL2hs3XbWb5FORegNHBObLJ-8C2=aaeG-riZTd2Rw@mail.gmail.com>
>> >>> > One thing I noticed is that there are LOG messages telling me when a
>> >>> > standby becomes a synchronous standby, but nothing to tell me if a
>> >>> > standby stops being a standby (ie because a higher priority one has
>> >>> > taken its place in the quorum).  Would that be interesting?
>> >>
>> >> A walsender exits by proc_exit() for any operational
>> >> termination so wrapping proc_exit() should work. (Attached file 1)
>> >>
>> >> For the setting "2(Sby1, Sby2, Sby3)", the master says that all
>> >> of the standbys are sync-standbys and no message is emitted on
>> >> failure of Sby1, which should cause a promotion of Sby3.
>> >>
>> >>>  standby "Sby3" is now the synchronous standby with priority 3
>> >>>  standby "Sby2" is now the synchronous standby with priority 2
>> >>>  standby "Sby1" is now the synchronous standby with priority 1
>> >> ..<Sby 1 failure>
>> >>>  standby "Sby3" is now the synchronous standby with priority 3
>> >>
>> >> Sby3 becomes sync standby twice:p
>> >>
>> >> This was a behavior taken over from the single-sync-rep era but
>> >> it should be confusing for the new sync-rep selection mechanism.
>> >> The second attached diff changes this to the following.
> ...
>> >> Applying both of the above patches, the messages would look like
>> >> the following.
>> >>
>> >>> 17:54:08.367 LOG:  standby "Sby3" is now a synchronous standby with priority 3
>> >>> 17:54:08.564 LOG:  standby "Sby1" is now a synchronous standby with priority 1
>> >>> 17:54:08.565 LOG:  standby "Sby2" is now a synchronous standby with priority 2
>> >>> 17:54:18.387 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
>> >>> 17:54:28.887 LOG:  synchronous standby "Sby1" with priority 1 exited
>> >>> 17:54:31.359 LOG:  standby "Sby3" is now a synchronous standby with priority 3
>> >>> 17:54:39.008 LOG:  standby "Sby1" is now a synchronous standby with priority 1
>> >>> 17:54:41.382 LOG:  standby "Sby3" is now a potential synchronous standby with priority 3
>> >>
>> >> Does this make sense?
>> >>
>> >> By the way, Sawada-san, you have changed the parentheses for the
>> >> priority method from '[]' to '()'. And I mistakenly defined
>> >> s_s_names as '2[Sby1, Sby2, Sby3]' and got the wrong behavior, that
>> >> is, only Sby2 is registered as a mandatory synchronous standby.
>> >>
>> >> For this case, the three members of SyncRepConfig are '2[Sby1,',
>> >> 'Sby2', 'Sby3]'. This syntax is valid for the current
>> >> specification but will surely get a different meaning from future
>> >> changes. We should refuse this known-to-be-wrong-in-future syntax
>> >> starting now.
>> >>
>> >
>> > I have no objection to the current version of the patch.
>> > But one optimization idea I came up with is to return false, before
>> > the calculation of the lowest LSN from the sync standbys, if MyWalSnd
>> > is not listed in sync_standbys.
>> > For example in SyncRepGetOldestSyncRecPtr(),
>> >
>> > ==
>> > sync_standbys = SyncRepGetSyncStandbys();
>> >
>> > if (list_length(sync_standbys) < SyncRepConfig->num_sync)
>> > {
>> >   (snip)
>> > }
>> >
>> > /* Here if MyWalSnd is not listed in sync_standbys, quick exit. */
>> > if (!list_member_int(sync_standbys, MyWalSnd->slotno))
>> >     return false;
>>
>> list_member_int() performs the loop internally. So I'm not sure how much
>> adding an extra list_member_int() here can optimize this processing.
>> Another idea is to make SyncRepGetSyncStandbys() check whether I'm a sync
>> standby or not. With this idea, without adding an extra loop, we can exit
>> earlier in the case where I'm not a sync standby. Does this make sense?
>
> The list_member_int() is also performed in the "(snip)" part. So
> SyncRepGetSyncStandbys() returning am_sync seems to make sense.
>
> sync_standbys = SyncRepGetSyncStandbys(am_sync);
>
> /*
>  *  Quick exit if I am not synchronous or there's not
>  *  enough synchronous standbys
>  */
> if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync)
> {
>   list_free(sync_standbys);
>   return false;
> }

Thanks for the comment! I changed SyncRepGetSyncStandbys() so that
it checks whether we're managing a sync standby or not.
Attached is the updated version of the patch. I also applied several
review comments to the patch.

Regards,

--
Fujii Masao

Attachment

Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 5 April 2016 at 11:23, Fujii Masao <masao.fujii@gmail.com> wrote:
On Tue, Apr 5, 2016 at 6:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 5 April 2016 at 08:58, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
> wrote:
>
>>
>> >>>> So I am suggesting we put an extra keyword in front of the “k”, to
>> > explain how the k responses should be gathered as an extension to the
>> > syntax. I also think implementing “any k” is actually fairly trivial and
>> > could be done for 9.6 (rather than just "first k").
>>
>> +1 for 'first/any k (...)', with possibly only 'first' supported for now,
>> if the 'any' case is more involved than we would like to spend time on,
>> given the time considerations. IMHO, the extra keyword adds to clarity of
>> the syntax.
>
>
> Further thoughts:
>
> I said "any k" was faster, though what I mean is both faster and more
> robust. If you have network peaks from any of the k sync standbys then the
> user will wait longer. With "any k", if a network peak occurs, then another
> standby response will work just as well. So the performance of "any k" will
> be both faster, more consistent and less prone to misconfiguration.
>
> I also didn't explain why I think it is easy to implement "any k".
>
> All we need to do is change SyncRepGetOldestSyncRecPtr() so that it returns
> the k'th oldest pointer of any named standby.

s/oldest/newest ?

Sure
 
> Then use that to wake up user
> backends. So the change requires only slightly modified logic in a very
> isolated part of the code, almost all of which would be code inserts to cope
> with the new option.

Yes. Probably we need to spend some time finding which algorithm is best
for searching for the k'th newest pointer.

I think we would all agree an insertion sort would be the fastest for k ~ 2-5, not much discussion there.

We do already use that in this section of code, namely SHMQueue. 
 
> The syntax and doc changes would take a couple of
> hours.

Yes, the updates of documentation would need more time.

I can help, if you wish that.

"any k" is in my mind what people would be expecting us to deliver with this feature, which is why I suggest it now, especially since it is a small additional item.

Please don't see these comments as blocking your progress to commit.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Apr 5, 2016 at 8:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 5 April 2016 at 11:18, Fujii Masao <masao.fujii@gmail.com> wrote:
>
>>
>> > 1. Header comments in syncrep.c need changes, not just additions.
>>
>> Okay, will consider this later. And I'd appreciate it if you could
>> elaborate on what changes are necessary, specifically.
>
>
> Some of the old header comments are now wrong.

Okay, will check.

>> > 2. We need tests to ensure that k >=1 and k<=N
>>
>> The changes to the replication test framework were included in the patch
>> before, but I excluded them from the patch because I'd like to commit the
>> core part of the patch first. Will review the test part later.
>
>
> I meant tests of setting the parameters, not tests of the feature itself.

k<=0 causes an error while parsing s_s_names in the current patch.

Regarding the test of k<=N, you mean that an error should be emitted
when k is larger than the number of standby names in the list?
Multiple standbys with the same name may connect to the master.
In this case, users might want to specify k>N. So k>N seems not an
invalid setting.
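
For illustration (hypothetical names): two physical standbys may both connect with application_name = 'sbyA', so with

   synchronous_standby_names = '2 (sbyA)'

k = 2 can still be satisfied even though only one name (N = 1) is listed.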

>> > 3. There should be a WARNING if k == N to say that we don't yet provide
>> > a
>> > level to give Apply consistency. (I mean if we specify 2 (n1, n2) or
>> > 3(n1,
>> > n2, n3) etc
>>
>> Sorry, I failed to get your point. Could you tell me what Apply consistency
>> is, and why we cannot provide it when k = N?
>>
>> > 4. How does it work?
>> > It's pretty strange, but that isn't documented anywhere. It took me a
>> > while
>> > to figure it out even though I know that code. My thought is it's a lot
>> > slower than before, which is a concern when we know by definition that k
>> > >=2
>> > for the new feature. I was going to mention the fact that this code only
>> > needs to be executed by standbys mentioned in s_s_n, so we can avoid
>> > overhead and contention for async standbys (But Masahiko just mentioned
>> > that
>> > also).
>>
>> Unless I'm missing something, the patch already avoids the overhead
>> of async standbys. Please see the top of SyncRepReleaseWaiters().
>> Since async standbys exit at the beginning of SyncRepReleaseWaiters(),
>> they don't need to perform any operations that the patch adds
>> (e.g., find out which standbys are synchronous).
>
>
> I was thinking about the overhead of scanning through the full list of
> WALSenders for each message, when it is a sync standby.

This is true even in the current release and before.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 5 April 2016 at 12:26, Fujii Masao <masao.fujii@gmail.com> wrote:
 
Multiple standbys with the same name may connect to the master.
In this case, users might want to specify k>N. So k>N seems not an invalid
setting.

Confusing as that is, it is already the case; k > N could make sense. ;-(

However, in most cases, k > N would not make sense and we should issue a WARNING. 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Tue, Apr 5, 2016 at 3:15 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Tue, Apr 5, 2016 at 4:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Mon, Apr 4, 2016 at 1:58 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> >>
> >>
> >> Thanks for updating the patch!
> >>
> >> I applied the following changes to the patch.
> >> Attached is the revised version of the patch.
> >>
> >
> > 1.
> >        {
> > {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
> > gettext_noop("List of names of potential synchronous standbys."),
> > NULL,
> > GUC_LIST_INPUT
> > },
> > &SyncRepStandbyNames,
> > "",
> > check_synchronous_standby_names, NULL, NULL
> > },
> >
> > Isn't it better to modify the description of synchronous_standby_names in
> > guc.c based on new usage?
>
> What about "Number of synchronous standbys and list of names of
> potential synchronous ones"? Better idea?
>

Looks good. 

>
> > 2.
> > pg_stat_get_wal_senders()
> > {
> > ..
> > /*
> > ! * Allocate and update the config data of synchronous replication,
> > ! * and then get the currently active synchronous standbys.
> >   */
> > + SyncRepUpdateConfig();
> >   LWLockAcquire(SyncRepLock, LW_SHARED);
> > ! sync_standbys = SyncRepGetSyncStandbys();
> >   LWLockRelease(SyncRepLock);
> > ..
> > }
> >
> > Why is it important to update the config with the patch?  Earlier, too, any
> > update to the config between calls wouldn't have been visible.
>
> Because a backend has no chance to call SyncRepUpdateConfig() and
> parse the latest value of s_s_names if SyncRepUpdateConfig() is not
> called here. This means that pg_stat_replication may return the information
> based on the old value of s_s_names.
>

That's right, but without this patch too, won't pg_stat_replication show old information? If not, why?

> > 3.
> >       <title>Planning for High Availability</title>
> >
> >      <para>
> > !     <varname>synchronous_standby_names</> specifies the number of
> > !     synchronous standbys that transaction commits made when
> >
> > Is it better to say like: <varname>synchronous_standby_names</> specifies
> > the number and names of
>
> Precisely s_s_names specifies a list of names of potential sync standbys
> not sync ones.
>

Okay, but you don't seem to have updated this in your latest patch.

> > 4.
> > + /*
> > +  * Return the list of sync standbys, or NIL if no sync standby is
> > connected.
> > +  *
> > +  * If there are multiple standbys with the same priority,
> > +  * the first one found is considered as higher priority.
> >
> > Here the indentation of the second line can be improved.
>
> What about "the first one found is selected first"? Or better idea?
>

What I was complaining about is that a few words from the second line could be moved to the previous line, but maybe pgindent will take care of that, so no need to worry.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Robert Haas
Date:
On Mon, Apr 4, 2016 at 4:28 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> + ereport(LOG,
>>> + (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
>>> + application_name, MyWalSnd->sync_standby_priority)));
>>>
>>> s/ the / a /
>
> I have no objection to this change itself. But we have used this message
> in 9.5 or before, so if we apply this change, probably we need
> back-patching.

"the" implies that there can be only one synchronous standby at that
priority, while "a" implies that there could be more than one.  So the
situation might be different with this patch than previously.  (I
haven't read the patch so I don't know whether this is actually true,
but it might be what Thomas was going for.)

Also, I'd like to associate myself with the general happiness about
the prospect of having this feature in 9.6 (but without specifically
endorsing the code, since I have not read it).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Tue, Apr 5, 2016 at 7:23 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Mon, 4 Apr 2016 22:00:24 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoDoq1ubY4KkKhrA9jzaVXekwAT7gV5pQJbS+wj98b9-3A@mail.gmail.com>
>> > For this case, the three members of SyncRepConfig are '2[Sby1,',
>> > 'Sby2', 'Sby3]'. This syntax is valid for the current
>> > specification but will surely get a different meaning from future
>> > changes. We should refuse this known-to-be-wrong-in-future syntax
>> > starting now.
>>
>> I couldn't get your point, but why would the above syntax's meaning
>> differ from its current meaning after a future change?
>> I thought that another method would use another kind of parentheses.
>
> If the 'another kind of parentheses' is a pair of brackets, an
> application_name 'tokyo[A]', for example, is currently allowed to
> occur unquoted in the list but will become disallowed by the
> syntax change.
>
>

Thank you for explaining.
I understood, but since consensus on the future syntax has yet to be
reached, I thought that it would be difficult to refuse a particular
kind of parentheses for now.

> > list_member_int() performs the loop internally. So I'm not sure how much
> > adding an extra list_member_int() here can optimize this processing.
> > Another idea is to make SyncRepGetSyncStandbys() check whether I'm a sync
> > standby or not. With this idea, without adding an extra loop, we can exit
> > earlier in the case where I'm not a sync standby. Does this make sense?
> The list_member_int() is also performed in the "(snip)" part. So
> SyncRepGetSyncStandbys() returning am_sync seems to make sense.
>
> sync_standbys = SyncRepGetSyncStandbys(am_sync);
>
> /*
> *  Quick exit if I am not synchronous or there's not
> *  enough synchronous standbys
> */
> if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync)
> {
>  list_free(sync_standbys);
>  return false;
> }

I meant that it can at least skip acquiring the spinlock, so it will
optimize that logic.
But anyway I agree with making SyncRepGetSyncStandbys() return the am_sync variable.

-- 
Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Tue, 5 Apr 2016 20:17:21 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwE8_F79BUpC5TmJ7aazXU=Uju0VznFCCKDK57-wNpHV-g@mail.gmail.com>
> >> list_member_int() performs the loop internally. So I'm not sure how much
> >> adding an extra list_member_int() here can optimize this processing.
> >> Another idea is to make SyncRepGetSyncStandbys() check whether I'm a sync
> >> standby or not. With this idea, without adding an extra loop, we can exit
> >> earlier in the case where I'm not a sync standby. Does this make sense?
> >
> > The list_member_int() is also performed in the "(snip)" part. So
> > SyncRepGetSyncStandbys() returning am_sync seems to make sense.
> >
> > sync_standbys = SyncRepGetSyncStandbys(am_sync);
> >
> > /*
> >  *  Quick exit if I am not synchronous or there's not
> >  *  enough synchronous standbys
> >  */
> > if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync)
> > {
> >   list_free(sync_standbys);
> >   return false;
> > }
> 
> Thanks for the comment! I changed SyncRepGetSyncStandbys() so that
> it checks whether we're managing a sync standby or not.
> Attached is the updated version of the patch. I also applied several
> review comments to the patch.

It still does list_member_int(), but that can be gotten rid of as in the
attached patch.

regards,
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 9b2137a..6998bb8 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -590,6 +590,10 @@ SyncRepGetSyncStandbys(bool *am_sync)
         if (XLogRecPtrIsInvalid(walsnd->flush))
             continue;
 
+        /* Notify myself as 'synchronized' if I am */
+        if (am_sync != NULL && walsnd == MyWalSnd)
+            *am_sync = true;
+
         /*
          * If the priority is equal to 1, consider this standby as sync
          * and append it to the result. Otherwise append this standby
@@ -598,8 +602,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
         if (this_priority == 1)
         {
             result = lappend_int(result, i);
-            if (am_sync != NULL && walsnd == MyWalSnd)
-                *am_sync = true;
             if (list_length(result) == SyncRepConfig->num_sync)
             {
                 list_free(pending);
@@ -630,9 +632,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
     {
         bool        needfree = (result != NIL && pending != NIL);
 
-        if (am_sync != NULL && !(*am_sync))
-            *am_sync = list_member_int(pending, MyWalSnd->slotno);
-
         result = list_concat(result, pending);
         if (needfree)
             pfree(pending);
@@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync)
     }
 
     /*
+     * The pending list contains the potentially-synchronized standbys,
+     * and this walsender may be one of them. So reset am_sync once here.
+     */
+    if (am_sync != NULL)
+        *am_sync = false;
+
+    /*
      * Find the sync standbys from the pending list.
      */
     priority = next_highest_priority;

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Apr 5, 2016 at 8:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 5 April 2016 at 12:26, Fujii Masao <masao.fujii@gmail.com> wrote:
>
>>
>> Multiple standbys with the same name may connect to the master.
>> In this case, users might want to specifiy k<=N. So k<=N seems not invalid
>> setting.
>
>
> Confusing as that is, it is already the case; k > N could make sense. ;-(
>
> However, in most cases, k > N would not make sense and we should issue a
> WARNING.

Somebody (maybe Horiguchi-san and Sawada-san) commented on this upthread,
and the code for that test was included in the old patch (but I excluded it).
Now the majority seems to prefer adding that test, so I just revived and
revised that test code.

Attached is the updated version of the patch. I also addressed Amit's
and Robert's comments.

Regards,

--
Fujii Masao

Attachment

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Apr 5, 2016 at 11:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Apr 5, 2016 at 3:15 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>> On Tue, Apr 5, 2016 at 4:31 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > On Mon, Apr 4, 2016 at 1:58 PM, Fujii Masao <masao.fujii@gmail.com>
>> > wrote:
>> >>
>> >>
>> >> Thanks for updating the patch!
>> >>
>> >> I applied the following changes to the patch.
>> >> Attached is the revised version of the patch.
>> >>
>> >
>> > 1.
>> >        {
>> > {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
>> > gettext_noop("List of names of potential synchronous standbys."),
>> > NULL,
>> > GUC_LIST_INPUT
>> > },
>> > &SyncRepStandbyNames,
>> > "",
>> > check_synchronous_standby_names, NULL, NULL
>> > },
>> >
>> > Isn't it better to modify the description of synchronous_standby_names
>> > in
>> > guc.c based on new usage?
>>
>> What about "Number of synchronous standbys and list of names of
>> potential synchronous ones"? Better idea?
>>
>
> Looks good.
>
>>
>> > 2.
>> > pg_stat_get_wal_senders()
>> > {
>> > ..
>> > /*
>> > ! * Allocate and update the config data of synchronous replication,
>> > ! * and then get the currently active synchronous standbys.
>> >   */
>> > + SyncRepUpdateConfig();
>> >   LWLockAcquire(SyncRepLock, LW_SHARED);
>> > ! sync_standbys = SyncRepGetSyncStandbys();
>> >   LWLockRelease(SyncRepLock);
>> > ..
>> > }
>> >
>> > Why is it important to update the config with patch?  Earlier also any
>> > update to config between calls wouldn't have been visible.
>>
>> Because a backend has no chance to call SyncRepUpdateConfig() and
>> parse the latest value of s_s_names if SyncRepUpdateConfig() is not
>> called here. This means that pg_stat_replication may return the
>> information
>> based on the old value of s_s_names.
>>
>
> Thats right, but without this patch also won't pg_stat_replication can show
> old information? If no, why so?

Without the patch, when s_s_names is changed and SIGHUP is sent,
a backend calls ProcessConfigFile(), parses the configuration file and
sets the global variable SyncRepStandbyNames to the latest value of
s_s_names. When pg_stat_replication is accessed, a backend calculates
which standbys are synchronous based on that latest value in SyncRepStandbyNames,
and then displays the information of sync replication.

With the patch, basically the same steps are executed when s_s_names is
changed. But the difference is that, with the patch, SyncRepUpdateConfig()
must be called after ProcessConfigFile() and before the calculation of
sync standbys. So I just added the call of SyncRepUpdateConfig() to
pg_stat_get_wal_senders().

BTW, we can move SyncRepUpdateConfig() out of pg_stat_get_wal_senders()
to just after ProcessConfigFile(), so that every backend always parses
the value of s_s_names when the setting is changed.
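
For illustration, that move would look something like this in each
process's SIGHUP handling (a sketch of the idea only, not a final patch):

    if (got_SIGHUP)
    {
        got_SIGHUP = false;
        ProcessConfigFile(PGC_SIGHUP);
        /* eagerly reparse s_s_names into SyncRepConfig */
        SyncRepUpdateConfig();
    }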

>> > 3.
>> >       <title>Planning for High Availability</title>
>> >
>> >      <para>
>> > !     <varname>synchronous_standby_names</> specifies the number of
>> > !     synchronous standbys that transaction commits made when
>> >
>> > Is it better to say like: <varname>synchronous_standby_names</>
>> > specifies
>> > the number and names of
>>
>> Precisely s_s_names specifies a list of names of potential sync standbys
>> not sync ones.
>>
>
> Okay, but you doesn't seem to have updated this in your latest patch.

I applied the change you suggested, to the patch. Thanks!

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Apr 5, 2016 at 11:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 4, 2016 at 4:28 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>> + ereport(LOG,
>>>> + (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
>>>> + application_name, MyWalSnd->sync_standby_priority)));
>>>>
>>>> s/ the / a /
>>
>> I have no objection to this change itself. But we have used this message
>> in 9.5 or before, so if we apply this change, probably we need
>> back-patching.
>
> "the" implies that there can be only one synchronous standby at that
> priority, while "a" implies that there could be more than one.  So the
> situation might be different with this patch than previously.  (I
> haven't read the patch so I don't know whether this is actually true,
> but it might be what Thomas was going for.)

Thanks for the explanation!
I applied that change, to the latest patch I posted upthread.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Wed, Apr 6, 2016 at 2:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Apr 5, 2016 at 8:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 5 April 2016 at 12:26, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>>>
>>> Multiple standbys with the same name may connect to the master.
>>> In this case, users might want to specifiy k<=N. So k<=N seems not invalid
>>> setting.
>>
>>
>> Confusing as that is, it is already the case; k > N could make sense. ;-(
>>
>> However, in most cases, k > N would not make sense and we should issue a
>> WARNING.
>
> Somebody (maybe Horiguchi-san and Sawada-san) commented this upthread
> and the code for that test was included in the old patch (but I excluded it).
> Now the majority seems to prefer to add that test, so I just revived and
> revised that test code.

The regression test code seems not to be included in the latest patch, no?

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Apr 6, 2016 at 2:51 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 2:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Tue, Apr 5, 2016 at 8:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 5 April 2016 at 12:26, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>
>>>>
>>>> Multiple standbys with the same name may connect to the master.
>>>> In this case, users might want to specifiy k<=N. So k<=N seems not invalid
>>>> setting.
>>>
>>>
>>> Confusing as that is, it is already the case; k > N could make sense. ;-(
>>>
>>> However, in most cases, k > N would not make sense and we should issue a
>>> WARNING.
>>
>> Somebody (maybe Horiguchi-san and Sawada-san) commented this upthread
>> and the code for that test was included in the old patch (but I excluded it).
>> Now the majority seems to prefer to add that test, so I just revived and
>> revised that test code.
>
> The regression test codes seems not to be included in latest patch, no?

I am looking at the latest patch now, and they are not included. It
would be good to get those tests bundled in for a last look, I think.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Apr 6, 2016 at 2:18 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Tue, 5 Apr 2016 20:17:21 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwE8_F79BUpC5TmJ7aazXU=Uju0VznFCCKDK57-wNpHV-g@mail.gmail.com>
>> >> list_member_int() performs the loop internally. So I'm not sure how much
>> >> adding extra list_member_int() here can optimize this processing.
>> >> Another idea is to make SyncRepGetSyncStandby() check whether I'm sync
>> >> standby or not. In this idea, without adding extra loop, we can exit earilier
>> >> in the case where I'm not a sync standby. Does this make sense?
>> >
>> > The list_member_int() is also performed in the "(snip)" part. So
>> > SyncRepGetSyncStandbys() returning am_sync seems making sense.
>> >
>> > sync_standbys = SyncRepGetSyncStandbys(am_sync);
>> >
>> > /*
>> >  *  Quick exit if I am not synchronous or there's not
>> >  *  enough synchronous standbys
>> >  * /
>> > if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync)
>> > {
>> >   list_free(sync_standbys);
>> >   return false;
>> > }
>>
>> Thanks for the comment! I changed SyncRepGetSyncStandbys() so that
>> it checks whether we're managing a sync standby or not.
>> Attached is the updated version of the patch. I also applied several
>> review comments to the patch.
>
> It still does list_member_int but it can be gotten rid of as the
> attached patch.

Thanks for the review!

>
> regards,
>
> diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
> index 9b2137a..6998bb8 100644
> --- a/src/backend/replication/syncrep.c
> +++ b/src/backend/replication/syncrep.c
> @@ -590,6 +590,10 @@ SyncRepGetSyncStandbys(bool *am_sync)
>                 if (XLogRecPtrIsInvalid(walsnd->flush))
>                         continue;
>
> +               /* Notify myself as 'synchonized' if I am */
> +               if (am_sync != NULL && walsnd == MyWalSnd)
> +                       *am_sync = true;
> +
>                 /*
>                  * If the priority is equal to 1, consider this standby as sync
>                  * and append it to the result. Otherwise append this standby
> @@ -598,8 +602,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
>                 if (this_priority == 1)
>                 {
>                         result = lappend_int(result, i);
> -                       if (am_sync != NULL && walsnd == MyWalSnd)
> -                               *am_sync = true;
>                         if (list_length(result) == SyncRepConfig->num_sync)
>                         {
>                                 list_free(pending);
> @@ -630,9 +632,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
>         {
>                 bool            needfree = (result != NIL && pending != NIL);
>
> -               if (am_sync != NULL && !(*am_sync))
> -                       *am_sync = list_member_int(pending, MyWalSnd->slotno);
> -
>                 result = list_concat(result, pending);
>                 if (needfree)
>                         pfree(pending);
> @@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync)
>         }
>
>         /*
> +        * The pending list contains eventually potentially-synchronized standbys
> +        * and this walsender may be one of them. So once reset am_sync.
> +        */
> +       if (am_sync != NULL)
> +               *am_sync = false;
> +
> +       /*

This code seems wrong in the case where this walsender is in the result list.
So I adopted another logic. Attached is the updated version of the patch.

Regards,

--
Fujii Masao

Attachment

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Apr 6, 2016 at 2:51 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 2:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Tue, Apr 5, 2016 at 8:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 5 April 2016 at 12:26, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>
>>>>
>>>> Multiple standbys with the same name may connect to the master.
>>>> In this case, users might want to specifiy k<=N. So k<=N seems not invalid
>>>> setting.
>>>
>>>
>>> Confusing as that is, it is already the case; k > N could make sense. ;-(
>>>
>>> However, in most cases, k > N would not make sense and we should issue a
>>> WARNING.
>>
>> Somebody (maybe Horiguchi-san and Sawada-san) commented this upthread
>> and the code for that test was included in the old patch (but I excluded it).
>> Now the majority seems to prefer to add that test, so I just revived and
>> revised that test code.
>
> The regression test codes seems not to be included in latest patch, no?

I intentionally excluded the regression test from the patch because
I'd like to review and commit it separately from the main part of the feature.

I'd appreciate if you read through the regression test which was included
in previous patch and update it if required.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Apr 6, 2016 at 3:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 2:18 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> At Tue, 5 Apr 2016 20:17:21 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwE8_F79BUpC5TmJ7aazXU=Uju0VznFCCKDK57-wNpHV-g@mail.gmail.com>
>>> >> list_member_int() performs the loop internally. So I'm not sure how much
>>> >> adding extra list_member_int() here can optimize this processing.
>>> >> Another idea is to make SyncRepGetSyncStandby() check whether I'm sync
>>> >> standby or not. In this idea, without adding extra loop, we can exit earilier
>>> >> in the case where I'm not a sync standby. Does this make sense?
>>> >
>>> > The list_member_int() is also performed in the "(snip)" part. So
>>> > SyncRepGetSyncStandbys() returning am_sync seems making sense.
>>> >
>>> > sync_standbys = SyncRepGetSyncStandbys(am_sync);
>>> >
>>> > /*
>>> >  *  Quick exit if I am not synchronous or there's not
>>> >  *  enough synchronous standbys
>>> >  * /
>>> > if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync)
>>> > {
>>> >   list_free(sync_standbys);
>>> >   return false;
>>> > }
>>>
>>> Thanks for the comment! I changed SyncRepGetSyncStandbys() so that
>>> it checks whether we're managing a sync standby or not.
>>> Attached is the updated version of the patch. I also applied several
>>> review comments to the patch.
>>
>> It still does list_member_int but it can be gotten rid of as the
>> attached patch.
>
> Thanks for the review!
>
>>
>> regards,
>>
>> diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
>> index 9b2137a..6998bb8 100644
>> --- a/src/backend/replication/syncrep.c
>> +++ b/src/backend/replication/syncrep.c
>> @@ -590,6 +590,10 @@ SyncRepGetSyncStandbys(bool *am_sync)
>>                 if (XLogRecPtrIsInvalid(walsnd->flush))
>>                         continue;
>>
>> +               /* Notify myself as 'synchonized' if I am */
>> +               if (am_sync != NULL && walsnd == MyWalSnd)
>> +                       *am_sync = true;
>> +
>>                 /*
>>                  * If the priority is equal to 1, consider this standby as sync
>>                  * and append it to the result. Otherwise append this standby
>> @@ -598,8 +602,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
>>                 if (this_priority == 1)
>>                 {
>>                         result = lappend_int(result, i);
>> -                       if (am_sync != NULL && walsnd == MyWalSnd)
>> -                               *am_sync = true;
>>                         if (list_length(result) == SyncRepConfig->num_sync)
>>                         {
>>                                 list_free(pending);
>> @@ -630,9 +632,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
>>         {
>>                 bool            needfree = (result != NIL && pending != NIL);
>>
>> -               if (am_sync != NULL && !(*am_sync))
>> -                       *am_sync = list_member_int(pending, MyWalSnd->slotno);
>> -
>>                 result = list_concat(result, pending);
>>                 if (needfree)
>>                         pfree(pending);
>> @@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync)
>>         }
>>
>>         /*
>> +        * The pending list contains eventually potentially-synchronized standbys
>> +        * and this walsender may be one of them. So once reset am_sync.
>> +        */
>> +       if (am_sync != NULL)
>> +               *am_sync = false;
>> +
>> +       /*
>
> This code seems wrong in the case where this walsender is in the result list.
> So I adopted another logic. Attached is the updated version of the patch.

To be honest, this is a nice patch that we have here, and it received
a fair amount of work. I have been playing with it a bit but I could
not break it.

Here are a few things I have noticed:
+   for (i = 0; i < max_wal_senders; i++)
+   {
+       walsnd = &WalSndCtl->walsnds[i];
No volatile pointer to prevent code reordering?
 */
typedef struct WalSnd
{
+   int     slotno;         /* index of this slot in WalSnd array */
    pid_t       pid;            /* this walsender's process id, or 0 */
slotno is used nowhere.

I'll grab the tests and look at them.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Wed, 6 Apr 2016 15:29:12 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwHGQEwH2c9buiZ=G7Ko8PQYwiU7=NsDkvCjRKUPSN8j7A@mail.gmail.com>
> > @@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync)
> >         }
> >
> >         /*
> > +        * The pending list contains eventually potentially-synchronized standbys
> > +        * and this walsender may be one of them. So once reset am_sync.
> > +        */
> > +       if (am_sync != NULL)
> > +               *am_sync = false;
> > +
> > +       /*
> 
> This code seems wrong in the case where this walsender is in the result list.
> So I adopted another logic. Attached is the updated version of the patch.

You must have misread the patch. am_sync is originally set in the loop
just after that, for that case.

!     while (priority <= lowest_priority)
!     {
..
!         for (cell = list_head(pending); cell != NULL; cell = next)
!         {
...
!             if (this_priority == priority)
!             {
!                 result = lappend_int(result, i);
!                 if (am_sync != NULL && walsnd == MyWalSnd)
!                     *am_sync = true;

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Here are few things I have noticed:
> +   for (i = 0; i < max_wal_senders; i++)
> +   {
> +       walsnd = &WalSndCtl->walsnds[i];
> No volatile pointer to prevent code reordering?
>
>   */
>  typedef struct WalSnd
>  {
> +   int     slotno;         /* index of this slot in WalSnd array */
>     pid_t       pid;            /* this walsender's process id, or 0 */
> slotno is used nowhere.
>
> I'll grab the tests and look at them.

So I had a look at those tests and finished with the attached:
- patch 1 adds a reload routine to PostgresNode
- patch 2 adds the list of tests.

I took the tests from patch 21 and did many tweaks on them:
- Use of qq() instead of quotes
- Removal of hardcoded newlines
- typo fixes and sanity fixes
- etc.
Regards,
--
Michael

Attachment

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 3:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Wed, Apr 6, 2016 at 2:18 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> At Tue, 5 Apr 2016 20:17:21 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwE8_F79BUpC5TmJ7aazXU=Uju0VznFCCKDK57-wNpHV-g@mail.gmail.com>
>>>> >> list_member_int() performs the loop internally. So I'm not sure how much
>>>> >> adding extra list_member_int() here can optimize this processing.
>>>> >> Another idea is to make SyncRepGetSyncStandby() check whether I'm sync
>>>> >> standby or not. In this idea, without adding extra loop, we can exit earilier
>>>> >> in the case where I'm not a sync standby. Does this make sense?
>>>> >
>>>> > The list_member_int() is also performed in the "(snip)" part. So
>>>> > SyncRepGetSyncStandbys() returning am_sync seems making sense.
>>>> >
>>>> > sync_standbys = SyncRepGetSyncStandbys(am_sync);
>>>> >
>>>> > /*
>>>> >  *  Quick exit if I am not synchronous or there's not
>>>> >  *  enough synchronous standbys
>>>> >  * /
>>>> > if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync)
>>>> > {
>>>> >   list_free(sync_standbys);
>>>> >   return false;
>>>> > }
>>>>
>>>> Thanks for the comment! I changed SyncRepGetSyncStandbys() so that
>>>> it checks whether we're managing a sync standby or not.
>>>> Attached is the updated version of the patch. I also applied several
>>>> review comments to the patch.
>>>
>>> It still does list_member_int but it can be gotten rid of as the
>>> attached patch.
>>
>> Thanks for the review!
>>
>>>
>>> regards,
>>>
>>> diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
>>> index 9b2137a..6998bb8 100644
>>> --- a/src/backend/replication/syncrep.c
>>> +++ b/src/backend/replication/syncrep.c
>>> @@ -590,6 +590,10 @@ SyncRepGetSyncStandbys(bool *am_sync)
>>>                 if (XLogRecPtrIsInvalid(walsnd->flush))
>>>                         continue;
>>>
>>> +               /* Notify myself as 'synchonized' if I am */
>>> +               if (am_sync != NULL && walsnd == MyWalSnd)
>>> +                       *am_sync = true;
>>> +
>>>                 /*
>>>                  * If the priority is equal to 1, consider this standby as sync
>>>                  * and append it to the result. Otherwise append this standby
>>> @@ -598,8 +602,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
>>>                 if (this_priority == 1)
>>>                 {
>>>                         result = lappend_int(result, i);
>>> -                       if (am_sync != NULL && walsnd == MyWalSnd)
>>> -                               *am_sync = true;
>>>                         if (list_length(result) == SyncRepConfig->num_sync)
>>>                         {
>>>                                 list_free(pending);
>>> @@ -630,9 +632,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
>>>         {
>>>                 bool            needfree = (result != NIL && pending != NIL);
>>>
>>> -               if (am_sync != NULL && !(*am_sync))
>>> -                       *am_sync = list_member_int(pending, MyWalSnd->slotno);
>>> -
>>>                 result = list_concat(result, pending);
>>>                 if (needfree)
>>>                         pfree(pending);
>>> @@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync)
>>>         }
>>>
>>>         /*
>>> +        * The pending list contains eventually potentially-synchronized standbys
>>> +        * and this walsender may be one of them. So once reset am_sync.
>>> +        */
>>> +       if (am_sync != NULL)
>>> +               *am_sync = false;
>>> +
>>> +       /*
>>
>> This code seems wrong in the case where this walsender is in the result list.
>> So I adopted another logic. Attached is the updated version of the patch.
>
> To be honest, this is a nice patch that we have here, and it received
> a fair amount of work. I have been playing with it a bit but I could
> not break it.
>
> Here are few things I have noticed:

Thanks for the review!

> +   for (i = 0; i < max_wal_senders; i++)
> +   {
> +       walsnd = &WalSndCtl->walsnds[i];
> No volatile pointer to prevent code reordering?

Yes. Since the spinlock is not taken there, volatile is necessary.
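
For illustration, a minimal sketch of the idiom in question (not the
exact patch code):

    for (i = 0; i < max_wal_senders; i++)
    {
        /* volatile keeps the compiler from reordering or caching reads
         * of shared WalSnd fields accessed without the spinlock */
        volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];

        if (walsnd->pid == 0)
            continue;
    }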

>   */
>  typedef struct WalSnd
>  {
> +   int     slotno;         /* index of this slot in WalSnd array */
>     pid_t       pid;            /* this walsender's process id, or 0 */
> slotno is used nowhere.

Yep. Attached is the updated version of the patch.

> I'll grab the tests and look at them.

Many thanks!

Regards,

--
Fujii Masao

Attachment

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Apr 6, 2016 at 5:01 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Wed, 6 Apr 2016 15:29:12 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwHGQEwH2c9buiZ=G7Ko8PQYwiU7=NsDkvCjRKUPSN8j7A@mail.gmail.com>
>> > @@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync)
>> >         }
>> >
>> >         /*
>> > +        * The pending list contains eventually potentially-synchronized standbys
>> > +        * and this walsender may be one of them. So once reset am_sync.
>> > +        */
>> > +       if (am_sync != NULL)
>> > +               *am_sync = false;
>> > +
>> > +       /*
>>
>> This code seems wrong in the case where this walsender is in the result list.
>> So I adopted another logic. Attached is the updated version of the patch.
>
> You must misread the patch. am_sync is originally set in the loop
> just after that for the case.
>
> !       while (priority <= lowest_priority)
> !       {
> ..
> !               for (cell = list_head(pending); cell != NULL; cell = next)
> !               {
> ...
> !                       if (this_priority == priority)
> !                       {
> !                               result = lappend_int(result, i);
> !                               if (am_sync != NULL && walsnd == MyWalSnd)
> !                                       *am_sync = true;

But if this walsender has priority 1, *am_sync is set to true in
the first loop, not the second one. No?

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Sorry, my code was wrong in the case where the total number of
synchronous standbys exceeds the required number and the walsender is
at priority 1.

Sorry for the noise.

At Wed, 06 Apr 2016 17:01:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20160406.170151.246853881.horiguchi.kyotaro@lab.ntt.co.jp>
> You must misread the patch. am_sync is originally set in the loop
> just after that for the case.

regards,


-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Apr 6, 2016 at 5:07 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Apr 6, 2016 at 3:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> On Wed, Apr 6, 2016 at 2:18 PM, Kyotaro HORIGUCHI
>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>> At Tue, 5 Apr 2016 20:17:21 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwE8_F79BUpC5TmJ7aazXU=Uju0VznFCCKDK57-wNpHV-g@mail.gmail.com>
>>>>> >> list_member_int() performs the loop internally. So I'm not sure how much
>>>>> >> adding extra list_member_int() here can optimize this processing.
>>>>> >> Another idea is to make SyncRepGetSyncStandby() check whether I'm sync
>>>>> >> standby or not. In this idea, without adding extra loop, we can exit earilier
>>>>> >> in the case where I'm not a sync standby. Does this make sense?
>>>>> >
>>>>> > The list_member_int() is also performed in the "(snip)" part. So
>>>>> > SyncRepGetSyncStandbys() returning am_sync seems making sense.
>>>>> >
>>>>> > sync_standbys = SyncRepGetSyncStandbys(am_sync);
>>>>> >
>>>>> > /*
>>>>> >  *  Quick exit if I am not synchronous or there's not
>>>>> >  *  enough synchronous standbys
>>>>> >  * /
>>>>> > if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync)
>>>>> > {
>>>>> >   list_free(sync_standbys);
>>>>> >   return false;
>>>>> > }
>>>>>
>>>>> Thanks for the comment! I changed SyncRepGetSyncStandbys() so that
>>>>> it checks whether we're managing a sync standby or not.
>>>>> Attached is the updated version of the patch. I also applied several
>>>>> review comments to the patch.
>>>>
>>>> It still does list_member_int but it can be gotten rid of as the
>>>> attached patch.
>>>
>>> Thanks for the review!
>>>
>>>>
>>>> regards,
>>>>
>>>> diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
>>>> index 9b2137a..6998bb8 100644
>>>> --- a/src/backend/replication/syncrep.c
>>>> +++ b/src/backend/replication/syncrep.c
>>>> @@ -590,6 +590,10 @@ SyncRepGetSyncStandbys(bool *am_sync)
>>>>                 if (XLogRecPtrIsInvalid(walsnd->flush))
>>>>                         continue;
>>>>
>>>> +               /* Notify myself as 'synchonized' if I am */
>>>> +               if (am_sync != NULL && walsnd == MyWalSnd)
>>>> +                       *am_sync = true;
>>>> +
>>>>                 /*
>>>>                  * If the priority is equal to 1, consider this standby as sync
>>>>                  * and append it to the result. Otherwise append this standby
>>>> @@ -598,8 +602,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
>>>>                 if (this_priority == 1)
>>>>                 {
>>>>                         result = lappend_int(result, i);
>>>> -                       if (am_sync != NULL && walsnd == MyWalSnd)
>>>> -                               *am_sync = true;
>>>>                         if (list_length(result) == SyncRepConfig->num_sync)
>>>>                         {
>>>>                                 list_free(pending);
>>>> @@ -630,9 +632,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
>>>>         {
>>>>                 bool            needfree = (result != NIL && pending != NIL);
>>>>
>>>> -               if (am_sync != NULL && !(*am_sync))
>>>> -                       *am_sync = list_member_int(pending, MyWalSnd->slotno);
>>>> -
>>>>                 result = list_concat(result, pending);
>>>>                 if (needfree)
>>>>                         pfree(pending);
>>>> @@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync)
>>>>         }
>>>>
>>>>         /*
>>>> +        * The pending list contains eventually potentially-synchronized standbys
>>>> +        * and this walsender may be one of them. So once reset am_sync.
>>>> +        */
>>>> +       if (am_sync != NULL)
>>>> +               *am_sync = false;
>>>> +
>>>> +       /*
>>>
>>> This code seems wrong in the case where this walsender is in the result list.
>>> So I adopted another logic. Attached is the updated version of the patch.
>>
>> To be honest, this is a nice patch that we have here, and it received
>> a fair amount of work. I have been playing with it a bit but I could
>> not break it.
>>
>> Here are few things I have noticed:
>
> Thanks for the review!
>
>> +   for (i = 0; i < max_wal_senders; i++)
>> +   {
>> +       walsnd = &WalSndCtl->walsnds[i];
>> No volatile pointer to prevent code reordering?
>
> Yes. Since spin lock is not taken there, volatile is necessary.
>
>>   */
>>  typedef struct WalSnd
>>  {
>> +   int     slotno;         /* index of this slot in WalSnd array */
>>     pid_t       pid;            /* this walsender's process id, or 0 */
>> slotno is used nowhere.
>
> Yep. Attached is the updated version of the patch.

Okay, I pushed the patch!
Many thanks to all involved in the development of this feature!

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Wed, Apr 6, 2016 at 5:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Okay, I pushed the patch!
> Many thanks to all involved in the development of this feature!

I think that I am crying... Really cool to see this milestone accomplished.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Simon Riggs
Date:
On 6 April 2016 at 09:23, Fujii Masao <masao.fujii@gmail.com> wrote:
 
Okay, I pushed the patch!
Many thanks to all involved in the development of this feature!

Very good.

I think the description in the commit message that we don't support "quorum commit" is sufficient to cover my concerns about what others might expect from this feature. Could we add similar wording to the docs?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Wed, Apr 6, 2016 at 11:17 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Tue, Apr 5, 2016 at 11:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> > 2.
> >> > pg_stat_get_wal_senders()
> >> > {
> >> > ..
> >> > /*
> >> > ! * Allocate and update the config data of synchronous replication,
> >> > ! * and then get the currently active synchronous standbys.
> >> >   */
> >> > + SyncRepUpdateConfig();
> >> >   LWLockAcquire(SyncRepLock, LW_SHARED);
> >> > ! sync_standbys = SyncRepGetSyncStandbys();
> >> >   LWLockRelease(SyncRepLock);
> >> > ..
> >> > }
> >> >
> >> > Why is it important to update the config with patch?  Earlier also any
> >> > update to config between calls wouldn't have been visible.
> >>
> >> Because a backend has no chance to call SyncRepUpdateConfig() and
> >> parse the latest value of s_s_names if SyncRepUpdateConfig() is not
> >> called here. This means that pg_stat_replication may return the
> >> information
> >> based on the old value of s_s_names.
> >>
> >
> > Thats right, but without this patch also won't pg_stat_replication can show
> > old information? If no, why so?
>
> Without the patch, when s_s_names is changed and SIGHUP is sent,
> a backend calls ProcessConfigFile(), parse the configuration file and
> set the global variable SyncRepStandbyNames to the latest value of
> s_s_names. When pg_stat_replication is accessed, a backend calculates
> which standby is synchronous based on that latest value in SyncRepStandbyNames,
> and then displays the information of sync replication.
>
> With the patch, basically the same steps are executed when s_s_names is
> changed. But the difference is that, with the patch, SyncRepUpdateConfig()
> must be called after ProcessConfigFile() is called before the calculation of
> sync standbys. So I just added the call of SyncRepUpdateConfig() to
> pg_stat_get_wal_senders().
>

Then why call it just in pg_stat_get_wal_senders()? Isn't it better if we always call it after ProcessConfigFile() (after setting SyncRepStandbyNames)?

> BTW, we can move SyncRepUpdateConfig() just after ProcessConfigFile()
> from pg_stat_get_wal_senders() and every backends always parse the value
> of s_s_names when the setting is changed.
>

That sounds appropriate, but I'm not sure of the exact place to call it.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Apr 6, 2016 at 8:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 11:17 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>> On Tue, Apr 5, 2016 at 11:40 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> >>
>> >> > 2.
>> >> > pg_stat_get_wal_senders()
>> >> > {
>> >> > ..
>> >> > /*
>> >> > ! * Allocate and update the config data of synchronous replication,
>> >> > ! * and then get the currently active synchronous standbys.
>> >> >   */
>> >> > + SyncRepUpdateConfig();
>> >> >   LWLockAcquire(SyncRepLock, LW_SHARED);
>> >> > ! sync_standbys = SyncRepGetSyncStandbys();
>> >> >   LWLockRelease(SyncRepLock);
>> >> > ..
>> >> > }
>> >> >
>> >> > Why is it important to update the config with patch?  Earlier also
>> >> > any
>> >> > update to config between calls wouldn't have been visible.
>> >>
>> >> Because a backend has no chance to call SyncRepUpdateConfig() and
>> >> parse the latest value of s_s_names if SyncRepUpdateConfig() is not
>> >> called here. This means that pg_stat_replication may return the
>> >> information
>> >> based on the old value of s_s_names.
>> >>
>> >
>> > Thats right, but without this patch also won't pg_stat_replication can
>> > show
>> > old information? If no, why so?
>>
>> Without the patch, when s_s_names is changed and SIGHUP is sent,
>> a backend calls ProcessConfigFile(), parse the configuration file and
>> set the global variable SyncRepStandbyNames to the latest value of
>> s_s_names. When pg_stat_replication is accessed, a backend calculates
>> which standby is synchronous based on that latest value in
>> SyncRepStandbyNames,
>> and then displays the information of sync replication.
>>
>> With the patch, basically the same steps are executed when s_s_names is
>> changed. But the difference is that, with the patch, SyncRepUpdateConfig()
>> must be called after ProcessConfigFile() is called before the calculation
>> of
>> sync standbys. So I just added the call of SyncRepUpdateConfig() to
>> pg_stat_get_wal_senders().
>>
>
> Then why to call it just in pg_stat_get_wal_senders(), isn't it better if we
> call it always after ProcessConfigFile() (after setting SyncRepStandbyNames)
>
>> BTW, we can move SyncRepUpdateConfig() just after ProcessConfigFile()
>> from pg_stat_get_wal_senders() and every backends always parse the value
>> of s_s_names when the setting is changed.
>>
>
> That sounds appropriate, but not sure what is exact place to call it.

Maybe just after the following ProcessConfigFile().

-----------------------------------------
/*
* (6) check for any other interesting events that happened while we
* slept.
*/
if (got_SIGHUP)
{
got_SIGHUP = false;
ProcessConfigFile(PGC_SIGHUP);
}
-----------------------------------------

If we do the move, we also need to either (1) make the postmaster call
SyncRepUpdateConfig() and pass the parsed result to any forked backends
via a file, like write_nondefault_variables() does for the EXEC_BACKEND
environment, or (2) make a backend call SyncRepUpdateConfig() during
its initialization phase so that the first call of pg_stat_replication
can use the parsed result. (1) seems complicated and overkill.
(2) may add a very small overhead to the fork of a backend. It would
be almost negligible, though. So which logic should we adopt?
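
For illustration, option (2) could be as small as this in the backend
initialization path (a sketch, assuming the non-default value of
SyncRepStandbyNames has already been restored by that point):

    /* Regenerate the parsed SyncRepConfig from SyncRepStandbyNames so
     * that the first call of pg_stat_replication sees current data. */
    SyncRepUpdateConfig();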

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Wed, Apr 6, 2016 at 7:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Wed, Apr 6, 2016 at 8:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >> BTW, we can move SyncRepUpdateConfig() just after ProcessConfigFile()
> >> from pg_stat_get_wal_senders() and every backends always parse the value
> >> of s_s_names when the setting is changed.
> >>
> >
> > That sounds appropriate, but not sure what is exact place to call it.
>
> Maybe just after the following ProcessConfigFile().
>
> -----------------------------------------
> /*
> * (6) check for any other interesting events that happened while we
> * slept.
> */
> if (got_SIGHUP)
> {
> got_SIGHUP = false;
> ProcessConfigFile(PGC_SIGHUP);
> }
> -----------------------------------------
>
> If we do the move, we also need to either (1) make postmaster call
> SyncRepUpdateConfig() and pass the parsed result to any forked backends
> via a file like write_nondefault_variables() does for EXEC_BACKEND
> environment, or (2) make a backend call SyncRepUpdateConfig() during
> its initialization phase so that the first call of pg_stat_replication
> can use the parsed result. (1) seems complicated and overkill.
> (2) may add very small overhead into the fork of a backend. It would
> be almost negligible, though. So which logic should we adopt?
>

Won't it be possible to have an assign_* function for synchronous_standby_names, as we have for some of the other settings like assign_XactIsoLevel, and then call SyncRepUpdateConfig() in that function?
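
For illustration, one way such a hook could be wired (a hedged sketch;
the hook name is hypothetical, and it assumes the check hook parses
newval and stashes the parsed SyncRepConfigData in *extra, as other
GUCs do with their extra data):

    static void
    assign_synchronous_standby_names(const char *newval, void *extra)
    {
        /* install the configuration parsed by the check hook */
        SyncRepConfig = (SyncRepConfigData *) extra;
    }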



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Apr 6, 2016 at 11:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 7:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>> On Wed, Apr 6, 2016 at 8:59 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> >
>> >> BTW, we can move SyncRepUpdateConfig() just after ProcessConfigFile()
>> >> from pg_stat_get_wal_senders() and every backends always parse the
>> >> value
>> >> of s_s_names when the setting is changed.
>> >>
>> >
>> > That sounds appropriate, but not sure what is exact place to call it.
>>
>> Maybe just after the following ProcessConfigFile().
>>
>> -----------------------------------------
>> /*
>> * (6) check for any other interesting events that happened while we
>> * slept.
>> */
>> if (got_SIGHUP)
>> {
>> got_SIGHUP = false;
>> ProcessConfigFile(PGC_SIGHUP);
>> }
>> -----------------------------------------
>>
>> If we do the move, we also need to either (1) make postmaster call
>> SyncRepUpdateConfig() and pass the parsed result to any forked backends
>> via a file like write_nondefault_variables() does for EXEC_BACKEND
>> environment, or (2) make a backend call SyncRepUpdateConfig() during
>> its initialization phase so that the first call of pg_stat_replication
>> can use the parsed result. (1) seems complicated and overkill.
>> (2) may add very small overhead into the fork of a backend. It would
>> be almost negligible, though. So which logic should we adopt?
>>
>
> Won't it be possible to have assign_* function for synchronous_standby_names
> as we have for some of the other settings like assign_XactIsoLevel and then
> call SyncRepUpdateConfig() in that function?

It's possible, but it still seems to need (1), i.e., the variable that the
assign_XXX function assigned needs to be passed to a backend via a file in
the EXEC_BACKEND environment.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Wed, Apr 6, 2016 at 8:11 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Wed, Apr 6, 2016 at 11:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Apr 6, 2016 at 7:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> >>
> >> On Wed, Apr 6, 2016 at 8:59 PM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> >
> >> >> BTW, we can move SyncRepUpdateConfig() just after ProcessConfigFile()
> >> >> from pg_stat_get_wal_senders() and every backends always parse the
> >> >> value
> >> >> of s_s_names when the setting is changed.
> >> >>
> >> >
> >> > That sounds appropriate, but not sure what is exact place to call it.
> >>
> >> Maybe just after the following ProcessConfigFile().
> >>
> >> -----------------------------------------
> >> /*
> >> * (6) check for any other interesting events that happened while we
> >> * slept.
> >> */
> >> if (got_SIGHUP)
> >> {
> >> got_SIGHUP = false;
> >> ProcessConfigFile(PGC_SIGHUP);
> >> }
> >> -----------------------------------------
> >>
> >> If we do the move, we also need to either (1) make postmaster call
> >> SyncRepUpdateConfig() and pass the parsed result to any forked backends
> >> via a file like write_nondefault_variables() does for EXEC_BACKEND
> >> environment, or (2) make a backend call SyncRepUpdateConfig() during
> >> its initialization phase so that the first call of pg_stat_replication
> >> can use the parsed result. (1) seems complicated and overkill.
> >> (2) may add very small overhead into the fork of a backend. It would
> >> be almost negligible, though. So which logic should we adopt?
> >>
> >
> > Won't it be possible to have assign_* function for synchronous_standby_names
> > as we have for some of the other settings like assign_XactIsoLevel and then
> > call SyncRepUpdateConfig() in that function?
>
> It's possible, but still seems to need (1), i.e., the variable that assign_XXX
> function assigned needs to be passed to a backend via file for EXEC_BACKEND
> environment.
>

But for that, I think we don't need to do anything extra.  I mean, write_nondefault_variables() will automatically write the non-default value of the variable, and then during backend initialization, it will call read_nondefault_variables(), which will call set_config_option() for the non-default parameters, and that should set the required value if we have an assign_* function defined for the variable.
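
For illustration, the flow described above uses the existing guc.c entry
points like this (the comments paraphrase the mechanism; no new machinery
is assumed beyond the hypothetical assign hook):

    /* In the postmaster (EXEC_BACKEND), before spawning a backend: */
    write_nondefault_variables(PGC_POSTMASTER);

    /* In the new backend, early in its startup sequence: */
    read_nondefault_variables();    /* calls set_config_option() for each
                                     * non-default GUC, which also runs any
                                     * assign_* hook defined for it */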



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Thu, Apr 7, 2016 at 1:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 8:11 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>> On Wed, Apr 6, 2016 at 11:14 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > On Wed, Apr 6, 2016 at 7:03 PM, Fujii Masao <masao.fujii@gmail.com>
>> > wrote:
>> >>
>> >> On Wed, Apr 6, 2016 at 8:59 PM, Amit Kapila <amit.kapila16@gmail.com>
>> >> wrote:
>> >> >
>> >> >> BTW, we can move SyncRepUpdateConfig() just after
>> >> >> ProcessConfigFile()
>> >> >> from pg_stat_get_wal_senders() and every backends always parse the
>> >> >> value
>> >> >> of s_s_names when the setting is changed.
>> >> >>
>> >> >
>> >> > That sounds appropriate, but not sure what is exact place to call it.
>> >>
>> >> Maybe just after the following ProcessConfigFile().
>> >>
>> >> -----------------------------------------
>> >> /*
>> >> * (6) check for any other interesting events that happened while we
>> >> * slept.
>> >> */
>> >> if (got_SIGHUP)
>> >> {
>> >> got_SIGHUP = false;
>> >> ProcessConfigFile(PGC_SIGHUP);
>> >> }
>> >> -----------------------------------------
>> >>
>> >> If we do the move, we also need to either (1) make postmaster call
>> >> SyncRepUpdateConfig() and pass the parsed result to any forked backends
>> >> via a file like write_nondefault_variables() does for EXEC_BACKEND
>> >> environment, or (2) make a backend call SyncRepUpdateConfig() during
>> >> its initialization phase so that the first call of pg_stat_replication
>> >> can use the parsed result. (1) seems complicated and overkill.
>> >> (2) may add very small overhead into the fork of a backend. It would
>> >> be almost negligible, though. So which logic should we adopt?
>> >>
>> >
>> > Won't it be possible to have assign_* function for
>> > synchronous_standby_names
>> > as we have for some of the other settings like assign_XactIsoLevel and
>> > then
>> > call SyncRepUpdateConfig() in that function?
>>
>> It's possible, but still seems to need (1), i.e., the variable that
>> assign_XXX
>> function assigned needs to be passed to a backend via file for
>> EXEC_BACKEND
>> environment.
>>
>
> But for that, I think we don't need to do anything extra.  I mean
> write_nondefault_variables() will automatically write the non-default value
> of variable and then during backend initialization, it will call
> read_nondefault_variables which will call set_config_option for non-default
> parameters and that should set the required value if we have assign_*
> function defined for the variable.

Yes, if the variable that we'd like to pass to a backend is BOOL, INT,
REAL, STRING or ENUM. But the SyncRepConfig variable is a bit more
complicated. So ISTM that write_one_nondefault_variable() needs to
be updated so that SyncRepConfig is written to a file.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Thu, Apr 7, 2016 at 10:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Thu, Apr 7, 2016 at 1:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > But for that, I think we don't need to do anything extra.  I mean
> > write_nondefault_variables() will automatically write the non-default value
> > of variable and then during backend initialization, it will call
> > read_nondefault_variables which will call set_config_option for non-default
> > parameters and that should set the required value if we have assign_*
> > function defined for the variable.
>
> Yes if the variable that we'd like to pass to a backend is BOOL, INT,
> REAL, STRING or ENUM. But SyncRepConfig variable is a bit more
> complicated.
>

SyncRepConfig is the parsed result of SyncRepStandbyNames; why do you want to pass that?  I assume that the current non-default value of SyncRepStandbyNames will be passed via write_nondefault_variables(), so we can use that to regenerate SyncRepConfig.
 
>
> So ISTM that write_one_nondefault_variable() needs to
> be updated so that SyncRepConfig is written to a file.
>


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Thu, Apr 7, 2016 at 2:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Apr 7, 2016 at 10:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>> On Thu, Apr 7, 2016 at 1:22 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> >
>> > But for that, I think we don't need to do anything extra.  I mean
>> > write_nondefault_variables() will automatically write the non-default
>> > value
>> > of variable and then during backend initialization, it will call
>> > read_nondefault_variables which will call set_config_option for
>> > non-default
>> > parameters and that should set the required value if we have assign_*
>> > function defined for the variable.
>>
>> Yes if the variable that we'd like to pass to a backend is BOOL, INT,
>> REAL, STRING or ENUM. But SyncRepConfig variable is a bit more
>> complicated.
>>
>
> SyncRepConfig is a parsed result of SyncRepStandbyNames, why you want to
> pass that?  I assume that current non-default value of SyncRepStandbyNames
> will be passed via write_nondefault_variables(), so we can use that to
> regenerate SyncRepConfig.

Yes, so SyncRepUpdateConfig() needs to be called by a backend after fork,
to regenerate SyncRepConfig from the passed value of SyncRepStandbyNames.
This is the approach of (2) which I explained upthread. An assign_XXX
function doesn't seem to be helpful for this case.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Thu, Apr 7, 2016 at 11:56 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Thu, Apr 7, 2016 at 2:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Apr 7, 2016 at 10:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> >>
> >> On Thu, Apr 7, 2016 at 1:22 PM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> >
> >> > But for that, I think we don't need to do anything extra.  I mean
> >> > write_nondefault_variables() will automatically write the non-default
> >> > value
> >> > of variable and then during backend initialization, it will call
> >> > read_nondefault_variables which will call set_config_option for
> >> > non-default
> >> > parameters and that should set the required value if we have assign_*
> >> > function defined for the variable.
> >>
> >> Yes if the variable that we'd like to pass to a backend is BOOL, INT,
> >> REAL, STRING or ENUM. But SyncRepConfig variable is a bit more
> >> complicated.
> >>
> >
> > SyncRepConfig is a parsed result of SyncRepStandbyNames, why you want to
> > pass that?  I assume that current non-default value of SyncRepStandbyNames
> > will be passed via write_nondefault_variables(), so we can use that to
> > regenerate SyncRepConfig.
>
> Yes, so SyncRepUpdateConfig() needs to be called by a backend after fork,
> to regenerate SyncRepConfig from the passed value of SyncRepStandbyNames.
> This is the approach of (2) which I explained upthread. assign_XXX function
> doesn't seem to be helpful for this case.
>

Then where do you want to call it?  Also, this is only required for EXEC_BACKEND builds. 


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Amit Langote
Date:
On 2016/04/07 15:26, Fujii Masao wrote:
> On Thu, Apr 7, 2016 at 2:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Apr 7, 2016 at 10:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> Yes if the variable that we'd like to pass to a backend is BOOL, INT,
>>> REAL, STRING or ENUM. But SyncRepConfig variable is a bit more
>>> complicated.
>> SyncRepConfig is a parsed result of SyncRepStandbyNames, why you want to
>> pass that?  I assume that current non-default value of SyncRepStandbyNames
>> will be passed via write_nondefault_variables(), so we can use that to
>> regenerate SyncRepConfig.
> 
> Yes, so SyncRepUpdateConfig() needs to be called by a backend after fork,
> to regenerate SyncRepConfig from the passed value of SyncRepStandbyNames.
> This is the approach of (2) which I explained upthread. assign_XXX function
> doesn't seem to be helpful for this case.

I don't see why there is a need to call SyncRepUpdateConfig() after every
fork, or anywhere outside syncrep.c/walsender.c for that matter.  AIUI, only
a walsender or a backend that runs pg_stat_get_wal_senders() ever needs to
run SyncRepUpdateConfig() to get the parsed synchronous standbys info from
the string that is SyncRepStandbyNames.  For the rest of the world, it's
just a string GUC and is written to and read from any external file as one
(e.g. the file that write_nondefault_variables() writes to in the
EXEC_BACKEND case).  I hope I'm not entirely missing the point of the
discussion you and Amit are having.

Thanks,
Amit





Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Apr 7, 2016 at 1:30 PM, Amit Langote <<a
href="mailto:Langote_Amit_f8@lab.ntt.co.jp">Langote_Amit_f8@lab.ntt.co.jp</a>>wrote:<br />><br />> On
2016/04/0715:26, Fujii Masao wrote:<br />> > On Thu, Apr 7, 2016 at 2:48 PM, Amit Kapila <<a
href="mailto:amit.kapila16@gmail.com">amit.kapila16@gmail.com</a>>wrote:<br />> >> On Thu, Apr 7, 2016 at
10:02AM, Fujii Masao <<a href="mailto:masao.fujii@gmail.com">masao.fujii@gmail.com</a>> wrote:<br />>
>>>Yes if the variable that we'd like to pass to a backend is BOOL, INT,<br />> >>> REAL, STRING
orENUM. But SyncRepConfig variable is a bit more<br />> >>> complicated.<br />> >> SyncRepConfig
isa parsed result of SyncRepStandbyNames, why you want to<br />> >> pass that?  I assume that current
non-defaultvalue of SyncRepStandbyNames<br />> >> will be passed via write_nondefault_variables(), so we can
usethat to<br />> >> regenerate SyncRepConfig.<br />> ><br />> > Yes, so SyncRepUpdateConfig()
needsto be called by a backend after fork,<br />> > to regenerate SyncRepConfig from the passed value of
SyncRepStandbyNames.<br/>> > This is the approach of (2) which I explained upthread. assign_XXX function<br
/>>> doesn't seem to be helpful for this case.<br />><br />> I don't see why there is need to
SyncRepUpdateConfig()after every fork or<br />> anywhere outside syncrep.c/walsender.c for that matter.  AIUI,
only<br/>> walsender or a backend that runs pg_stat_get_wal_senders() ever needs to<br />> run
SyncRepUpdateConfig()to get parsed synchronous standbys info from the<br />> string that is
SyncRepStandbyNames.</div><divclass="gmail_quote">></div><div class="gmail_quote"><br /></div><div
class="gmail_quote">Soif we go by this each time backend calls pg_stat_get_wal_senders, it needs to do parsing to
form SyncRepConfigwhether it's changed or not from previous time.  I understand that this is not a performance critical
path,but still if we can do it in some other optimal way which doesn't hurt any other path, then it will be better.<br
/></div><divclass="gmail_quote"><br class="" /><br />With Regards,<br />Amit Kapila.<br />EnterpriseDB: <a
href="http://www.enterprisedb.com/"target="_blank">http://www.enterprisedb.com</a><br /></div></div></div> 

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Thu, Apr 7, 2016 at 7:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Apr 7, 2016 at 1:30 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
> wrote:
>>
>> On 2016/04/07 15:26, Fujii Masao wrote:
>> > On Thu, Apr 7, 2016 at 2:48 PM, Amit Kapila <amit.kapila16@gmail.com>
>> > wrote:
>> >> On Thu, Apr 7, 2016 at 10:02 AM, Fujii Masao <masao.fujii@gmail.com>
>> >> wrote:
>> >>> Yes if the variable that we'd like to pass to a backend is BOOL, INT,
>> >>> REAL, STRING or ENUM. But SyncRepConfig variable is a bit more
>> >>> complicated.
>> >> SyncRepConfig is a parsed result of SyncRepStandbyNames, why you want
>> >> to
>> >> pass that?  I assume that current non-default value of
>> >> SyncRepStandbyNames
>> >> will be passed via write_nondefault_variables(), so we can use that to
>> >> regenerate SyncRepConfig.
>> >
>> > Yes, so SyncRepUpdateConfig() needs to be called by a backend after
>> > fork,
>> > to regenerate SyncRepConfig from the passed value of
>> > SyncRepStandbyNames.
>> > This is the approach of (2) which I explained upthread. assign_XXX
>> > function
>> > doesn't seem to be helpful for this case.
>>
>> I don't see why there is need to SyncRepUpdateConfig() after every fork or
>> anywhere outside syncrep.c/walsender.c for that matter.  AIUI, only
>> walsender or a backend that runs pg_stat_get_wal_senders() ever needs to
>> run SyncRepUpdateConfig() to get parsed synchronous standbys info from the
>> string that is SyncRepStandbyNames.
>>
>
> So if we go by this each time backend calls pg_stat_get_wal_senders, it
> needs to do parsing to form SyncRepConfig whether it's changed or not from
> previous time.  I understand that this is not a performance critical path,
> but still if we can do it in some other optimal way which doesn't hurt any
> other path, then it will be better.

So, will you write the patch? Either the current implementation or
the approach you're suggesting works for me. If you really want
to change the current one, I'm happy to review it.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Wed, Apr 6, 2016 at 5:04 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Here are few things I have noticed:
>> +   for (i = 0; i < max_wal_senders; i++)
>> +   {
>> +       walsnd = &WalSndCtl->walsnds[i];
>> No volatile pointer to prevent code reordering?
>>
>>   */
>>  typedef struct WalSnd
>>  {
>> +   int     slotno;         /* index of this slot in WalSnd array */
>>     pid_t       pid;            /* this walsender's process id, or 0 */
>> slotno is used nowhere.
>>
>> I'll grab the tests and look at them.
>
> So I had a look at those tests and finished with the attached:
> - patch 1 adds a reload routine to PostgresNode
> - patch 2 the list of tests.

Thanks for updating the patches!

Attached is the refactored version of the patch.

Regards,

--
Fujii Masao

Attachment

Re: Support for N synchronous standby servers - take 2

From
Thomas Munro
Date:
On Wed, Apr 6, 2016 at 8:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Okay, I pushed the patch!
> Many thanks to all involved in the development of this feature!

Hi,

I spotted a couple of places in the documentation that still implied there was only one synchronous standby.  Please see suggested fixes attached.

--
Attachment

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Fri, Apr 8, 2016 at 12:55 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Apr 6, 2016 at 8:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>> Okay, I pushed the patch!
>> Many thanks to all involved in the development of this feature!
>
>
> Hi,
>
> I spotted a couple of places in the documentation that still implied there
> was only one synchronous standby.  Please see suggested fixes attached.

Thanks! Applied.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Thu, Apr 7, 2016 at 11:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 5:04 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Here are few things I have noticed:
>>> +   for (i = 0; i < max_wal_senders; i++)
>>> +   {
>>> +       walsnd = &WalSndCtl->walsnds[i];
>>> No volatile pointer to prevent code reordering?
>>>
>>>   */
>>>  typedef struct WalSnd
>>>  {
>>> +   int     slotno;         /* index of this slot in WalSnd array */
>>>     pid_t       pid;            /* this walsender's process id, or 0 */
>>> slotno is used nowhere.
>>>
>>> I'll grab the tests and look at them.
>>
>> So I had a look at those tests and finished with the attached:
>> - patch 1 adds a reload routine to PostgresNode
>> - patch 2 the list of tests.
>
> Thanks for updating the patches!
>
> Attached is the refactored version of the patch.

Thanks. This looks good to me.

.gitattributes complains a bit:
$ git diff n_sync --check
src/test/recovery/t/007_sync_rep.pl:22: trailing whitespace.
+       $self->reload;
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Fri, Apr 8, 2016 at 2:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Apr 7, 2016 at 11:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Wed, Apr 6, 2016 at 5:04 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>> Here are few things I have noticed:
>>>> +   for (i = 0; i < max_wal_senders; i++)
>>>> +   {
>>>> +       walsnd = &WalSndCtl->walsnds[i];
>>>> No volatile pointer to prevent code reordering?
>>>>
>>>>   */
>>>>  typedef struct WalSnd
>>>>  {
>>>> +   int     slotno;         /* index of this slot in WalSnd array */
>>>>     pid_t       pid;            /* this walsender's process id, or 0 */
>>>> slotno is used nowhere.
>>>>
>>>> I'll grab the tests and look at them.
>>>
>>> So I had a look at those tests and finished with the attached:
>>> - patch 1 adds a reload routine to PostgresNode
>>> - patch 2 the list of tests.
>>
>> Thanks for updating the patches!
>>
>> Attached is the refactored version of the patch.
>
> Thanks. This looks good to me.
>
> .gitattributes complains a bit:
> $ git diff n_sync --check
> src/test/recovery/t/007_sync_rep.pl:22: trailing whitespace.
> +       $self->reload;

Thanks for the review! I've finally pushed the patch.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Fri, Apr 8, 2016 at 4:50 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Apr 8, 2016 at 2:26 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Apr 7, 2016 at 11:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> On Wed, Apr 6, 2016 at 5:04 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>> On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier
>>>> <michael.paquier@gmail.com> wrote:
>>>>> Here are few things I have noticed:
>>>>> +   for (i = 0; i < max_wal_senders; i++)
>>>>> +   {
>>>>> +       walsnd = &WalSndCtl->walsnds[i];
>>>>> No volatile pointer to prevent code reordering?
>>>>>
>>>>>   */
>>>>>  typedef struct WalSnd
>>>>>  {
>>>>> +   int     slotno;         /* index of this slot in WalSnd array */
>>>>>     pid_t       pid;            /* this walsender's process id, or 0 */
>>>>> slotno is used nowhere.
>>>>>
>>>>> I'll grab the tests and look at them.
>>>>
>>>> So I had a look at those tests and finished with the attached:
>>>> - patch 1 adds a reload routine to PostgresNode
>>>> - patch 2 the list of tests.
>>>
>>> Thanks for updating the patches!
>>>
>>> Attached is the refactored version of the patch.
>>
>> Thanks. This looks good to me.
>>
>> .gitattributes complains a bit:
>> $ git diff n_sync --check
>> src/test/recovery/t/007_sync_rep.pl:22: trailing whitespace.
>> +       $self->reload;
>
> Thanks for the review! I've finally pushed the patch.
>

Thank you!

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Thu, Apr 7, 2016 at 5:49 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Thu, Apr 7, 2016 at 7:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > So if we go by this each time backend calls pg_stat_get_wal_senders, it
> > needs to do parsing to form SyncRepConfig whether it's changed or not from
> > previous time.  I understand that this is not a performance critical path,
> > but still if we can do it in some other optimal way which doesn't hurt any
> > other path, then it will be better.
>
> So, will you write the patch? Either current implementation or
> the approach you're suggesting works to me. If you really want
> to change the current one, I'm happy to review that.
>

Sorry, I don't have time to complete the patch, but I have written an initial patch to show you what I have in mind; something along these lines should work.  I think with such an approach you don't need to parse s_s_names twice (once in the check_* and once in the syncupdate* function); you can refer to check_temp_tablespaces() and assign_temp_tablespaces() to see how to use the work done by the check_* function in the assign_* function.  Also, right now I have used TopMemoryContext for allocation in assign_synchronous_standby_names; it would be better to use guc_malloc or something similar for allocation, as is done in other check_* and assign_* functions.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Support for N synchronous standby servers - take 2

From
Jeff Janes
Date:
On Wed, Apr 6, 2016 at 1:23 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

> Okay, I pushed the patch!
> Many thanks to all involved in the development of this feature!

Thanks, a nice feature.

When I compile now without cassert, I get the compiler warning:

syncrep.c: In function 'SyncRepUpdateConfig':
syncrep.c:878:6: warning: variable 'parse_rc' set but not used
[-Wunused-but-set-variable]

Cheers,

Jeff



Re: Support for N synchronous standby servers - take 2

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> When I compile now without cassert, I get the compiler warning:

> syncrep.c: In function 'SyncRepUpdateConfig':
> syncrep.c:878:6: warning: variable 'parse_rc' set but not used
> [-Wunused-but-set-variable]

If there's a good reason for that to be an Assert, I don't see it.
There are no callers of SyncRepUpdateConfig that look like they
need to, or should expect not to have to, tolerate errors.
I think the way to fix this is to turn the Assert into a plain
old test-and-ereport-ERROR.
        regards, tom lane
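
For concreteness, here is a sketch of what SyncRepUpdateConfig() looks
like with Tom's suggestion applied; the error code and message wording
are illustrative rather than necessarily the committed text:

void
SyncRepUpdateConfig(void)
{
    int     parse_rc;

    if (!SyncStandbysDefined())
        return;

    /* Run the flex/bison parser over synchronous_standby_names. */
    syncrep_scanner_init(SyncRepStandbyNames);
    parse_rc = syncrep_yyparse();
    syncrep_scanner_finish();

    /* Previously Assert(parse_rc == 0); now an ordinary error. */
    if (parse_rc != 0)
        ereport(ERROR,
                (errcode(ERRCODE_CONFIG_FILE_ERROR),
                 errmsg_internal("synchronous_standby_names parser returned %d",
                                 parse_rc)));

    SyncRepConfig = syncrep_parse_result;
    syncrep_parse_result = NULL;
}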



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Sat, Apr 9, 2016 at 12:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> When I compile now without cassert, I get the compiler warning:
>
>> syncrep.c: In function 'SyncRepUpdateConfig':
>> syncrep.c:878:6: warning: variable 'parse_rc' set but not used
>> [-Wunused-but-set-variable]
>
> If there's a good reason for that to be an Assert, I don't see it.
> There are no callers of SyncRepUpdateConfig that look like they
> need to, or should expect not to have to, tolerate errors.
> I think the way to fix this is to turn the Assert into a plain
> old test-and-ereport-ERROR.
>

I've changed the draft patch Amit implemented so that it doesn't parse
twice (check_hook and assign_hook).
So the assertion that was in assign_hook is no longer necessary.

Please find attached.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Mon, Apr 11, 2016 at 10:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, Apr 9, 2016 at 12:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> When I compile now without cassert, I get the compiler warning:
>>
>>> syncrep.c: In function 'SyncRepUpdateConfig':
>>> syncrep.c:878:6: warning: variable 'parse_rc' set but not used
>>> [-Wunused-but-set-variable]
>>
>> If there's a good reason for that to be an Assert, I don't see it.
>> There are no callers of SyncRepUpdateConfig that look like they
>> need to, or should expect not to have to, tolerate errors.
>> I think the way to fix this is to turn the Assert into a plain
>> old test-and-ereport-ERROR.
>>
>
> I've changed the draft patch Amit implemented so that it doesn't parse
> twice(check_hook and assign_hook).
> So assertion that was in assign_hook is no longer necessary.
>
> Please find attached.

Thanks for the patch!

When I emptied s_s_names, reloaded the configuration file, set it to 'standby1',
and reloaded the configuration file again, the master crashed with
the following error.

*** glibc detected *** postgres: wal sender process postgres [local]
streaming 0/3015F18: munmap_chunk(): invalid pointer:
0x00000000024d9a40 ***
======= Backtrace: =========
*** glibc detected *** postgres: wal sender process postgres [local]
streaming 0/3015F18: munmap_chunk(): invalid pointer:
0x00000000024d9a40 ***
/lib64/libc.so.6[0x3be8e75f4e]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x97dae2]
======= Backtrace: =========
/lib64/libc.so.6[0x3be8e75f4e]
postgres: wal sender process postgres [local] streaming
0/3015F18(set_config_option+0x12cb)[0x982242]
postgres: wal sender process postgres [local] streaming
0/3015F18(SetConfigOption+0x4b)[0x9827ff]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x97dae2]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x988b4e]
postgres: wal sender process postgres [local] streaming
0/3015F18(set_config_option+0x12cb)[0x982242]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x98af40]
postgres: wal sender process postgres [local] streaming
0/3015F18(SetConfigOption+0x4b)[0x9827ff]
postgres: wal sender process postgres [local] streaming
0/3015F18(ProcessConfigFile+0x9f)[0x98a98b]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x988b4e]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x98af40]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x7b50fd]
postgres: wal sender process postgres [local] streaming
0/3015F18(ProcessConfigFile+0x9f)[0x98a98b]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x7b359c]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x7b50fd]
postgres: wal sender process postgres [local] streaming
0/3015F18(exec_replication_command+0x1a7)[0x7b47b6]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x7b359c]
postgres: wal sender process postgres [local] streaming
0/3015F18(PostgresMain+0x772)[0x8141b6]
postgres: wal sender process postgres [local] streaming
0/3015F18(exec_replication_command+0x1a7)[0x7b47b6]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x7896f7]
postgres: wal sender process postgres [local] streaming
0/3015F18(PostgresMain+0x772)[0x8141b6]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x788e62]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x7896f7]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x785544]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x788e62]
postgres: wal sender process postgres [local] streaming
0/3015F18(PostmasterMain+0x1134)[0x784c08]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x785544]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x6ce12e]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3be8e1ed5d]
postgres: wal sender process postgres [local] streaming
0/3015F18(PostmasterMain+0x1134)[0x784c08]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x467e99]

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Sat, Apr 9, 2016 at 12:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> When I compile now without cassert, I get the compiler warning:
>
>> syncrep.c: In function 'SyncRepUpdateConfig':
>> syncrep.c:878:6: warning: variable 'parse_rc' set but not used
>> [-Wunused-but-set-variable]

Thanks for the report!

> If there's a good reason for that to be an Assert, I don't see it.
> There are no callers of SyncRepUpdateConfig that look like they
> need to, or should expect not to have to, tolerate errors.
> I think the way to fix this is to turn the Assert into a plain
> old test-and-ereport-ERROR.

Okay, I pushed that change. Thanks for the suggestion!

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Mon, Apr 11, 2016 at 1:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Apr 11, 2016 at 10:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Sat, Apr 9, 2016 at 12:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Jeff Janes <jeff.janes@gmail.com> writes:
>>>> When I compile now without cassert, I get the compiler warning:
>>>
>>>> syncrep.c: In function 'SyncRepUpdateConfig':
>>>> syncrep.c:878:6: warning: variable 'parse_rc' set but not used
>>>> [-Wunused-but-set-variable]
>>>
>>> If there's a good reason for that to be an Assert, I don't see it.
>>> There are no callers of SyncRepUpdateConfig that look like they
>>> need to, or should expect not to have to, tolerate errors.
>>> I think the way to fix this is to turn the Assert into a plain
>>> old test-and-ereport-ERROR.
>>>
>>
>> I've changed the draft patch Amit implemented so that it doesn't parse
>> twice(check_hook and assign_hook).
>> So assertion that was in assign_hook is no longer necessary.
>>
>> Please find attached.
>
> Thanks for the patch!
>
> When I emptied s_s_names, reloaded the configration file, set it to 'standby1'
> and reloaded the configuration file again, the master crashed with
> the following error.
>
> [ glibc backtrace trimmed; see the original report above ]
>

Thank you for reviewing.

SyncRepUpdateConfig() seems to be no longer necessary.
Attached is an updated version.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Mon, Apr 11, 2016 at 5:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Apr 11, 2016 at 1:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Mon, Apr 11, 2016 at 10:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Sat, Apr 9, 2016 at 12:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> Jeff Janes <jeff.janes@gmail.com> writes:
>>>>> When I compile now without cassert, I get the compiler warning:
>>>>
>>>>> syncrep.c: In function 'SyncRepUpdateConfig':
>>>>> syncrep.c:878:6: warning: variable 'parse_rc' set but not used
>>>>> [-Wunused-but-set-variable]
>>>>
>>>> If there's a good reason for that to be an Assert, I don't see it.
>>>> There are no callers of SyncRepUpdateConfig that look like they
>>>> need to, or should expect not to have to, tolerate errors.
>>>> I think the way to fix this is to turn the Assert into a plain
>>>> old test-and-ereport-ERROR.
>>>>
>>>
>>> I've changed the draft patch Amit implemented so that it doesn't parse
>>> twice(check_hook and assign_hook).
>>> So assertion that was in assign_hook is no longer necessary.
>>>
>>> Please find attached.
>>
>> Thanks for the patch!
>>
>> When I emptied s_s_names, reloaded the configration file, set it to 'standby1'
>> and reloaded the configuration file again, the master crashed with
>> the following error.
>>
>> [ glibc backtrace trimmed; see the original report above ]
>>
>
> Thank you for reviewing.
>
> SyncRepUpdateConfig() seems to be no longer necessary.

Really? I was thinking that something like that function needs to
be called at the beginning of a backend and walsender in
EXEC_BACKEND case. No?

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Mon, Apr 11, 2016 at 8:47 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Apr 11, 2016 at 5:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Mon, Apr 11, 2016 at 1:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> On Mon, Apr 11, 2016 at 10:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> On Sat, Apr 9, 2016 at 12:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>>> Jeff Janes <jeff.janes@gmail.com> writes:
>>>>>> When I compile now without cassert, I get the compiler warning:
>>>>>
>>>>>> syncrep.c: In function 'SyncRepUpdateConfig':
>>>>>> syncrep.c:878:6: warning: variable 'parse_rc' set but not used
>>>>>> [-Wunused-but-set-variable]
>>>>>
>>>>> If there's a good reason for that to be an Assert, I don't see it.
>>>>> There are no callers of SyncRepUpdateConfig that look like they
>>>>> need to, or should expect not to have to, tolerate errors.
>>>>> I think the way to fix this is to turn the Assert into a plain
>>>>> old test-and-ereport-ERROR.
>>>>>
>>>>
>>>> I've changed the draft patch Amit implemented so that it doesn't parse
>>>> twice(check_hook and assign_hook).
>>>> So assertion that was in assign_hook is no longer necessary.
>>>>
>>>> Please find attached.
>>>
>>> Thanks for the patch!
>>>
>>> When I emptied s_s_names, reloaded the configration file, set it to 'standby1'
>>> and reloaded the configuration file again, the master crashed with
>>> the following error.
>>>
>>> [ glibc backtrace trimmed; see the original report above ]
>>>
>>
>> Thank you for reviewing.
>>
>> SyncRepUpdateConfig() seems to be no longer necessary.
>
> Really? I was thinking that something like that function needs to
> be called at the beginning of a backend and walsender in
> EXEC_BACKEND case. No?
>

Hmm, in the EXEC_BACKEND case, I guess that each child process calls
read_nondefault_variables, which parses and validates these
configuration parameters in SubPostmasterMain.

The previous patch didn't apply to HEAD cleanly; attached is an updated version.

Regards,

--
Masahiko Sawada

Attachment

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Tue, Apr 12, 2016 at 9:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Apr 11, 2016 at 8:47 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Mon, Apr 11, 2016 at 5:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Mon, Apr 11, 2016 at 1:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>> On Mon, Apr 11, 2016 at 10:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>> On Sat, Apr 9, 2016 at 12:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>>>> Jeff Janes <jeff.janes@gmail.com> writes:
>>>>>>> When I compile now without cassert, I get the compiler warning:
>>>>>>
>>>>>>> syncrep.c: In function 'SyncRepUpdateConfig':
>>>>>>> syncrep.c:878:6: warning: variable 'parse_rc' set but not used
>>>>>>> [-Wunused-but-set-variable]
>>>>>>
>>>>>> If there's a good reason for that to be an Assert, I don't see it.
>>>>>> There are no callers of SyncRepUpdateConfig that look like they
>>>>>> need to, or should expect not to have to, tolerate errors.
>>>>>> I think the way to fix this is to turn the Assert into a plain
>>>>>> old test-and-ereport-ERROR.
>>>>>>
>>>>>
>>>>> I've changed the draft patch Amit implemented so that it doesn't parse
>>>>> twice(check_hook and assign_hook).
>>>>> So assertion that was in assign_hook is no longer necessary.
>>>>>
>>>>> Please find attached.
>>>>
>>>> Thanks for the patch!
>>>>
>>>> When I emptied s_s_names, reloaded the configration file, set it to 'standby1'
>>>> and reloaded the configuration file again, the master crashed with
>>>> the following error.
>>>>
>>>> [ glibc backtrace trimmed; see the original report above ]
>>>>
>>>
>>> Thank you for reviewing.
>>>
>>> SyncRepUpdateConfig() seems to be no longer necessary.
>>
>> Really? I was thinking that something like that function needs to
>> be called at the beginning of a backend and walsender in
>> EXEC_BACKEND case. No?
>>
>
> Hmm, in EXEC_BACKEND case, I guess that each child process calls
> read_nondefault_variables that parses and validates these
> configuration parameters in SubPostmasterMain.

SyncRepStandbyNames is passed but SyncRepConfig is not, I think.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Wed, 13 Apr 2016 04:43:35 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwEmZhBdjb1x3+KtUU9VV5xnhgCBO4TejibOXF_VHaeVXg@mail.gmail.com>
> >>> Thank you for reviewing.
> >>>
> >>> SyncRepUpdateConfig() seems to be no longer necessary.
> >>
> >> Really? I was thinking that something like that function needs to
> >> be called at the beginning of a backend and walsender in
> >> EXEC_BACKEND case. No?
> >>
> >
> > Hmm, in EXEC_BACKEND case, I guess that each child process calls
> > read_nondefault_variables that parses and validates these
> > configuration parameters in SubPostmasterMain.
> 
> SyncRepStandbyNames is passed but SyncRepConfig is not, I think.

SyncRepStandbyNames is passed to exec'ed backends by
read_nondefault_variables, which calls set_config_option, which
calls check/assign_s_s_names then syncrep_yyparse, which sets
SyncRepConfig.

Since a guessing battle is a waste of time, I actually built and ran
this on Windows 7 and observed that SyncRepConfig has been set before
WalSndLoop starts.

> LOG:  check_s_s_names(pid=20596, newval=)
> LOG:  assign_s_s_names(pid=20596, newval=, SyncRepConfig=00000000)
> LOG:  read_nondefault_variables(pid=20596)
> LOG:  set_config_option(synchronous_standby_names)(pid=20596)
> LOG:  check_s_s_names(pid=20596, newval=2[standby,sby2,sby3])
> LOG:  assign_s_s_names(pid=20596, newval=2[standby,sby2,sby3], SyncRepConfig=01383598)
> LOG:  WalSndLoop(pid=20596)

By the way, the patch assumes that one check_s_s_names call is
followed by exactly one assign_s_s_names call. I suppose that myextra
should be handled without such an assumption.

Plus, myextra should be given a saner name..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
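
A minimal sketch of the check/assign pattern under discussion, modeled
on check_temp_tablespaces()/assign_temp_tablespaces(); parse_s_s_names()
is a hypothetical stand-in for the real flex/bison parser, and this is
not the committed code:

static bool
check_synchronous_standby_names(char **newval, void **extra, GucSource source)
{
    SyncRepConfigData *parsed;

    parsed = parse_s_s_names(*newval);  /* hypothetical parser wrapper */
    if (parsed == NULL)
        return false;           /* guc.c reports the invalid value */

    /* "extra" must be a single malloc'd block; guc.c free()s it later */
    *extra = parsed;
    return true;
}

static void
assign_synchronous_standby_names(const char *newval, void *extra)
{
    /*
     * Adopt whatever the matching check hook produced.  guc.c owns the
     * lifetime of "extra" and frees the previous value itself, so no
     * check/assign pairing bookkeeping is needed here.
     */
    SyncRepConfig = (SyncRepConfigData *) extra;
}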





Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Wed, Apr 13, 2016 at 1:44 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> At Wed, 13 Apr 2016 04:43:35 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwEmZhBdjb1x3+KtUU9VV5xnhgCBO4TejibOXF_VHaeVXg@mail.gmail.com>
> > >>> Thank you for reviewing.
> > >>>
> > >>> SyncRepUpdateConfig() seems to be no longer necessary.
> > >>
> > >> Really? I was thinking that something like that function needs to
> > >> be called at the beginning of a backend and walsender in
> > >> EXEC_BACKEND case. No?
> > >>
> > >
> > > Hmm, in EXEC_BACKEND case, I guess that each child process calls
> > > read_nondefault_variables that parses and validates these
> > > configuration parameters in SubPostmasterMain.
> >
> > SyncRepStandbyNames is passed but SyncRepConfig is not, I think.
>
> SyncRepStandbyNames is passed to exec'ed backends by
> read_nondefault_variables, which calls set_config_option, which
> calls check/assign_s_s_names then syncrep_yyparse, which sets
> SyncRepConfig.
>
> Since guess battle is a waste of time, I actually built and ran
> on Windows7 and observed that SyncRepConfig has been set before
> WalSndLoop starts.
>

Yes, this is what I was trying to explain to Fujii-san upthread, and I have also verified that the same works on Windows.  I think one point which we should try to ensure in this patch is whether it is good to use TopMemoryContext to allocate the memory in the check or assign function, or whether we should allocate some temporary context (like we do in load_tzoffsets()) to perform the parsing and then delete that context at the end.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Fujii Masao
Date:
On Thu, Apr 14, 2016 at 11:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 13, 2016 at 1:44 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>
>> At Wed, 13 Apr 2016 04:43:35 +0900, Fujii Masao <masao.fujii@gmail.com>
>> wrote in
>> <CAHGQGwEmZhBdjb1x3+KtUU9VV5xnhgCBO4TejibOXF_VHaeVXg@mail.gmail.com>
>> > >>> Thank you for reviewing.
>> > >>>
>> > >>> SyncRepUpdateConfig() seems to be no longer necessary.
>> > >>
>> > >> Really? I was thinking that something like that function needs to
>> > >> be called at the beginning of a backend and walsender in
>> > >> EXEC_BACKEND case. No?
>> > >>
>> > >
>> > > Hmm, in EXEC_BACKEND case, I guess that each child process calls
>> > > read_nondefault_variables that parses and validates these
>> > > configuration parameters in SubPostmasterMain.
>> >
>> > SyncRepStandbyNames is passed but SyncRepConfig is not, I think.
>>
>> SyncRepStandbyNames is passed to exec'ed backends by
>> read_nondefault_variables, which calls set_config_option, which
>> calls check/assign_s_s_names then syncrep_yyparse, which sets
>> SyncRepConfig.
>>
>> Since guess battle is a waste of time, I actually built and ran
>> on Windows7 and observed that SyncRepConfig has been set before
>> WalSndLoop starts.
>>
>
> Yes, this is what I was trying to explain to Fujii-san upthread and I have
> also verified that the same works on Windows.

Oh, okay, understood. Thanks for explaining that!

> I think one point which we
> should try to ensure in this patch is whether it is good to use
> TopMemoryContext to allocate the memory in the check or assign function or
> should we allocate some temporary context (like we do in load_tzoffsets())
> to perform parsing and then delete the same at end.

Seems yes, if some memory is allocated by palloc and not free'd
while parsing s_s_names.

Here is another comment on the patch.

-SyncRepFreeConfig(SyncRepConfigData *config)
+SyncRepFreeConfig(SyncRepConfigData *config, bool itself)

SyncRepFreeConfig() was extended so that it accepts the second boolean
argument. But it's always called with the second argument = false. So,
I just wonder why that second argument is required.

    SyncRepConfigData *config =
-        (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
+        (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));

Why should we use malloc instead of palloc here?

*If* we use malloc, its return value must be checked.

Regards,

-- 
Fujii Masao



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Thu, 14 Apr 2016 12:42:06 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwH7F5gWfdCT71Ucix_w+8ipR1Owzv9k4VnA1fcMYyfr6w@mail.gmail.com>
> > Yes, this is what I was trying to explain to Fujii-san upthread and I have
> > also verified that the same works on Windows.
> 
> Oh, okay, understood. Thanks for explaining that!
> 
> > I think one point which we
> > should try to ensure in this patch is whether it is good to use
> > TopMemoryContext to allocate the memory in the check or assign function or
> > should we allocate some temporary context (like we do in load_tzoffsets())
> > to perform parsing and then delete the same at end.
> 
> Seems yes if some memories are allocated by palloc and they are not
> free'd while parsing s_s_names.
> 
> Here are another comment for the patch.
> 
> -SyncRepFreeConfig(SyncRepConfigData *config)
> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself)
> 
> SyncRepFreeConfig() was extended so that it accepts the second boolean
> argument. But it's always called with the second argument = false. So,
> I just wonder why that second argument is required.
> 
>     SyncRepConfigData *config =
> -        (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
> +        (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));
> 
> Why should we use malloc instead of palloc here?
> 
> *If* we use malloc, its return value must be checked.

Because it should live independently of any memory context, as GUC
values do. guc.c provides guc_malloc for this purpose, which is a
malloc with some simple error handling, so having a walsender_malloc
would be reasonable.

I don't think it's good to use TopMemoryContext for the syncrep
parser. syncrep_scanner.l uses palloc, and this basically causes a
memory leak in all postgres processes.

It might be better if the parser worked in the current memory
context and the caller copied the result into malloc'ed memory. But
some list-creation functions use palloc.. Changing
SyncRepConfigData.members to be char** would be messier..

Any ideas?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
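
A sketch of such a wrapper, following the shape of guc_malloc() in
guc.c; walsender_malloc is only the name suggested above, not an
existing function:

static void *
walsender_malloc(int elevel, Size size)
{
    void       *data;

    data = malloc(size);
    if (data == NULL)
        ereport(elevel,
                (errcode(ERRCODE_OUT_OF_MEMORY),
                 errmsg("out of memory")));
    return data;
}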





Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Thu, Apr 14, 2016 at 11:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yes, this is what I was trying to explain to Fujii-san upthread and I have
> also verified that the same works on Windows.

If you could, it would be nice as well to check that nothing breaks
with VS when using vcregress recoverycheck.
--
Michael



Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Thu, Apr 14, 2016 at 1:11 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Thu, 14 Apr 2016 12:42:06 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
> <CAHGQGwH7F5gWfdCT71Ucix_w+8ipR1Owzv9k4VnA1fcMYyfr6w@mail.gmail.com>
>> > Yes, this is what I was trying to explain to Fujii-san upthread and I have
>> > also verified that the same works on Windows.
>>
>> Oh, okay, understood. Thanks for explaining that!
>>
>> > I think one point which we
>> > should try to ensure in this patch is whether it is good to use
>> > TopMemoryContext to allocate the memory in the check or assign function or
>> > should we allocate some temporary context (like we do in load_tzoffsets())
>> > to perform parsing and then delete the same at end.
>>
>> Seems yes if some memories are allocated by palloc and they are not
>> free'd while parsing s_s_names.
>>
>> Here are another comment for the patch.
>>
>> -SyncRepFreeConfig(SyncRepConfigData *config)
>> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself)
>>
>> SyncRepFreeConfig() was extended so that it accepts the second boolean
>> argument. But it's always called with the second argument = false. So,
>> I just wonder why that second argument is required.
>>
>>     SyncRepConfigData *config =
>> -        (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
>> +        (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));
>>
>> Why should we use malloc instead of palloc here?
>>
>> *If* we use malloc, its return value must be checked.
>
> Because it should live irrelevant to any memory context, as guc
> values are so. guc.c provides guc_malloc for this purpose, which
> is a malloc having some simple error handling, so having
> walsender_malloc would be reasonable.
>
> I don't think it's good to use TopMemoryContext for syncrep
> parser. syncrep_scanner.l uses palloc. This basically causes a
> memory leak on all postgres processes.
>
> It might be better if the parser works on the current memory
> context and the caller copies the result on the malloc'ed
> memory. But some list-creation functions using palloc.. Changing
> SyncRepConfigData.members to be char** would be messier..

The SyncRepGetSyncStandby logic deeply assumes that the sync standby
names are constructed as a list.
I think that it would entail a radical change in SyncRepGetSyncStandby.
Another idea is to prepare some functions that allocate/free list
elements using malloc and free.

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Thu, 14 Apr 2016 13:24:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqThcdv+CrWyWbFQGYL0GJFZeWVGXs5K9x65WWgbqkJ7YQ@mail.gmail.com>
> On Thu, Apr 14, 2016 at 11:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yes, this is what I was trying to explain to Fujii-san upthread and I have
> > also verified that the same works on Windows.
> 
> If you could, it would be nice as well to check that nothing breaks
> with VS when using vcregress recoverycheck.

I failed to run the test because my environment was not prepared
for TAP tests. But in the process, I noticed that vcregress.pl
shows a somewhat wrong help message.

> >vcregress
> Usage: vcregress.pl <check|installcheck|plcheck|contribcheck|isolationcheck|ecpgcheck|upgradecheck> [schedule]

The new message in the following diff is the same as the regexp
that checks the parameter of vcregress.pl.

======
diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl
index 3d14544..08e2acc 100644
--- a/src/tools/msvc/vcregress.pl
+++ b/src/tools/msvc/vcregress.pl
@@ -548,6 +548,6 @@ sub usage
 {
 	print STDERR
 	  "Usage: vcregress.pl ",
-"<check|installcheck|plcheck|contribcheck|isolationcheck|ecpgcheck|upgradecheck> [schedule]\n";
+"<check|installcheck|plcheck|contribcheck|modulescheck|ecpgcheck|isolationcheck|upgradecheck|bincheck|recoverycheck> [schedule]\n";
 	exit(1);
 }
 
=====

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Thu, 14 Apr 2016 17:25:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20160414.172539.34325458.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello,
> 
> At Thu, 14 Apr 2016 13:24:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
> <CAB7nPqThcdv+CrWyWbFQGYL0GJFZeWVGXs5K9x65WWgbqkJ7YQ@mail.gmail.com>
> > On Thu, Apr 14, 2016 at 11:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Yes, this is what I was trying to explain to Fujii-san upthread and I have
> > > also verified that the same works on Windows.
> > 
> > If you could, it would be nice as well to check that nothing breaks
> > with VS when using vcregress recoverycheck.

IPC::Run is not installed in Active Perl in my environment, and
ActiveState seems to be saying that IPC-Run cannot be compiled
on Windows. ppm doesn't show IPC-Run. Is there any way to run
the TAP tests other than this?

https://code.activestate.com/ppm/IPC-Run/

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Thu, Apr 14, 2016 at 5:25 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl
> index 3d14544..08e2acc 100644
> --- a/src/tools/msvc/vcregress.pl
> +++ b/src/tools/msvc/vcregress.pl
> @@ -548,6 +548,6 @@ sub usage
>  {
>         print STDERR
>           "Usage: vcregress.pl ",
> -"<check|installcheck|plcheck|contribcheck|isolationcheck|ecpgcheck|upgradecheck> [schedule]\n";
> +"<check|installcheck|plcheck|contribcheck|modulescheck|ecpgcheck|isolationcheck|upgradecheck|bincheck|recoverycheck>
[schedule]\n";
>         exit(1);
>  }

Right, this is missing modulescheck, bincheck and recoverycheck. All 3
are actually mainly my fault, or perhaps Andrew scored once on
bincheck. Honestly, this is unreadable and it's always tiring to
decipher, so why not change it to something more explicit like the
attached? See for yourself:
$ perl vcregress.pl
Usage: vcregress.pl <mode> [ <schedule> ]

Options for <mode>:
  bincheck       run tests of utilities in src/bin/
  check          deploy instance and run regression tests on it
  contribcheck   run tests of modules in contrib/
  ecpgcheck      run regression tests of ECPG driver
  installcheck   run regression tests on existing instance
  isolationcheck run isolation tests
  modulescheck   run tests of modules in src/test/modules
  plcheck        run tests of PL languages
  recoverycheck  run recovery test suite
  upgradecheck   run tests of pg_upgrade

Options for <schedule>:
  serial         serial mode
  parallel       parallel mode
--
Michael

Attachment

Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Thu, Apr 14, 2016 at 5:48 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Thu, 14 Apr 2016 17:25:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
> <20160414.172539.34325458.horiguchi.kyotaro@lab.ntt.co.jp>
>> Hello,
>>
>> At Thu, 14 Apr 2016 13:24:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqThcdv+CrWyWbFQGYL0GJFZeWVGXs5K9x65WWgbqkJ7YQ@mail.gmail.com>
>> > On Thu, Apr 14, 2016 at 11:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > > Yes, this is what I was trying to explain to Fujii-san upthread and I have
>> > > also verified that the same works on Windows.
>> >
>> > If you could, it would be nice as well to check that nothing breaks
>> > with VS when using vcregress recoverycheck.
>
> IPC::Run is not installed on Active Perl on my environment and
> Active state seems to be saying that IPC-Run cannot be compiled
> on Windows. ppm doesn't show IPC-Run. Is there any means to do
> TAP test other than this way?
>
> https://code.activestate.com/ppm/IPC-Run/

IPC::Run is a mandatory dependency I am afraid. You could just
download it from cpan and install it manually in your PERL5LIB path.
That's what I did, and it proves to work just fine.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Thu, Apr 14, 2016 at 1:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Apr 14, 2016 at 1:11 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Thu, 14 Apr 2016 12:42:06 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwH7F5gWfdCT71Ucix_w+8ipR1Owzv9k4VnA1fcMYyfr6w@mail.gmail.com>
> >> > Yes, this is what I was trying to explain to Fujii-san upthread and I have
> >> > also verified that the same works on Windows.
> >>
> >> Oh, okay, understood. Thanks for explaining that!
> >>
> >> > I think one point which we
> >> > should try to ensure in this patch is whether it is good to use
> >> > TopMemoryContext to allocate the memory in the check or assign function or
> >> > should we allocate some temporary context (like we do in load_tzoffsets())
> >> > to perform parsing and then delete the same at end.
> >>
> >> Seems yes if some memories are allocated by palloc and they are not
> >> free'd while parsing s_s_names.
> >>
> >> Here are another comment for the patch.
> >>
> >> -SyncRepFreeConfig(SyncRepConfigData *config)
> >> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself)
> >>
> >> SyncRepFreeConfig() was extended so that it accepts the second boolean
> >> argument. But it's always called with the second argument = false. So,
> >> I just wonder why that second argument is required.
> >>
> >>     SyncRepConfigData *config =
> >> -        (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
> >> +        (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));
> >>
> >> Why should we use malloc instead of palloc here?
> >>
> >> *If* we use malloc, its return value must be checked.
> >
> > Because it should live irrelevant to any memory context, as guc
> > values are so. guc.c provides guc_malloc for this purpose, which
> > is a malloc having some simple error handling, so having
> > walsender_malloc would be reasonable.
> >
> > I don't think it's good to use TopMemoryContext for syncrep
> > parser. syncrep_scanner.l uses palloc. This basically causes a
> > memory leak on all postgres processes.
> >
> > It might be better if the parser works on the current memory
> > context and the caller copies the result on the malloc'ed
> > memory. But some list-creation functions using palloc..

How about if we do all the parsing stuff in a temporary context and then copy the results into TopMemoryContext?  I don't think it will leak in TopMemoryContext, because the next time we try to check/assign s_s_names, it will free the previous result.

 
>
> > Changing SyncRepConfigData.members to be char** would be messier..
>
> SyncRepGetSyncStandby logic assumes deeply that the sync standby names
> are constructed as a list.
> I think that it would entail a radical change in SyncRepGetStandby
> Another idea is to prepare the some functions that allocate/free
> element of list using by malloc, free.
>

Yeah, that could be another way of doing it, but seems like much more work.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
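
A sketch of Amit's suggestion, with hypothetical helper names: run the
palloc-based parser inside a throwaway memory context, deep-copy the
result into long-lived storage, then drop the context so none of the
parser's scratch allocations can leak:

static SyncRepConfigData *
parse_s_s_names_in_temp_context(const char *s_s_names)
{
    MemoryContext parse_cxt;
    MemoryContext old_cxt;
    SyncRepConfigData *result = NULL;

    parse_cxt = AllocSetContextCreate(CurrentMemoryContext,
                                      "syncrep parse",
                                      ALLOCSET_DEFAULT_MINSIZE,
                                      ALLOCSET_DEFAULT_INITSIZE,
                                      ALLOCSET_DEFAULT_MAXSIZE);
    old_cxt = MemoryContextSwitchTo(parse_cxt);

    syncrep_scanner_init(s_s_names);
    if (syncrep_yyparse() == 0)
    {
        /* copy_syncrep_config() is a hypothetical deep copy into a
         * single long-lived (TopMemoryContext or malloc'd) block */
        result = copy_syncrep_config(syncrep_parse_result);
    }
    syncrep_scanner_finish();

    MemoryContextSwitchTo(old_cxt);
    MemoryContextDelete(parse_cxt);     /* all parser scratch goes away */

    return result;
}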

Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Thu, 14 Apr 2016 21:05:40 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSWLyP5ObQz_9Y=kezi0oGeZHaCPn6FT9BYK9tB3HbiVg@mail.gmail.com>
> > IPC::Run is not installed on Active Perl on my environment and
> > Active state seems to be saying that IPC-Run cannot be compiled
> > on Windows. ppm doesn't show IPC-Run. Is there any means to do
> > TAP test other than this way?
> >
> > https://code.activestate.com/ppm/IPC-Run/
> 
> IPC::Run is a mandatory dependency I am afraid. You could just
> download it from cpan and install it manually in your PERL5LIB path.
> That's what I did, and it proves to work just fine.

Hmm. The first time I got an error that dmake was not found, but
this time I could install it successfully. Thank you for prompting
me to retry.

I confirmed that fix_sync_rep_update_conf_v4.patch doesn't break
anything in vcregress recoverycheck. And I will be able to recheck
revised versions.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Fri, 15 Apr 2016 08:52:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+Qsw2hLEhrEBvveKC91uZQhDce9i-4dB8VPz87Ciz+OQ@mail.gmail.com>
> On Thu, Apr 14, 2016 at 1:10 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> >
> > On Thu, Apr 14, 2016 at 1:11 PM, Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > At Thu, 14 Apr 2016 12:42:06 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
> > > <CAHGQGwH7F5gWfdCT71Ucix_w+8ipR1Owzv9k4VnA1fcMYyfr6w@mail.gmail.com>
> > >> > Yes, this is what I was trying to explain to Fujii-san upthread and I have
> > >> > also verified that the same works on Windows.
> > >>
> > >> Oh, okay, understood. Thanks for explaining that!
> > >>
> > >> > I think one point which we
> > >> > should try to ensure in this patch is whether it is good to use
> > >> > TopMemoryContext to allocate the memory in the check or assign function or
> > >> > should we allocate some temporary context (like we do in load_tzoffsets())
> > >> > to perform parsing and then delete the same at end.
> > >>
> > >> Seems yes if some memories are allocated by palloc and they are not
> > >> free'd while parsing s_s_names.
> > >>
> > >> Here are another comment for the patch.
> > >>
> > >> -SyncRepFreeConfig(SyncRepConfigData *config)
> > >> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself)
> > >>
> > >> SyncRepFreeConfig() was extended so that it accepts the second boolean
> > >> argument. But it's always called with the second argument = false. So,
> > >> I just wonder why that second argument is required.
> > >>
> > >>     SyncRepConfigData *config =
> > >> -        (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
> > >> +        (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));
> > >>
> > >> Why should we use malloc instead of palloc here?
> > >>
> > >> *If* we use malloc, its return value must be checked.
> > >
> > > Because it should live irrelevant to any memory context, as guc
> > > values are so. guc.c provides guc_malloc for this purpose, which
> > > is a malloc having some simple error handling, so having
> > > walsender_malloc would be reasonable.
> > >
> > > I don't think it's good to use TopMemoryContext for syncrep
> > > parser. syncrep_scanner.l uses palloc. This basically causes a
> > > memory leak on all postgres processes.
> > >
> > > It might be better if the parser works on the current memory
> > > context and the caller copies the result on the malloc'ed
> > > memory. But some list-creation functions using palloc..
> 
> How about if we do all the parsing stuff in temporary context and then copy
> the results using TopMemoryContext?  I don't think it will be a leak in
> TopMemoryContext, because next time we try to check/assign s_s_names, it
> will free the previous result.

I agree with you. A temporary context for the parser seems
reasonable. TopMemoryContext is created very early in main() so
palloc on it is effectively the same with malloc.

One problem is that only the top memory block is assumed to be
free()'d, not pfree()'d by guc_set_extra. It makes this quite
ugly..

Maybe we shouldn't use the extra for this purpose.

Thoughts?

> > Changing
> > > SyncRepConfigData.members to be char** would be messier..
> >
> > SyncRepGetSyncStandby logic assumes deeply that the sync standby names
> > are constructed as a list.
> > I think that it would entail a radical change in SyncRepGetStandby
> > Another idea is to prepare the some functions that allocate/free
> > element of list using by malloc, free.
> >
> 
> Yeah, that could be another way of doing it, but seems like much more work.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 3c9142e..3778c94 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -68,6 +68,7 @@
 #include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
+#include "utils/memutils.h"
 #include "utils/ps_status.h"
 
 /* User-settable parameters for sync rep */
@@ -361,11 +362,6 @@ SyncRepInitConfig(void)
 {
     int            priority;
 
-    /* Update the config data of synchronous replication */
-    SyncRepFreeConfig(SyncRepConfig);
-    SyncRepConfig = NULL;
-    SyncRepUpdateConfig();
-
     /*
      * Determine if we are a potential sync standby and remember the result
      * for handling replies from standby.
@@ -868,47 +864,61 @@ SyncRepUpdateSyncStandbysDefined(void)
 }
 
 /*
- * Parse synchronous_standby_names and update the config data
- * of synchronous standbys.
+ * Free a previously-allocated config data of synchronous replication.
  */
 void
-SyncRepUpdateConfig(void)
+SyncRepFreeConfig(SyncRepConfigData *config, bool itself, MemoryContext cxt)
 {
-    int    parse_rc;
+    MemoryContext oldcxt = NULL;
 
-    if (!SyncStandbysDefined())
+    if (!config)
         return;
 
-    /*
-     * check_synchronous_standby_names() verifies the setting value of
-     * synchronous_standby_names before this function is called. So
-     * syncrep_yyparse() must not cause an error here.
-     */
-    syncrep_scanner_init(SyncRepStandbyNames);
-    parse_rc = syncrep_yyparse();
-    syncrep_scanner_finish();
-
-    if (parse_rc != 0)
-        ereport(ERROR,
-                (errcode(ERRCODE_SYNTAX_ERROR),
-                 errmsg_internal("synchronous_standby_names parser returned %d",
-                                 parse_rc)));
-
-    SyncRepConfig = syncrep_parse_result;
-    syncrep_parse_result = NULL;
+    if (cxt)
+        oldcxt = MemoryContextSwitchTo(cxt);
+
+    list_free_deep(config->members);
+
+    if (oldcxt)
+        MemoryContextSwitchTo(oldcxt);
+
+    if (itself)
+        free(config);
 }
 
 /*
- * Free a previously-allocated config data of synchronous replication.
+ * Returns a copy of a replication config data in the specified memory
+ * context. Note that only the top block should be malloc'ed, because it is
+ * assumed to be freed by set_
  */
-void
-SyncRepFreeConfig(SyncRepConfigData *config)
+SyncRepConfigData *
+SyncRepCopyConfig(SyncRepConfigData *oldconfig, MemoryContext targetcxt)
 {
-    if (!config)
-        return;
+    MemoryContext        oldcxt;
+    SyncRepConfigData  *newconfig;
+    ListCell           *lc;
 
-    list_free_deep(config->members);
-    pfree(config);
+    if (!oldconfig)
+        return NULL;
+
+    oldcxt = MemoryContextSwitchTo(targetcxt);
+
+    newconfig = (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));
+    newconfig->num_sync = oldconfig->num_sync;
+    newconfig->members = list_copy(oldconfig->members);
+
+    /*
+     * The new members list is a combination of list cells on new context and
+     * data pointed from each cell on the old context. So we explicitly copy
+     * the data.
+     */
+    foreach (lc, newconfig->members)
+    {
+        lfirst(lc) = pstrdup((char *) lfirst(lc));
+    }
+
+    MemoryContextSwitchTo(oldcxt);
+
+    return newconfig;
 }
 
 #ifdef USE_ASSERT_CHECKING
@@ -959,12 +969,32 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source)
 
     if (*newval != NULL && (*newval)[0] != '\0')
     {
+        MemoryContext oldcxt;
+        MemoryContext repparse_cxt;
+
+        /*
+         * The result of syncrep_yyparse should live for the lifetime of the
+         * process and syncrep_yyparse may abandon a certain amount of
+         * palloc'ed memory blocks. So we provide a temporary memory context
+         * for the playground of syncrep_yyparse and copy the result to
+         * TopMemoryContext.
+         */
+        repparse_cxt = AllocSetContextCreate(CurrentMemoryContext,
+                                             "syncrep parser",
+                                             ALLOCSET_DEFAULT_MINSIZE,
+                                             ALLOCSET_DEFAULT_INITSIZE,
+                                             ALLOCSET_DEFAULT_MAXSIZE);
+        oldcxt = MemoryContextSwitchTo(repparse_cxt);
+
         syncrep_scanner_init(*newval);
         parse_rc = syncrep_yyparse();
         syncrep_scanner_finish();
 
+        MemoryContextSwitchTo(oldcxt);
+
         if (parse_rc != 0)
         {
+            MemoryContextDelete(repparse_cxt);
             GUC_check_errcode(ERRCODE_SYNTAX_ERROR);
             GUC_check_errdetail("synchronous_standby_names parser returned %d",
                                 parse_rc);
@@ -1017,17 +1047,38 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source)
         }
 
         /*
-         * syncrep_yyparse sets the global syncrep_parse_result as side effect.
-         * But this function is required to just check, so frees it
-         * after parsing the parameter.
+         * syncrep_yyparse sets the global syncrep_parse_result.
+         * Save syncrep_parse_result to extra in order to use it in
+         * assign_synchronous_standby_names hook as well.
         */
-        SyncRepFreeConfig(syncrep_parse_result);
+        *extra = (void *) SyncRepCopyConfig(syncrep_parse_result,
+                                            TopMemoryContext);
+        MemoryContextDelete(repparse_cxt);
     }
 
     return true;
 }
 
 void
+assign_synchronous_standby_names(const char *newval, void *extra)
+{
+    SyncRepConfigData *myextra = extra;
+
+    /*
+     * Free members of SyncRepConfig if it already refers somewhere, but
+     * SyncRepConfig itself is freed by set_extra_field. The content of
+     * SyncRepConfig is on TopMemoryContext. See
+     * check_synchronous_standby_names.
+     */
+    if (SyncRepConfig)
+        SyncRepFreeConfig(SyncRepConfig, false, TopMemoryContext);
+
+    SyncRepConfig = myextra;
+
+    return;
+}
+
+void
 assign_synchronous_commit(int newval, void *extra)
 {
     switch (newval)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 81d3d28..20d23d5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2780,23 +2780,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
     MemoryContextSwitchTo(oldcontext);
 
     /*
-     * Allocate and update the config data of synchronous replication,
-     * and then get the currently active synchronous standbys.
+     * Get the currently active synchronous standbys.
      */
-    SyncRepUpdateConfig();
     LWLockAcquire(SyncRepLock, LW_SHARED);
     sync_standbys = SyncRepGetSyncStandbys(NULL);
     LWLockRelease(SyncRepLock);
 
-    /*
-     * Free the previously-allocated config data because a backend
-     * no longer needs it. The next call of this function needs to
-     * allocate and update the config data newly because the setting
-     * of sync replication might be changed between the calls.
-     */
-    SyncRepFreeConfig(SyncRepConfig);
-    SyncRepConfig = NULL;
-
     for (i = 0; i < max_wal_senders; i++)
     {
         WalSnd *walsnd = &WalSndCtl->walsnds[i];
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fb091bc..3ce83bf 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3484,7 +3484,7 @@ static struct config_string ConfigureNamesString[] =
         },
         &SyncRepStandbyNames,
         "",
-        check_synchronous_standby_names, NULL, NULL
+        check_synchronous_standby_names, assign_synchronous_standby_names, NULL
     },
 
     {
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 14b5664..97368f8 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -59,13 +59,16 @@ extern void SyncRepReleaseWaiters(void);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
-extern void SyncRepUpdateConfig(void);
-extern void SyncRepFreeConfig(SyncRepConfigData *config);
+extern void SyncRepFreeConfig(SyncRepConfigData *config, bool itself,
+                              MemoryContext targetcxt);
+extern SyncRepConfigData *SyncRepCopyConfig(SyncRepConfigData *oldconfig,
+                                            MemoryContext targetcxt);
 
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
 
 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 
 /*

Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Fri, Apr 15, 2016 at 3:00 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Fri, 15 Apr 2016 08:52:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+Qsw2hLEhrEBvveKC91uZQhDce9i-4dB8VPz87Ciz+OQ@mail.gmail.com>
>> On Thu, Apr 14, 2016 at 1:10 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>> wrote:
>> >
>> > On Thu, Apr 14, 2016 at 1:11 PM, Kyotaro HORIGUCHI
>> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > > At Thu, 14 Apr 2016 12:42:06 +0900, Fujii Masao <masao.fujii@gmail.com>
>> wrote in <CAHGQGwH7F5gWfdCT71Ucix_w+8ipR1Owzv9k4VnA1fcMYyfr6w@mail.gmail.com
>> >
>> > >> > Yes, this is what I was trying to explain to Fujii-san upthread and
>> I have
>> > >> > also verified that the same works on Windows.
>> > >>
>> > >> Oh, okay, understood. Thanks for explaining that!
>> > >>
>> > >> > I think one point which we
>> > >> > should try to ensure in this patch is whether it is good to use
>> > >> > TopMemoryContext to allocate the memory in the check or assign
>> function or
>> > >> > should we allocate some temporary context (like we do in
>> load_tzoffsets())
>> > >> > to perform parsing and then delete the same at end.
>> > >>
>> > >> Seems yes if some memories are allocated by palloc and they are not
>> > >> free'd while parsing s_s_names.
>> > >>
>> > >> Here are another comment for the patch.
>> > >>
>> > >> -SyncRepFreeConfig(SyncRepConfigData *config)
>> > >> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself)
>> > >>
>> > >> SyncRepFreeConfig() was extended so that it accepts the second boolean
>> > >> argument. But it's always called with the second argument = false. So,
>> > >> I just wonder why that second argument is required.
>> > >>
>> > >>     SyncRepConfigData *config =
>> > >> -        (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
>> > >> +        (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));
>> > >>
>> > >> Why should we use malloc instead of palloc here?
>> > >>
>> > >> *If* we use malloc, its return value must be checked.
>> > >
>> > > Because it should live irrelevant to any memory context, as guc
>> > > values are so. guc.c provides guc_malloc for this purpose, which
>> > > is a malloc having some simple error handling, so having
>> > > walsender_malloc would be reasonable.
>> > >
>> > > I don't think it's good to use TopMemoryContext for syncrep
>> > > parser. syncrep_scanner.l uses palloc. This basically causes a
>> > > memory leak on all postgres processes.
>> > >
>> > > It might be better if the parser works on the current memory
>> > > context and the caller copies the result on the malloc'ed
>> > > memory. But some list-creation functions using palloc..
>>
>> How about if we do all the parsing stuff in temporary context and then copy
>> the results using TopMemoryContext?  I don't think it will be a leak in
>> TopMemoryContext, because next time we try to check/assign s_s_names, it
>> will free the previous result.
>
> I agree with you. A temporary context for the parser seems
> reasonable. TopMemoryContext is created very early in main() so
> palloc on it is effectively the same with malloc.
> One problem is that only the top memory block is assumed to be
> free()'d, not pfree()'d by guc_set_extra. It makes this quite
> ugly..
>
> Maybe we shouldn't use the extra for this purpose.
>
> Thoughts?
>

How about if check_hook just parses parameter in
CurrentMemoryContext (i.e., T_AllocSetContext), and then the
assign_hook copies syncrep_parse_result to TopMemoryContext.
Because syncrep_parse_result is a global variable, these hooks can see it.

Here are some comments.

-SyncRepUpdateConfig(void)
+SyncRepFreeConfig(SyncRepConfigData *config, bool itself, MemoryContext cxt)

Sorry, it's my bad. The 'itself' argument is no longer needed because
SyncRepFreeConfig is called by only one function.

-void
-SyncRepFreeConfig(SyncRepConfigData *config)
+SyncRepConfigData *
+SyncRepCopyConfig(SyncRepConfigData *oldconfig, MemoryContext targetcxt)

I'm not sure the targetcxt argument is necessary.

Regards,

--
Masahiko Sawada
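
A rough sketch of the check/assign split suggested above, under the assumption that the parser leaves its palloc'd result in the global pointer; parse_synchronous_standby_names() and copy_parse_result() are hypothetical stand-ins, not functions from the patch:

/* Hypothetical stand-ins for the real parser and a deep copy. */
extern bool parse_synchronous_standby_names(const char *value);
extern SyncRepConfigData *copy_parse_result(SyncRepConfigData *src);
extern void SyncRepFreeConfig(SyncRepConfigData *config);

extern SyncRepConfigData *syncrep_parse_result;  /* set by the parser */
extern SyncRepConfigData *SyncRepConfig;         /* long-lived config */

bool
check_hook_sketch(char **newval, void **extra, GucSource source)
{
    /* Parse in CurrentMemoryContext; this hook only decides validity. */
    if (!parse_synchronous_standby_names(*newval))
        return false;       /* details reported via GUC_check_errdetail() */
    return true;
}

void
assign_hook_sketch(const char *newval, void *extra)
{
    MemoryContext oldcxt;

    /* Replace the previous long-lived copy with the fresh parse result. */
    if (SyncRepConfig)
        SyncRepFreeConfig(SyncRepConfig);

    oldcxt = MemoryContextSwitchTo(TopMemoryContext);
    SyncRepConfig = copy_parse_result(syncrep_parse_result);
    MemoryContextSwitchTo(oldcxt);
}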



Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Fri, Apr 15, 2016 at 11:30 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> At Fri, 15 Apr 2016 08:52:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote :
> >
> > How about if we do all the parsing stuff in temporary context and then copy
> > the results using TopMemoryContext?  I don't think it will be a leak in
> > TopMemoryContext, because next time we try to check/assign s_s_names, it
> > will free the previous result.
>
> I agree with you. A temporary context for the parser seems
> reasonable. TopMemoryContext is created very early in main() so
> palloc on it is effectively the same with malloc.
>
> One problem is that only the top memory block is assumed to be
> free()'d, not pfree()'d by guc_set_extra. It makes this quite
> ugly..
>

+ newconfig = (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));

Is there a reason to use malloc here, can't we use palloc directly?  Also for both the functions SyncRepCopyConfig() and SyncRepFreeConfig(), if we directly use TopMemoryContext inside the function (if required) rather than taking it as an argument, then it will simplify the code a lot.

+SyncRepFreeConfig(SyncRepConfigData *config, bool itself, MemoryContext cxt)

Do we really need 'bool itself' parameter in above function?

+ if (cxt)
+     oldcxt = MemoryContextSwitchTo(cxt);
+ list_free_deep(config->members);
+
+ if (oldcxt)
+     MemoryContextSwitchTo(oldcxt);

Why do you need MemoryContextSwitchTo for freeing members?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
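
Background on the last question: pfree() and list_free_deep() do not depend on CurrentMemoryContext, because every palloc'd chunk carries a header identifying its owning context. A tiny self-contained sketch:

static void
demo_pfree_ignores_current_context(void)
{
    MemoryContext tmp;
    MemoryContext old;
    char *name;

    tmp = AllocSetContextCreate(TopMemoryContext,
                                "demo context",
                                ALLOCSET_DEFAULT_MINSIZE,
                                ALLOCSET_DEFAULT_INITSIZE,
                                ALLOCSET_DEFAULT_MAXSIZE);
    old = MemoryContextSwitchTo(tmp);
    name = pstrdup("standby_a");        /* chunk owned by tmp */
    MemoryContextSwitchTo(old);

    /*
     * No context switch needed here: pfree() finds the owning context
     * from the chunk header, regardless of what CurrentMemoryContext
     * happens to be.
     */
    pfree(name);

    MemoryContextDelete(tmp);
}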

Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Sat, 16 Apr 2016 12:50:30 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1LzC=6-EEVuCZhoYnKDHSqKUptV6F+5SavSR5P6jHdfXw@mail.gmail.com>
> On Fri, Apr 15, 2016 at 11:30 AM, Kyotaro HORIGUCHI <
> horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >
> > At Fri, 15 Apr 2016 08:52:56 +0530, Amit Kapila <amit.kapila16@gmail.com>
> wrote :
> > >
> > > How about if we do all the parsing stuff in temporary context and then
> copy
> > > the results using TopMemoryContext?  I don't think it will be a leak in
> > > TopMemoryContext, because next time we try to check/assign s_s_names, it
> > > will free the previous result.
> >
> > I agree with you. A temporary context for the parser seems
> > reasonable. TopMemoryContext is created very early in main() so
> > palloc on it is effectively the same with malloc.
> >
> > One problem is that only the top memory block is assumed to be
> > free()'d, not pfree()'d by guc_set_extra. It makes this quite
> > ugly..
> >
> 
> + newconfig = (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));
> Is there a reason to use malloc here, can't we use palloc directly?

The reason is that the memory block is to be released using free() in
guc_extra_field (not guc_set_extra). Even if we allocate and
deallocate it using palloc/pfree, the 'extra' pointer to the
block in gconf cannot be set to NULL there, so guc_extra_field tries
to free it again using free(), and then bang.

> Also
> for both the functions SyncRepCopyConfig() and SyncRepFreeConfig(), if we
> directly use TopMemoryContext inside the function (if required) rather than
> taking it as argument, then it will simplify the code a lot.

Either is fine. I added the parameter to emphasize where the
memory block is placed (somewhere other than the current memory
context or the bare heap) rather than for any practical reason.

> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself, MemoryContext
> cxt)
> 
> Do we really need 'bool itself' parameter in above function?
> 
> + if (cxt)
> +     oldcxt = MemoryContextSwitchTo(cxt);
> + list_free_deep(config->members);
> +
> + if (oldcxt)
> +     MemoryContextSwitchTo(oldcxt);
> Why do you need MemoryContextSwitchTo for freeing members?

Ah, sorry. It's just a slip of my fingers.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
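
In other words, the constraint comes from guc.c's ownership rule: the pointer a check hook stores through *extra is later released with plain free(), so it must be a single malloc'd block. A simplified sketch of that rule (modeled on guc.c's set_extra_field(), not verbatim):

/*
 * Sketch of why the extra blob must be malloc'd: guc.c releases it with
 * free() when a new value replaces it.
 */
static void
set_extra_field_sketch(void **field, void *newval)
{
    if (*field != NULL)
        free(*field);       /* plain free(), never pfree() */
    *field = newval;
}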





Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Fri, 15 Apr 2016 17:36:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCOL6BCC+FWNCZH_XPgtWc_otnvShMx6_uAcU7Bwb16Rw@mail.gmail.com>
> >> How about if we do all the parsing stuff in temporary context and then copy
> >> the results using TopMemoryContext?  I don't think it will be a leak in
> >> TopMemoryContext, because next time we try to check/assign s_s_names, it
> >> will free the previous result.
> >
> > I agree with you. A temporary context for the parser seems
> > reasonable. TopMemoryContext is created very early in main() so
> > palloc on it is effectively the same with malloc.
> > One problem is that only the top memory block is assumed to be
> > free()'d, not pfree()'d by guc_set_extra. It makes this quite
> > ugly..
> >
> > Maybe we shouldn't use the extra for this purpose.
> >
> > Thoughts?
> >
> 
> How about if check_hook just parses parameter in
> CurrentMemoryContext (i.e., T_AllocSetContext), and then the
> assign_hook copies syncrep_parse_result to TopMemoryContext.
> Because syncrep_parse_result is a global variable, these hooks can see it.

Hmm. Somewhat uneasy but should work. The attached patch does it.

> Here are some comments.
> 
> -SyncRepUpdateConfig(void)
> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself, MemoryContext cxt)
> 
> Sorry, it's my bad. The 'itself' argument is no longer needed because
> SyncRepFreeConfig is called by only one function.
> 
> -void
> -SyncRepFreeConfig(SyncRepConfigData *config)
> +SyncRepConfigData *
> +SyncRepCopyConfig(SyncRepConfigData *oldconfig, MemoryContext targetcxt)
> 
> I'm not sure the targetcxt argument is necessary.

Yes, these are just for signalling, so removing them does no harm.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 3c9142e..3d68fb5 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -68,6 +68,7 @@
 #include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
+#include "utils/memutils.h"
 #include "utils/ps_status.h"
 
 /* User-settable parameters for sync rep */
@@ -361,11 +362,6 @@ SyncRepInitConfig(void)
 {
     int            priority;
 
-    /* Update the config data of synchronous replication */
-    SyncRepFreeConfig(SyncRepConfig);
-    SyncRepConfig = NULL;
-    SyncRepUpdateConfig();
-
     /*
      * Determine if we are a potential sync standby and remember the result
      * for handling replies from standby.
@@ -868,47 +864,50 @@ SyncRepUpdateSyncStandbysDefined(void)
 }
 
 /*
- * Parse synchronous_standby_names and update the config data
- * of synchronous standbys.
+ * Free a previously-allocated config data of synchronous replication.
  */
 void
-SyncRepUpdateConfig(void)
+SyncRepFreeConfig(SyncRepConfigData *config)
 {
-    int    parse_rc;
-
-    if (!SyncStandbysDefined())
+    if (!config)
         return;
 
-    /*
-     * check_synchronous_standby_names() verifies the setting value of
-     * synchronous_standby_names before this function is called. So
-     * syncrep_yyparse() must not cause an error here.
-     */
-    syncrep_scanner_init(SyncRepStandbyNames);
-    parse_rc = syncrep_yyparse();
-    syncrep_scanner_finish();
-
-    if (parse_rc != 0)
-        ereport(ERROR,
-                (errcode(ERRCODE_SYNTAX_ERROR),
-                 errmsg_internal("synchronous_standby_names parser returned %d",
-                                 parse_rc)));
-
-    SyncRepConfig = syncrep_parse_result;
-    syncrep_parse_result = NULL;
+    list_free_deep(config->members);
+    pfree(config);
 }
 
 /*
- * Free a previously-allocated config data of synchronous replication.
+ * Returns a copy of a replication config data into the TopMemoryContext.
  */
-void
-SyncRepFreeConfig(SyncRepConfigData *config)
+SyncRepConfigData *
+SyncRepCopyConfig(SyncRepConfigData *oldconfig)
 {
-    if (!config)
-        return;
+    MemoryContext        oldcxt;
+    SyncRepConfigData  *newconfig;
+    ListCell           *lc;
 
-    list_free_deep(config->members);
-    pfree(config);
+    if (!oldconfig)
+        return NULL;
+
+    oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+    newconfig = (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
+    newconfig->num_sync = oldconfig->num_sync;
+    newconfig->members = list_copy(oldconfig->members);
+
+    /*
+     * The new members list is a combination of list cells on the new context
+     * and data pointed from each cell on the old context. So we explicitly
+     * copy the data.
+     */
+    foreach (lc, newconfig->members)
+    {
+        lfirst(lc) = pstrdup((char *) lfirst(lc));
+    }
+
+    MemoryContextSwitchTo(oldcxt);
+
+    return newconfig;
 }
 
 #ifdef USE_ASSERT_CHECKING
@@ -957,6 +956,8 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source)
 {
     int    parse_rc;
 
+    Assert(syncrep_parse_result == NULL);
+
     if (*newval != NULL && (*newval)[0] != '\0')
     {
         syncrep_scanner_init(*newval);
@@ -965,6 +966,7 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source)
         if (parse_rc != 0)
         {
+            syncrep_parse_result = NULL;
             GUC_check_errcode(ERRCODE_SYNTAX_ERROR);
             GUC_check_errdetail("synchronous_standby_names parser returned %d",
                                 parse_rc);
@@ -1017,17 +1019,39 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source)
         }
 
         /*
-         * syncrep_yyparse sets the global syncrep_parse_result as side effect.
-         * But this function is required to just check, so frees it
-         * after parsing the parameter.
+         * We leave syncrep_parse_result for the use in
+         * assign_synchronous_standby_names.
          */
-        SyncRepFreeConfig(syncrep_parse_result);
     }
 
     return true;
}
 
 void
+assign_synchronous_standby_names(const char *newval, void *extra)
+{
+    /* Free the old SyncRepConfig if exists */
+    if (SyncRepConfig)
+        SyncRepFreeConfig(SyncRepConfig);
+
+    SyncRepConfig = NULL;
+
+    /* Copy the parsed config into TopMemoryContext if exists */
+    if (syncrep_parse_result)
+    {
+        SyncRepConfig = SyncRepCopyConfig(syncrep_parse_result);
+
+        /*
+         * this memory block will be freed as a part of the memory contxt for
+         * config file processing.
+         */
+        syncrep_parse_result = NULL;
+    }
+
+    return;
+}
+
+void
 assign_synchronous_commit(int newval, void *extra)
 {
     switch (newval)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 81d3d28..20d23d5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2780,23 +2780,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
     MemoryContextSwitchTo(oldcontext);
 
     /*
-     * Allocate and update the config data of synchronous replication,
-     * and then get the currently active synchronous standbys.
+     * Get the currently active synchronous standbys.
      */
-    SyncRepUpdateConfig();
     LWLockAcquire(SyncRepLock, LW_SHARED);
     sync_standbys = SyncRepGetSyncStandbys(NULL);
     LWLockRelease(SyncRepLock);
 
-    /*
-     * Free the previously-allocated config data because a backend
-     * no longer needs it. The next call of this function needs to
-     * allocate and update the config data newly because the setting
-     * of sync replication might be changed between the calls.
-     */
-    SyncRepFreeConfig(SyncRepConfig);
-    SyncRepConfig = NULL;
-
     for (i = 0; i < max_wal_senders; i++)
     {
         WalSnd *walsnd = &WalSndCtl->walsnds[i];
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fb091bc..3ce83bf 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3484,7 +3484,7 @@ static struct config_string ConfigureNamesString[] =
         },
         &SyncRepStandbyNames,
         "",
-        check_synchronous_standby_names, NULL, NULL
+        check_synchronous_standby_names, assign_synchronous_standby_names, NULL
     },
 
     {
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 14b5664..9a1eb2f 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -59,13 +59,14 @@ extern void SyncRepReleaseWaiters(void);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
-extern void SyncRepUpdateConfig(void);
 extern void SyncRepFreeConfig(SyncRepConfigData *config);
+extern SyncRepConfigData *SyncRepCopyConfig(SyncRepConfigData *oldconfig);
 
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
 
 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 
 /*

Re: Support for N synchronous standby servers - take 2

From
Masahiko Sawada
Date:
On Mon, Apr 18, 2016 at 2:15 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Fri, 15 Apr 2016 17:36:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCOL6BCC+FWNCZH_XPgtWc_otnvShMx6_uAcU7Bwb16Rw@mail.gmail.com>
>> >> How about if we do all the parsing stuff in temporary context and then copy
>> >> the results using TopMemoryContext?  I don't think it will be a leak in
>> >> TopMemoryContext, because next time we try to check/assign s_s_names, it
>> >> will free the previous result.
>> >
>> > I agree with you. A temporary context for the parser seems
>> > reasonable. TopMemoryContext is created very early in main() so
>> > palloc on it is effectively the same with malloc.
>> > One problem is that only the top memory block is assumed to be
>> > free()'d, not pfree()'d by guc_set_extra. It makes this quite
>> > ugly..
>> >
>> > Maybe we shouldn't use the extra for this purpose.
>> >
>> > Thoughts?
>> >
>>
>> How about if check_hook just parses parameter in
>> CurrentMemoryContext (i.e., T_AllocSetContext), and then the
>> assign_hook copies syncrep_parse_result to TopMemoryContext.
>> Because syncrep_parse_result is a global variable, these hooks can see it.
>
> Hmm. Somewhat uneasy but should work. The attached patch does it.
>
>> Here are some comments.
>>
>> -SyncRepUpdateConfig(void)
>> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself, MemoryContext cxt)
>>
>> Sorry, it's my bad. The 'itself' argument is no longer needed because
>> SyncRepFreeConfig is called by only one function.
>>
>> -void
>> -SyncRepFreeConfig(SyncRepConfigData *config)
>> +SyncRepConfigData *
>> +SyncRepCopyConfig(SyncRepConfigData *oldconfig, MemoryContext targetcxt)
>>
>> I'm not sure the targetcxt argument is necessary.
>
> Yes, these are just for signalling, so removing them does no harm.
>

Thank you for updating the patch.

Here are some comments.

+       Assert(syncrep_parse_result == NULL);
+

Why do we need the Assert at this point?
It's possible that syncrep_parse_result is not NULL after setting
s_s_names by ALTER SYSTEM.

+               /*
+                * this memory block will be freed as a part of the memory contxt for
+                * config file processing.
+                */

s/contxt/context/

Regards,

--
Masahiko Sawada



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Wed, 20 Apr 2016 11:51:09 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoC5rrWSk-V79xjVfYr2UqQYrrCKsXkSxZrN9p5YAaeKJA@mail.gmail.com>
> On Mon, Apr 18, 2016 at 2:15 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Fri, 15 Apr 2016 17:36:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCOL6BCC+FWNCZH_XPgtWc_otnvShMx6_uAcU7Bwb16Rw@mail.gmail.com>
> >> How about if check_hook just parses parameter in
> >> CurrentMemoryContext (i.e., T_AllocSetContext), and then the
> >> assign_hook copies syncrep_parse_result to TopMemoryContext.
> >> Because syncrep_parse_result is a global variable, these hooks can see it.
> >
> > Hmm. Somewhat uneasy but should work. The attached patch does it.
..
> Thank you for updating the patch.
> 
> Here are some comments.
> 
> +       Assert(syncrep_parse_result == NULL);
> +
> 
> Why do we need the Assert at this point?
> It's possible that syncrep_parse_result is not NULL after setting
> s_s_names by ALTER SYSTEM.

Thank you for pointing it out. It is just a leftover from an
assumption that is no longer valid.

> +               /*
> +                * this memory block will be freed as a part of the memory contxt for
> +                * config file processing.
> +                */
> 
> s/contxt/context/

Thanks. I removed the whole comment and the corresponding code
since they're meaningless.

assign_s_s_names causes SEGV when it is called without calling
check_s_s_names. I think that's not the case for this variable
because it is unresettable amid a session. It makes me quite
uneasy, but I don't see a proper means to reset
syncrep_parse_result. A MemoryContext deletion hook would work but
it seems to be overkill for this single use.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 3c9142e..bdd6de0 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -68,6 +68,7 @@
 #include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
+#include "utils/memutils.h"
 #include "utils/ps_status.h"
 
 /* User-settable parameters for sync rep */
@@ -361,11 +362,6 @@ SyncRepInitConfig(void)
 {
     int            priority;
 
-    /* Update the config data of synchronous replication */
-    SyncRepFreeConfig(SyncRepConfig);
-    SyncRepConfig = NULL;
-    SyncRepUpdateConfig();
-
     /*
      * Determine if we are a potential sync standby and remember the result
      * for handling replies from standby.
@@ -868,47 +864,50 @@ SyncRepUpdateSyncStandbysDefined(void)
 }
 
 /*
- * Parse synchronous_standby_names and update the config data
- * of synchronous standbys.
+ * Free a previously-allocated config data of synchronous replication.
  */
 void
-SyncRepUpdateConfig(void)
+SyncRepFreeConfig(SyncRepConfigData *config)
 {
-    int    parse_rc;
-
-    if (!SyncStandbysDefined())
+    if (!config)
         return;
 
-    /*
-     * check_synchronous_standby_names() verifies the setting value of
-     * synchronous_standby_names before this function is called. So
-     * syncrep_yyparse() must not cause an error here.
-     */
-    syncrep_scanner_init(SyncRepStandbyNames);
-    parse_rc = syncrep_yyparse();
-    syncrep_scanner_finish();
-
-    if (parse_rc != 0)
-        ereport(ERROR,
-                (errcode(ERRCODE_SYNTAX_ERROR),
-                 errmsg_internal("synchronous_standby_names parser returned %d",
-                                 parse_rc)));
-
-    SyncRepConfig = syncrep_parse_result;
-    syncrep_parse_result = NULL;
+    list_free_deep(config->members);
+    pfree(config);
 }
 
 /*
- * Free a previously-allocated config data of synchronous replication.
+ * Returns a copy of a replication config data into the TopMemoryContext.
  */
-void
-SyncRepFreeConfig(SyncRepConfigData *config)
+SyncRepConfigData *
+SyncRepCopyConfig(SyncRepConfigData *oldconfig)
 {
-    if (!config)
-        return;
+    MemoryContext        oldcxt;
+    SyncRepConfigData  *newconfig;
+    ListCell           *lc;
 
-    list_free_deep(config->members);
-    pfree(config);
+    if (!oldconfig)
+        return NULL;
+
+    oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+    newconfig = (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
+    newconfig->num_sync = oldconfig->num_sync;
+    newconfig->members = list_copy(oldconfig->members);
+
+    /*
+     * The new members list is a combination of list cells on the new context
+     * and data pointed from each cell on the old context. So we explicitly
+     * copy the data.
+     */
+    foreach (lc, newconfig->members)
+    {
+        lfirst(lc) = pstrdup((char *) lfirst(lc));
+    }
+
+    MemoryContextSwitchTo(oldcxt);
+
+    return newconfig;
 }
 
 #ifdef USE_ASSERT_CHECKING
@@ -952,13 +951,30 @@ SyncRepQueueIsOrderedByLSN(int mode)
  * ===========================================================
  */
 
+/*
+ * check_synchronous_standby_names and assign_synchronous_standby_names are to
+ * be used from guc.c. The former generates a result pointed by
+ * syncrep_parse_result in the current memory context as the side effect and
+ * the latter reads it. This won't be a problem as long as the guc variable
+ * synchronous_standby_names cannot be set during a session.
+ */
+
 bool
 check_synchronous_standby_names(char **newval, void **extra, GucSource source)
 {
     int    parse_rc;
 
+    syncrep_parse_result = NULL;
+
     if (*newval != NULL && (*newval)[0] != '\0')
     {
+        /*
+         * syncrep_yyparse generates a result on the current memory context as
+         * the side effect and points it using the global
+         * syncrep_parse_result.  We don't clear the pointer even after the
+         * result is invalidated by discarding the context so make sure not to
+         * use it after invalidation.
+         */
         syncrep_scanner_init(*newval);
         parse_rc = syncrep_yyparse();
         syncrep_scanner_finish();
@@ -1015,19 +1031,28 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source)
                         syncrep_parse_result->num_sync, list_length(syncrep_parse_result->members)),
                     errhint("Specify more names of potential synchronous standbys in synchronous_standby_names.")));
         }
-
-        /*
-         * syncrep_yyparse sets the global syncrep_parse_result as side effect.
-         * But this function is required to just check, so frees it
-         * after parsing the parameter.
-         */
-        SyncRepFreeConfig(syncrep_parse_result);
     }
 
     return true;
 }
 
 void
+assign_synchronous_standby_names(const char *newval, void *extra)
+{
+    /* Free the old SyncRepConfig if exists */
+    if (SyncRepConfig)
+        SyncRepFreeConfig(SyncRepConfig);
+
+    /* Copy the parsed config into TopMemoryContext if exists */
+    if (syncrep_parse_result)
+        SyncRepConfig = SyncRepCopyConfig(syncrep_parse_result);
+    else
+        SyncRepConfig = NULL;
+
+    return;
+}
+
+void
 assign_synchronous_commit(int newval, void *extra)
 {
     switch (newval)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 81d3d28..20d23d5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2780,23 +2780,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
     MemoryContextSwitchTo(oldcontext);
 
     /*
-     * Allocate and update the config data of synchronous replication,
-     * and then get the currently active synchronous standbys.
+     * Get the currently active synchronous standbys.
      */
-    SyncRepUpdateConfig();
     LWLockAcquire(SyncRepLock, LW_SHARED);
     sync_standbys = SyncRepGetSyncStandbys(NULL);
     LWLockRelease(SyncRepLock);
 
-    /*
-     * Free the previously-allocated config data because a backend
-     * no longer needs it. The next call of this function needs to
-     * allocate and update the config data newly because the setting
-     * of sync replication might be changed between the calls.
-     */
-    SyncRepFreeConfig(SyncRepConfig);
-    SyncRepConfig = NULL;
-
     for (i = 0; i < max_wal_senders; i++)
     {
         WalSnd *walsnd = &WalSndCtl->walsnds[i];
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fb091bc..3ce83bf 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3484,7 +3484,7 @@ static struct config_string ConfigureNamesString[] =
         },
         &SyncRepStandbyNames,
         "",
-        check_synchronous_standby_names, NULL, NULL
+        check_synchronous_standby_names, assign_synchronous_standby_names, NULL
     },
 
     {
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 14b5664..9a1eb2f 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -59,13 +59,14 @@ extern void SyncRepReleaseWaiters(void);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
-extern void SyncRepUpdateConfig(void);
 extern void SyncRepFreeConfig(SyncRepConfigData *config);
+extern SyncRepConfigData *SyncRepCopyConfig(SyncRepConfigData *oldconfig);
 
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
 
 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 
 /*

Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Wed, Apr 20, 2016 at 12:46 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
>
> assign_s_s_names causes SEGV when it is called without calling
> check_s_s_names. I think that's not the case for this variable
> because it is unresettable amid a session. It makes me quite
> uneasy, but I don't see a proper means to reset
> syncrep_parse_result.
>

Is it because syncrep_parse_result is not freed after creating a copy of it in assign_synchronous_standby_names()?  If so, then I think we need to call SyncRepFreeConfig(syncrep_parse_result) in assign_synchronous_standby_names at the place below:

+ /* Copy the parsed config into TopMemoryContext if exists */
+ if (syncrep_parse_result)
+     SyncRepConfig = SyncRepCopyConfig(syncrep_parse_result);

Could you please explain how to trigger the scenario where you have seen SEGV?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Michael Paquier
Date:
On Sat, Apr 23, 2016 at 7:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 20, 2016 at 12:46 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>
>>
>> assign_s_s_names causes SEGV when it is called without calling
>> check_s_s_names. I think that's not the case for this variable
>> because it is unresettable amid a session. It makes me quite
>> uneasy, but I don't see a proper means to reset
>> syncrep_parse_result.
>>
>
> Is it because syncrep_parse_result is not freed after creating a copy of it
> in assign_synchronous_standby_names()?  If so, then I think we need to
> call SyncRepFreeConfig(syncrep_parse_result) in
> assign_synchronous_standby_names at the place below:
>
> + /* Copy the parsed config into TopMemoryContext if exists */
> + if (syncrep_parse_result)
> +     SyncRepConfig = SyncRepCopyConfig(syncrep_parse_result);
>
> Could you please explain how to trigger the scenario where you have seen
> SEGV?

Seeing this discussion moving on, I am wondering if we should not
discuss those improvements for 9.7. We are getting close to beta 1,
and this is clearly not a bug, and it's not like HEAD is broken. So I
think that we should not take the risk of making the code unstable at
this stage.
-- 
Michael



Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Sat, Apr 23, 2016 at 5:20 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Sat, Apr 23, 2016 at 7:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Apr 20, 2016 at 12:46 PM, Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >>
> >>
> >> assign_s_s_names causes SEGV when it is called without calling
> >> check_s_s_names. I think that's not the case for this variable
> >> because it is unresettable amid a session. It makes me quite
> >> uneasy, but I don't see a proper means to reset
> >> syncrep_parse_result.
> >>
> >
> > Is it because syncrep_parse_result is not freed after creating a copy of it
> > in assign_synchronous_standby_names()?  If so, then I think we need to
> > call SyncRepFreeConfig(syncrep_parse_result) in
> > assign_synchronous_standby_names at the place below:
> >
> > + /* Copy the parsed config into TopMemoryContext if exists */
> > + if (syncrep_parse_result)
> > +     SyncRepConfig = SyncRepCopyConfig(syncrep_parse_result);
> >
> > Could you please explain how to trigger the scenario where you have seen
> > SEGV?
>
> Seeing this discussion moving on, I am wondering if we should not
> discuss those improvements for 9.7.
>

The main point of this improvement is that the handling of the GUC s_s_names is not similar to what we do for other somewhat similar GUCs, which causes inefficiency in a non-hot code path (less used code).  So we can push this improvement to 9.7, but OTOH we can also consider it a non-beta-blocker issue and see if we can make this code path better in the meantime.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> The main point of this improvement is that the handling of the GUC s_s_names
> is not similar to what we do for other somewhat similar GUCs, which causes
> inefficiency in a non-hot code path (less used code).

This is not about efficiency, this is about correctness.  The proposed
v7 patch is flat out not acceptable, not now and not for 9.7 either,
because it introduces a GUC assign hook that can easily fail (eg, through
out-of-memory for the copy step).  Assign hook functions need to be
incapable of failure.  I do not see any good reason why this one cannot
satisfy that requirement, either.  It just needs to make use of the
"extra" mechanism to pass back an already-suitably-long-lived result from
check_synchronous_standby_names.  See check_timezone_abbreviations/
assign_timezone_abbreviations for a model to follow.  You are going to
need to find a way to package the parse result into a single malloc'd
blob, though, because that's as much as guc.c can keep track of for an
"extra" value.
        regards, tom lane
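
To make the single-blob idea concrete: a fixed header plus an inline flexible array member lets one allocation carry everything, which is exactly the shape the v8 patch below ends up with. pack_parse_result() is a hypothetical helper mirroring the patch's inline packing code; FLEXIBLE_ARRAY_MEMBER and NAMEDATALEN are the real PostgreSQL symbols.

/* One self-contained blob: header plus an inline array of names, so
 * guc.c can track and free() it as a single allocation. */
typedef struct SyncRepConfigData
{
    int     num_sync;       /* number of sync standbys to wait for */
    int     nmembers;       /* number of entries in members[] */
    char    members[FLEXIBLE_ARRAY_MEMBER][NAMEDATALEN];
} SyncRepConfigData;

#define SizeOfSyncRepConfig(n) \
    (offsetof(SyncRepConfigData, members) + (n) * NAMEDATALEN)

/* In the check hook: pack the palloc'd parse result into one malloc. */
static SyncRepConfigData *
pack_parse_result(int num_sync, List *names)    /* names: list of char * */
{
    SyncRepConfigData *blob;
    ListCell   *lc;
    int         i = 0;

    blob = (SyncRepConfigData *) malloc(SizeOfSyncRepConfig(list_length(names)));
    if (blob == NULL)
        return NULL;        /* caller reports the failure from the check hook */

    blob->num_sync = num_sync;
    blob->nmembers = list_length(names);
    foreach(lc, names)
        strlcpy(blob->members[i++], (char *) lfirst(lc), NAMEDATALEN);

    return blob;
}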



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
At Sat, 23 Apr 2016 10:12:03 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <476.1461420723@sss.pgh.pa.us>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > The main point of this improvement is that the handling of the GUC s_s_names
> > is not similar to what we do for other somewhat similar GUCs, which causes
> > inefficiency in a non-hot code path (less used code).
> 
> This is not about efficiency, this is about correctness.  The proposed
> v7 patch is flat out not acceptable, not now and not for 9.7 either,
> because it introduces a GUC assign hook that can easily fail (eg, through
> out-of-memory for the copy step).  Assign hook functions need to be
> incapable of failure.  I do not see any good reason why this one cannot
> satisfy that requirement, either.  It just needs to make use of the
> "extra" mechanism to pass back an already-suitably-long-lived result from
> check_synchronous_standby_names.  See check_timezone_abbreviations/
> assign_timezone_abbreviations for a model to follow. 

I had already looked there before v7 and had the same feeling
below in mind, but packing the result into a blob requires using
something other than a List to hold the name list (it should just
be an array), which in turn requires many changes wherever the
list is accessed. But the result is hopeless as you mentioned :(

> You are going to
> need to find a way to package the parse result into a single malloc'd
> blob, though, because that's as much as guc.c can keep track of for an
> "extra" value.

Ok, I'll post v8 with the blob solution soon.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello, attached is the new version v8.

At Tue, 26 Apr 2016 11:02:25 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20160426.110225.35506931.horiguchi.kyotaro@lab.ntt.co.jp>
> At Sat, 23 Apr 2016 10:12:03 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <476.1461420723@sss.pgh.pa.us>
> > Amit Kapila <amit.kapila16@gmail.com> writes:
> > > The main point of this improvement is that the handling of the GUC s_s_names
> > > is not similar to what we do for other somewhat similar GUCs, which causes
> > > inefficiency in a non-hot code path (less used code).
> > 
> > This is not about efficiency, this is about correctness.  The proposed
> > v7 patch is flat out not acceptable, not now and not for 9.7 either,
> > because it introduces a GUC assign hook that can easily fail (eg, through
> > out-of-memory for the copy step).  Assign hook functions need to be
> > incapable of failure.  I do not see any good reason why this one cannot
> > satisfy that requirement, either.  It just needs to make use of the
> > "extra" mechanism to pass back an already-suitably-long-lived result from
> > check_synchronous_standby_names.  See check_timezone_abbreviations/
> > assign_timezone_abbreviations for a model to follow. 
> 
> I had already looked there before v7 and had the same feeling
> below in mind, but packing the result into a blob requires using
> something other than a List to hold the name list (it should just
> be an array), which in turn requires many changes wherever the
> list is accessed. But the result is hopeless as you mentioned :(
> 
> > You are going to
> > need to find a way to package the parse result into a single malloc'd
> > blob, though, because that's as much as guc.c can keep track of for an
> > "extra" value.
> 
> Ok, I'll post v8 with the blob solution soon.

Hmm. It was way easier than I thought. The attached v8 patch does the following:

- Changed SyncRepConfigData from a struct using a linked list to a blob. Since the former struct is useful in parsing, it is still used and converted into the latter form in check_s_s_names.

- Made assign_s_s_names do nothing other than just assign SyncRepConfig.

- Changed SyncRepGetSyncStandbys to read the latter form of configuration.

- Removed SyncRepFreeConfig since it is no longer needed.

It passes both make check and recovery/make check.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 3c9142e..376fe51 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -361,11 +361,6 @@ SyncRepInitConfig(void)
 {
     int            priority;
 
-    /* Update the config data of synchronous replication */
-    SyncRepFreeConfig(SyncRepConfig);
-    SyncRepConfig = NULL;
-    SyncRepUpdateConfig();
-
     /*
      * Determine if we are a potential sync standby and remember the result
      * for handling replies from standby.
@@ -575,7 +570,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
     if (am_sync != NULL)
         *am_sync = false;
 
-    lowest_priority = list_length(SyncRepConfig->members);
+    lowest_priority = SyncRepConfig->nmembers;
     next_highest_priority = lowest_priority + 1;
 
     /*
@@ -730,9 +725,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
 static int
 SyncRepGetStandbyPriority(void)
 {
-    List       *members;
-    ListCell   *l;
-    int            priority = 0;
+    int            priority;
     bool        found = false;
 
     /*
@@ -745,12 +738,9 @@ SyncRepGetStandbyPriority(void)
     if (!SyncStandbysDefined())
         return 0;
 
-    members = SyncRepConfig->members;
-    foreach(l, members)
+    for (priority = 1; priority <= SyncRepConfig->nmembers; priority++)
     {
-        char       *standby_name = (char *) lfirst(l);
-
-        priority++;
+        char  *standby_name = SyncRepConfig->members[priority - 1];
 
         if (pg_strcasecmp(standby_name, application_name) == 0 ||
             pg_strcasecmp(standby_name, "*") == 0)
@@ -867,50 +857,6 @@ SyncRepUpdateSyncStandbysDefined(void)
     }
 }
 
-/*
- * Parse synchronous_standby_names and update the config data
- * of synchronous standbys.
- */
-void
-SyncRepUpdateConfig(void)
-{
-    int    parse_rc;
-
-    if (!SyncStandbysDefined())
-        return;
-
-    /*
-     * check_synchronous_standby_names() verifies the setting value of
-     * synchronous_standby_names before this function is called. So
-     * syncrep_yyparse() must not cause an error here.
-     */
-    syncrep_scanner_init(SyncRepStandbyNames);
-    parse_rc = syncrep_yyparse();
-    syncrep_scanner_finish();
-
-    if (parse_rc != 0)
-        ereport(ERROR,
-                (errcode(ERRCODE_SYNTAX_ERROR),
-                 errmsg_internal("synchronous_standby_names parser returned %d",
-                                 parse_rc)));
-
-    SyncRepConfig = syncrep_parse_result;
-    syncrep_parse_result = NULL;
-}
-
-/*
- * Free a previously-allocated config data of synchronous replication.
- */
-void
-SyncRepFreeConfig(SyncRepConfigData *config)
-{
-    if (!config)
-        return;
-
-    list_free_deep(config->members);
-    pfree(config);
-}
-
 #ifdef USE_ASSERT_CHECKING
 static bool
 SyncRepQueueIsOrderedByLSN(int mode)
@@ -956,9 +902,16 @@ bool
 check_synchronous_standby_names(char **newval, void **extra, GucSource source)
 {
     int    parse_rc;
+    SyncRepConfigData *pconf;
+    int i;
+    ListCell *lc;
 
     if (*newval != NULL && (*newval)[0] != '\0')
     {
+        /*
+         * syncrep_yyparse generates a result on the current memory context as
+         * the side effect and points it using syncrep_parse_result.
+         */
         syncrep_scanner_init(*newval);
         parse_rc = syncrep_yyparse();
         syncrep_scanner_finish();
@@ -1016,18 +969,35 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source)
                     errhint("Specify more names of potential synchronous standbys in synchronous_standby_names.")));
         }
 
-        /*
-         * syncrep_yyparse sets the global syncrep_parse_result as side effect.
-         * But this function is required to just check, so frees it
-         * after parsing the parameter.
-         */
-        SyncRepFreeConfig(syncrep_parse_result);
+        /* Convert SyncRepConfig into the packed struct fit to guc extra */
+        pconf = (SyncRepConfigData *)
+            malloc(SizeOfSyncRepConfig(
+                       list_length(syncrep_parse_result->members)));
+        pconf->num_sync = syncrep_parse_result->num_sync;
+        pconf->nmembers = list_length(syncrep_parse_result->members);
+        i = 0;
+        foreach (lc, syncrep_parse_result->members)
+        {
+            strncpy(pconf->members[i], (char *) lfirst(lc), NAMEDATALEN - 1);
+            pconf->members[i][NAMEDATALEN - 1] = 0;
+            i++;
+        }
+        *extra = (void *) pconf;
+
+        /* No further need for syncrep_parse_result */
+        syncrep_parse_result = NULL;
     }
 
     return true;
 }
 
 void
+assign_synchronous_standby_names(const char *newval, void *extra)
+{
+    SyncRepConfig = (SyncRepConfigData *) extra;
+}
+
+void
 assign_synchronous_commit(int newval, void *extra)
 {
     switch (newval)
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 380fedc..932fa9d 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -19,9 +19,9 @@
 #include "utils/formatting.h"
 
 /* Result of the parsing is returned here */
-SyncRepConfigData    *syncrep_parse_result;
+SyncRepParseData    *syncrep_parse_result;
 
-static SyncRepConfigData *create_syncrep_config(char *num_sync, List *members);
+static SyncRepParseData *create_syncrep_config(char *num_sync, List *members);
 
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -43,7 +43,7 @@ static SyncRepParseData *create_syncrep_config(char *num_sync, List *members);
 {
     char       *str;
     List       *list;
-    SyncRepConfigData  *config;
+    SyncRepParseData  *config;
 }
 
 %token <str> NAME NUM
@@ -72,11 +72,11 @@ standby_name:
         ;
 %%
 
-static SyncRepConfigData *
+static SyncRepParseData *
 create_syncrep_config(char *num_sync, List *members)
 {
-    SyncRepConfigData *config =
-        (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
+    SyncRepParseData *config =
+        (SyncRepParseData *) palloc(sizeof(SyncRepParseData));
 
     config->num_sync = atoi(num_sync);
     config->members = members;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 81d3d28..20d23d5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2780,23 +2780,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
     MemoryContextSwitchTo(oldcontext);
 
     /*
-     * Allocate and update the config data of synchronous replication,
-     * and then get the currently active synchronous standbys.
+     * Get the currently active synchronous standbys.
      */
-    SyncRepUpdateConfig();
     LWLockAcquire(SyncRepLock, LW_SHARED);
     sync_standbys = SyncRepGetSyncStandbys(NULL);
     LWLockRelease(SyncRepLock);
 
-    /*
-     * Free the previously-allocated config data because a backend
-     * no longer needs it. The next call of this function needs to
-     * allocate and update the config data newly because the setting
-     * of sync replication might be changed between the calls.
-     */
-    SyncRepFreeConfig(SyncRepConfig);
-    SyncRepConfig = NULL;
-
     for (i = 0; i < max_wal_senders; i++)
     {
         WalSnd *walsnd = &WalSndCtl->walsnds[i];
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 60856dd..cccc8eb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3484,7 +3484,7 @@ static struct config_string ConfigureNamesString[] =
         },
         &SyncRepStandbyNames,
         "",
-        check_synchronous_standby_names, NULL, NULL
+        check_synchronous_standby_names, assign_synchronous_standby_names, NULL
     },
 
     {
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 14b5664..6197308 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -33,16 +33,30 @@
 #define SYNC_REP_WAIT_COMPLETE        2
 
 /*
+ * Struct for parsing synchronous_standby_names
+ */
+typedef struct SyncRepParseData
+{
+    int    num_sync;    /* number of sync standbys that we need to wait for */
+    List    *members;    /* list of names of potential sync standbys */
+} SyncRepParseData;
+
+/*
  * Struct for the configuration of synchronous replication.
  */
 typedef struct SyncRepConfigData
 {
     int    num_sync;    /* number of sync standbys that we need to wait for */
-    List    *members;    /* list of names of potential sync standbys */
+    int    nmembers;    /* number of members in the following list */
+    char members[FLEXIBLE_ARRAY_MEMBER][NAMEDATALEN];    /* list of names of
+                                                          * potential sync
+                                                          * standbys */
 } SyncRepConfigData;
 
-extern SyncRepConfigData *syncrep_parse_result;
-extern SyncRepConfigData *SyncRepConfig;
+#define SizeOfSyncRepConfig(n) \
+    (offsetof(SyncRepConfigData, members) + (n) * NAMEDATALEN)
+
+extern SyncRepParseData *syncrep_parse_result;
 
 /* user-settable parameters for synchronous replication */
 extern char *SyncRepStandbyNames;
@@ -59,13 +73,12 @@ extern void SyncRepReleaseWaiters(void);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
-extern void SyncRepUpdateConfig(void);
-extern void SyncRepFreeConfig(SyncRepConfigData *config);
 
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
 
 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 
 /*

Re: Support for N synchronous standby servers - take 2

From
Amit Kapila
Date:
On Tue, Apr 26, 2016 at 9:15 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, attached is the new version v8.

At Tue, 26 Apr 2016 11:02:25 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160426.110225.35506931.horiguchi.kyotaro@lab.ntt.co.jp>
> At Sat, 23 Apr 2016 10:12:03 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <476.1461420723@sss.pgh.pa.us>
> > Amit Kapila <amit.kapila16@gmail.com> writes:
> > > The main point for this improvement is that the handling for guc s_s_names
> > > is not similar to what we do for other somewhat similar guc's and which
> > > causes in-efficiency in non-hot code path (less used code).
> >
> > This is not about efficiency, this is about correctness.  The proposed
> > v7 patch is flat out not acceptable, not now and not for 9.7 either,
> > because it introduces a GUC assign hook that can easily fail (eg, through
> > out-of-memory for the copy step).  Assign hook functions need to be
> > incapable of failure.

It seems to me that a similar problem exists for assign_pgstat_temp_directory(), as it can also lead to an "out of memory" error.  However, in general I understand your concern and I think we should avoid any such failure in assign functions.
 
  I do not see any good reason why this one cannot
> > satisfy that requirement, either.  It just needs to make use of the
> > "extra" mechanism to pass back an already-suitably-long-lived result from
> > check_synchronous_standby_names.  See check_timezone_abbreviations/
> > assign_timezone_abbreviations for a model to follow.
>
> I had already seen there before the v7 and had the same feeling
> below in mind but packing in a blob needs to use other than List
> to hold the name list (just should be an array) and it is
> followed by the necessity of many changes in where the list is
> accessed. But the result is hopeless as you mentioned :(
>
> > You are going to
> > need to find a way to package the parse result into a single malloc'd
> > blob, though, because that's as much as guc.c can keep track of for an
> > "extra" value.
>
> Ok, I'll post the v8 with the blob solution sooner.

Hmm. It was way easier than I thought. The attached v8 patch does the
following:

- Changed SyncRepConfigData from a struct using a linked list to a
  packed blob. Since the former struct is useful for parsing, it is
  still used and converted into the latter form in check_s_s_names.

- Made assign_s_s_names do nothing other than just assign
  SyncRepConfig.

- Changed SyncRepGetSyncStandbys to read the latter form of the
  configuration.

- Removed SyncRepFreeConfig, since it is no longer needed.
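
In outline, the resulting check/assign pair looks like this (a rough
sketch only, with parsing and error handling elided; see the review
comments that follow, and the attached patch for the real code):

bool
check_synchronous_standby_names(char **newval, void **extra, GucSource source)
{
    SyncRepConfigData *pconf;

    /* ... run syncrep_yyparse(), which fills syncrep_parse_result ... */

    /* Pack the parse result into a single malloc'd blob for "extra" */
    pconf = (SyncRepConfigData *)
        malloc(SizeOfSyncRepConfig(list_length(syncrep_parse_result->members)));
    /* ... copy num_sync and the member names into pconf ... */
    *extra = (void *) pconf;

    return true;
}

void
assign_synchronous_standby_names(const char *newval, void *extra)
{
    /* must be incapable of failure: just install the pre-built blob */
    SyncRepConfig = (SyncRepConfigData *) extra;
}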


+ /* Convert SyncRepConfig into the packed struct fit to guc extra */
+ pconf = (SyncRepConfigData *)
+ malloc(SizeOfSyncRepConfig(
+   list_length(syncrep_parse_result->members)));

I think there should be a check for malloc failure in the above code.


+ /* No further need for syncrep_parse_result */
+ syncrep_parse_result = NULL;

Isn't this a memory leak?  Shouldn't we free the corresponding memory as well?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Tue, 26 Apr 2016 09:57:50 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1KGVrQTueP2Rijjg_FNQ_TU3n5rt8-X5a0LaEzUQ-+i-Q@mail.gmail.com>
> > > > Amit Kapila <amit.kapila16@gmail.com> writes:
> > > > > The main point for this improvement is that the handling for guc
> > s_s_names
> > > > > is not similar to what we do for other somewhat similar guc's and
> > which
> > > > > causes in-efficiency in non-hot code path (less used code).
> > > >
> > > > This is not about efficiency, this is about correctness.  The proposed
> > > > v7 patch is flat out not acceptable, not now and not for 9.7 either,
> > > > because it introduces a GUC assign hook that can easily fail (eg,
> > through
> > > > out-of-memory for the copy step).  Assign hook functions need to be
> > > > incapable of failure.
> 
> 
> It seems to me that similar problem can be there
> for assign_pgstat_temp_directory() as it can also lead to "out of memory"
> error.  However, in general I understand your concern and I think we should
> avoid any such failure in assign functions.

I noticed the missing error handling for malloc, then searched for
the callers of guc_malloc just now and found the same thing. This
should be addressed as a separate issue.

> > > > You are going to
> > > > need to find a way to package the parse result into a single malloc'd
> > > > blob, though, because that's as much as guc.c can keep track of for an
> > > > "extra" value.
> > >
> > > Ok, I'll post the v8 with the blob solution sooner.
> >
> > Hmm. It was way easier than I thought. The attached v8 patch does the following:
...
> + /* Convert SyncRepConfig into the packed struct fit to guc extra */
> + pconf = (SyncRepConfigData *)
> + malloc(SizeOfSyncRepConfig(
> +   list_length(syncrep_parse_result->members)));
> 
> I think there should be a check for malloc failure in the above code.

Yes, I'm ashamed to have forgotten what I mentioned just before. I
added the same check that guc_malloc uses. The error level is ERROR,
since parsing of GUC files should continue on parse errors (see
check_log_destination).
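
The added check has the same shape as guc_malloc(), i.e. roughly:

    pconf = (SyncRepConfigData *)
        malloc(SizeOfSyncRepConfig(list_length(syncrep_parse_result->members)));
    if (pconf == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OUT_OF_MEMORY),
                 errmsg("out of memory")));

(This is only a sketch following guc_malloc's style; the attached v9
has the actual change.)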


> + /* No further need for syncrep_parse_result */
> + syncrep_parse_result = NULL;
> 
> Isn't this a memory leak?  Shouldn't we free the corresponding memory
> as well?

It is palloc'ed in the current memory context, which AFAICS would be
'config file processing', or 'PortalHeapMemory' in the ALTER SYSTEM
case. Both of those contexts are rather short-lived. I don't think
leaving the memory around is a problem in either case, and there's no
point in freeing only this allocation among those (if any) made by
the bison- and flex-generated code... I suppose.
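
For illustration, the lifetime in question works like this (a generic
memory-context sketch, not code from the patch; the context name is
just an example):

    MemoryContext cxt;
    MemoryContext old;

    cxt = AllocSetContextCreate(CurrentMemoryContext,
                                "config file processing",
                                ALLOCSET_DEFAULT_MINSIZE,
                                ALLOCSET_DEFAULT_INITSIZE,
                                ALLOCSET_DEFAULT_MAXSIZE);
    old = MemoryContextSwitchTo(cxt);
    (void) syncrep_yyparse();      /* parse-tree pallocs land in cxt */
    MemoryContextSwitchTo(old);
    MemoryContextDelete(cxt);      /* ... and are all released here */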

I just added a comment in the v9.

|   * No further need for syncrep_parse_result. The memory blocks are
|   * released along with the deletion of the current context.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
On Wed, Apr 27, 2016 at 10:14 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I just added a comment in the v9.

Sorry, I attached an empty patch. Here is another one, this time with
content.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

Re: Support for N synchronous standby servers - take 2

From
Tom Lane
Date:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> Sorry, I attached an empty patch. Here is another one, this time with
> content.

I started to review this, and in passing came across this gem in
syncrep_scanner.l:


/*
 * flex emits a yy_fatal_error() function that it calls in response to
 * critical errors like malloc failure, file I/O errors, and detection of
 * internal inconsistency.  That function prints a message and calls exit().
 * Mutate it to instead call ereport(FATAL), which terminates this process.
 *
 * The process that causes this fatal error should be terminated.
 * Otherwise it has to abandon the new setting value of
 * synchronous_standby_names and keep running with the previous one
 * while the other processes switch to the new one.
 * This inconsistency of the setting that each process is based on
 * can cause a serious problem. Though it's basically not good idea to
 * use FATAL here because it can take down the postmaster,
 * we should do that in order to avoid such an inconsistency.
 */
 
#undef fprintf
#define fprintf(file, fmt, msg) syncrep_flex_fatal(fmt, msg)

static void
syncrep_flex_fatal(const char *fmt, const char *msg)
{
    ereport(FATAL, (errmsg_internal("%s", msg)));
}


This is the faultiest reasoning possible.  There are a hundred reasons why
a process might fail to absorb a GUC setting, and causing just one such
code path to FATAL out is not going to improve system stability one bit.

If you think it is absolutely imperative that all processes in the system
have identical synchronous_standby_names settings, then we need to make
it be PGC_POSTMASTER, not indulge in half-baked non-solutions like this.
But I'd like to know why that is so essential.  It looks to me like what
matters is only whether each individual walsender thinks its client is
a sync standby, and so inconsistent settings between different walsenders
don't really matter.  Which is a good thing, because if it's to remain
SIGHUP, you can't promise that they'll all absorb a new value at the same
instant anyway.

In short, I don't see any good reason not to make this be a plain ERROR
like it is in every other scanner in the backend.
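
(Concretely, that would look something like this; a sketch of the
suggested change, not the committed code:)

static void
syncrep_flex_fatal(const char *fmt, const char *msg)
{
    ereport(ERROR, (errmsg_internal("%s", msg)));
}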
        regards, tom lane



Re: Support for N synchronous standby servers - take 2

From
Tom Lane
Date:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> Sorry, I attached an empty patch. Here is another one, this time with
> content.

I pushed this after whacking it around some, and cleaning up some
sort-of-related problems in the syncrep parser/lexer.

There remains a point that I'm not very happy about, which is the code
in check_synchronous_standby_names to emit a WARNING if the num_sync
setting is too large.  That's a pretty bad compromise: we should either
decide that the case is legal or that it is not.  If it's legal, people
who are correctly using the case will not thank us for logging a WARNING
every single time the postmaster gets a SIGHUP (and those who aren't using
it correctly will have their systems freezing up, warning or no warning).
If it's not legal, we should make it an error not a warning.

My inclination is to just rip out the warning.  But I wonder whether the
desire to have one doesn't imply that the semantics are poorly chosen
and should be revisited.
        regards, tom lane



Re: Support for N synchronous standby servers - take 2

From
Kyotaro HORIGUCHI
Date:
Hello,

At Wed, 27 Apr 2016 18:05:26 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <3167.1461794726@sss.pgh.pa.us>
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> > Sorry, I have attached an empty patch. This is another one that should
> > be with content.
> 
> I pushed this after whacking it around some, and cleaning up some
> sort-of-related problems in the syncrep parser/lexer.

Thank you for pushing this (with improvements) and for the
improvements to synchronous_standby_names. I agree with the
discussion that standby names should be restricted so as not to break
possible future extensions.

> There remains a point that I'm not very happy about, which is the code
> in check_synchronous_standby_names to emit a WARNING if the num_sync
> setting is too large.  That's a pretty bad compromise: we should either
> decide that the case is legal or that it is not.  If it's legal, people
> who are correctly using the case will not thank us for logging a WARNING
> every single time the postmaster gets a SIGHUP (and those who aren't using
> it correctly will have their systems freezing up, warning or no warning).
> If it's not legal, we should make it an error not a warning.

This specification makes the code a bit complex and the documentation
a bit harder to understand. It seems somewhat doubtful to me that
allowing duplicate (potentially synchronous) walreceivers is useful
enough to justify those disadvantages.

In spite of this, my inclination is also the same as the following,
rather than making the behavior consistent and clear:

> My inclination is to just rip out the warning.

Does anyone object to removing the warning?

> But I wonder whether the
> desire to have one doesn't imply that the semantics are poorly chosen
> and should be revisited.

We have already abandoned a bit of backward compatibility in this
feature.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Support for N synchronous standby servers - take 2

From
Tom Lane
Date:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> At Wed, 27 Apr 2016 18:05:26 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <3167.1461794726@sss.pgh.pa.us>
>> My inclination is to just rip out the warning.

> Is there anyone object to removing the warining?

Hearing no objections, done.
        regards, tom lane