Thread: Support for N synchronous standby servers - take 2
There was a discussion on support for N synchronous standby servers started by Michael. Refer http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com . The use of hooks and a dedicated language was suggested; however, it seemed to be overkill for the scenario and there was no consensus on this. Exploring GUC-land was preferred.
Please find attached a patch, built on Michael's patch from the above-mentioned thread, which supports choosing a different number of nodes from each set, i.e. k nodes from set 1, l nodes from set 2, and so on.
The format of synchronous_standby_names has been updated to the standby name followed by the required count, separated by a hyphen. Ex: 'aa-1, bb-3'. The transaction waits for the specified number of standbys in each group. Any extra nodes with the same name will be considered potential. The special entry * for the standby name is also supported.
Thanks,
Beena Emerson
On Fri, May 15, 2015 at 8:55 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> There was a discussion on support for N synchronous standby servers started
> by Michael. Refer
> http://archives.postgresql.org/message-id/CAB7nPqR9c84ig0ZUvhMQAMq53VQsD4rC82vYci4Dr27PVOFf9w@mail.gmail.com .
> The use of hooks and a dedicated language was suggested; however, it seemed
> to be overkill for the scenario and there was no consensus on this.
> Exploring GUC-land was preferred.

Cool.

> Please find attached a patch, built on Michael's patch from the
> above-mentioned thread, which supports choosing a different number of
> nodes from each set, i.e. k nodes from set 1, l nodes from set 2, and
> so on.
> The format of synchronous_standby_names has been updated to the standby
> name followed by the required count, separated by a hyphen. Ex: 'aa-1,
> bb-3'. The transaction waits for the specified number of standbys in
> each group. Any extra nodes with the same name will be considered
> potential. The special entry * for the standby name is also supported.

I don't think that this is going in the good direction; what was
suggested mainly by Robert was to use a micro-language that would
allow far more extensibility than what you are proposing. See for
example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w@mail.gmail.com
for some ideas. IMO, before writing any patch in this area we should
find a clear consensus on what we want to do. Also, unrelated to this
patch, we should really get first the patch implementing the... Hum...
infrastructure for regression tests regarding replication and
archiving to be able to have actual tests for this feature (working on
it for next CF).

+   if (!SplitIdentifierString(standby_detail, '-', &elemlist2))
+   {
+       /* syntax error in list */
+       pfree(rawstring);
+       list_free(elemlist1);
+       return 0;
+   }

At quick glance, this looks problematic to me if application_name has a
hyphen.

Regards,
--
Michael
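[Editor's note: one way to sidestep the hyphen ambiguity is to split each entry at the last hyphen only, so an application_name containing hyphens still parses. This is a hypothetical standalone sketch — split_standby_entry is an invented name, not from the patch — and it would still misparse a hyphen-digit suffix when the count is omitted:]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch: split one synchronous_standby_names entry of the
 * form "<name>-<count>" at the LAST hyphen, so that names such as
 * "node-a" survive.  Still ambiguous if a standby name itself ends in
 * "-<digits>" and the count is omitted.
 */
static int
split_standby_entry(const char *entry, char *name, size_t namesz, int *count)
{
    const char *dash = strrchr(entry, '-');
    size_t      len;

    if (dash == NULL || dash == entry || dash[1] == '\0')
        return -1;              /* no usable "-<count>" suffix */

    len = (size_t) (dash - entry);
    if (len >= namesz)
        return -1;              /* name does not fit into the buffer */

    memcpy(name, entry, len);
    name[len] = '\0';
    *count = atoi(dash + 1);
    return 0;
}
```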
On Fri, May 15, 2015 at 9:18 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> I don't think that this is going in the good direction; what was
> suggested mainly by Robert was to use a micro-language that would
> allow far more extensibility than what you are proposing. See for
> example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w@mail.gmail.com
> for some ideas. IMO, before writing any patch in this area we should
> find a clear consensus on what we want to do.

The dedicated language for multiple sync replication would give more
extensibility, as you said, but I think there are not a lot of users who
want to, or should, use this.

IMHO such a dedicated extensible feature could be an extension module,
e.g. contrib. And we could implement a simpler feature in PostgreSQL core
with some restrictions.

Regards,

-------
Sawada Masahiko
On Sat, May 16, 2015 at 5:58 PM, Sawada Masahiko wrote:
> The dedicated language for multiple sync replication would give more
> extensibility, as you said, but I think there are not a lot of users who
> want to, or should, use this.
> IMHO such a dedicated extensible feature could be an extension module,
> e.g. contrib. And we could implement a simpler feature in PostgreSQL
> core with some restrictions.

As proposed, this feature does not bring us really closer to quorum
commit, and AFAIK that is what we are more or less aiming at, recalling
previous discussions. In particular, with the syntax proposed above it is
not possible to apply OR conditions to subgroups of nodes; the list of
nodes forcibly uses AND, because it is necessary to wait for all the
subgroups. Also, users may want to track nodes from the same group with
different application_name values.
--
Michael
Hello,

> I don't think that this is going in the good direction; what was
> suggested mainly by Robert was to use a micro-language that would
> allow far more extensibility than what you are proposing.

I agree, the micro-language would give far more extensibility. However,
as stated before, the previous discussions concluded that a GUC was the
preferred way because it is more user-friendly.

> See for example [hidden email] for some ideas. IMO, before writing any
> patch in this area we should find a clear consensus on what we want to
> do. Also, unrelated to this patch, we should really get first the patch
> implementing the... Hum... infrastructure for regression tests
> regarding replication and archiving to be able to have actual tests for
> this feature (working on it for next CF).

We could decide and work on the patch for n-sync along with setting up
the regression test infrastructure.

> At quick glance, this looks problematic to me if application_name has a
> hyphen.

Yes, I overlooked the fact that the application name could have a hyphen.
This can be modified.

Regards,

Beena Emerson

--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5849711.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
> As proposed, this feature does not bring us really closer to quorum
> commit, and AFAIK that is what we are more or less aiming at, recalling
> previous discussions. In particular, with the syntax proposed above it
> is not possible to apply OR conditions to subgroups of nodes; the list
> of nodes forcibly uses AND, because it is necessary to wait for all the
> subgroups. Also, users may want to track nodes from the same group with
> different application_name values.

The patch assumes that all standbys of a group share a name, and so the
"OR" condition would be taken care of that way. Also, since uniqueness of
standby_name cannot be enforced, the same name could be repeated across
groups!

Regards,

Beena

--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5849712.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
On Mon, May 18, 2015 at 8:42 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> I agree, the micro-language would give far more extensibility. However,
> as stated before, the previous discussions concluded that a GUC was the
> preferred way because it is more user-friendly.

Er, I am not sure I follow here. The idea proposed was to define a
string formatted with some micro-language within the existing GUC
s_s_names.
--
Michael
Hello,

> Er, I am not sure I follow here. The idea proposed was to define a
> string formatted with some micro-language within the existing GUC
> s_s_names.

I am sorry, I misunderstood. I thought the "language" approach meant the
use of hooks and modules. As you mentioned, the first step would be to
reach a consensus on the method.

If I understand correctly, s_s_names should be able to define:
- a count of sync standbys from a given group of names, ex: 2 from A,B,C.
- an AND condition: multiple groups and counts can be defined, ex: 1 from
X,Y AND 2 from A,B,C.

In this case, we can give the same priority to all the names specified in
a group. The standby names cannot be repeated across groups.

Robert had also talked about slightly more complex scenarios, such as
choosing either A or both B and C. Additionally, a preference for a
standby could also be specified, ex: among A, B and C, A can have higher
priority and would be selected if a standby with name A is connected.
This can make the language very complicated.

Should all these scenarios be covered in the n-sync selection, or can we
start with the basic two and update later?

Thanks & Regards,

Beena Emerson

--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5849736.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
On Mon, May 18, 2015 at 9:40 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> If I understand correctly, s_s_names should be able to define:
> - a count of sync standbys from a given group of names, ex: 2 from A,B,C.
> - an AND condition: multiple groups and counts can be defined, ex: 1 from
> X,Y AND 2 from A,B,C.
>
> [...]
>
> Should all these scenarios be covered in the n-sync selection, or can we
> start with the basic two and update later?

If it were me, I'd just go implement a scanner using flex and a parser
using bison and use that to parse the format I suggested before, or some
similar one. This may sound hard, but it's really not: I put together the
patch that became commit 878fdcb843e087cc1cdeadc987d6ef55202ddd04 in just
a few hours. I don't see why this would be particularly harder. Then,
instead of arguing about whether some stop-gap implementation is good
enough until we do the real thing, we can just have the real thing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
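[Editor's note: for a sense of scale, the grammar for a format like the one Robert suggested really is tiny. A hypothetical bison-style sketch — token names and productions assumed, not taken from any patch in this thread — might look like:]

```
/* hypothetical grammar sketch, not from any actual patch */
standby_config:
      element_list
    ;
element_list:
      element
    | element_list ',' element
    ;
element:
      NAME                          /* a single standby name */
    | '(' element_list ')'          /* group, count defaults to 1 */
    | INT '(' element_list ')'      /* wait for k elements of the group */
    ;
```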
On Fri, May 15, 2015 at 9:18 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> I don't think that this is going in the good direction; what was
> suggested mainly by Robert was to use a micro-language that would
> allow far more extensibility than what you are proposing. See for
> example CA+TgmobPWoeNMMEpfx0jWRvQufxVbqRv26Ezq_XHk21GxrXo9w@mail.gmail.com
> for some ideas.

Doesn't this approach prevent us from specifying a "potential" synchronous
standby server? For example, imagine the case where you want to treat the
server AAA as the synchronous standby. You also want to use the server BBB
as the synchronous standby only if the server AAA goes down. IOW, you want
to prefer the server AAA as the synchronous standby over BBB.

Currently we can easily set up that case by just setting
synchronous_standby_names as follows.

    synchronous_standby_names = 'AAA, BBB'

However, after we adopt the quorum commit feature with the proposed
micro-language, how can we set up that case? It seems impossible...
I'm afraid that this might be a backward compatibility issue.

Or we should extend the proposed micro-language so that it can also handle
the priority of each standby server? Not sure that's possible, though.

Regards,

--
Fujii Masao
On Wed, Jun 24, 2015 at 11:30 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Doesn't this approach prevent us from specifying a "potential"
> synchronous standby server? For example, imagine the case where you
> want to treat the server AAA as the synchronous standby. You also want
> to use the server BBB as the synchronous standby only if the server AAA
> goes down.
>
> Currently we can easily set up that case by just setting
> synchronous_standby_names as follows.
>
>     synchronous_standby_names = 'AAA, BBB'
>
> However, after we adopt the quorum commit feature with the proposed
> micro-language, how can we set up that case? It seems impossible...
> I'm afraid that this might be a backward compatibility issue.

Like that:

    synchronous_standby_names = 'AAA, BBB'

The thing is that we need to support the old grammar as well to be fully
backward compatible, and that's actually equivalent to this in the
grammar: 1(AAA,BBB,CCC). This is something I understood was included in
Robert's draft proposal.

> Or we should extend the proposed micro-language so that it can also
> handle the priority of each standby server? Not sure that's possible,
> though.

I am not sure that's really necessary; we only need to be able to manage
priorities within each subgroup. Putting it in a shape that users can
easily understand in pg_stat_replication looks more challenging, though.
We are going to need a new view, say pg_stat_replication_group, that
shows the priority status of each group, with one record per group,
taking into account that a group can be included in another one.
--
Michael
On Thu, Jun 25, 2015 at 12:57 PM, Fujii Masao wrote:
> On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier wrote:
>> and that's actually equivalent to this in
>> the grammar: 1(AAA,BBB,CCC).
>
> I don't think that they are the same. In the case of 1(AAA,BBB,CCC), while
> two servers AAA and BBB are running, the master server may return a success
> of the transaction to the client just after it receives the ACK from BBB.
> OTOH, in the case of AAA,BBB, that never happens. The master must wait for
> the ACK from AAA to arrive before completing the transaction. And then,
> if AAA goes down, BBB should become the synchronous standby.

Ah. Right. I missed your point, that's a bad day... We could have
multiple separators to define group types then:
- "()" where the order of acknowledgement does not matter
- "[]" where it does not.
You would find the old grammar with:

    1[AAA,BBB,CCC]

--
Michael
On Thu, Jun 25, 2015 at 12:15 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Like that:
>
>     synchronous_standby_names = 'AAA, BBB'
>
> The thing is that we need to support the old grammar as well to be
> fully backward compatible,

Yep, that's an idea. Supporting two different grammars is a bit messy,
though... If we merge the "priority" concept into the quorum commit,
that's better. But for now I have no idea how we can do that.

> and that's actually equivalent to this in
> the grammar: 1(AAA,BBB,CCC).

I don't think that they are the same. In the case of 1(AAA,BBB,CCC), while
two servers AAA and BBB are running, the master server may return a success
of the transaction to the client just after it receives the ACK from BBB.
OTOH, in the case of AAA,BBB, that never happens. The master must wait for
the ACK from AAA to arrive before completing the transaction. And then,
if AAA goes down, BBB should become the synchronous standby.

Regards,

--
Fujii Masao
On Thu, Jun 25, 2015 at 7:32 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Let's start with a complex, fully described use case, then work out how
> to specify what we want.
>
> I'm nervous of "it would be good ifs" because we do a ton of work only
> to find a design flaw.

I'm not sure about the specific implementation yet, but I came up with a
solution for this case.

For example:

- s_s_name = '1(a, b), c, d'
The priority of both 'a' and 'b' is 1, 'c' is 2, and 'd' is 3.
I.e., 'b' and 'c' are potential sync nodes, and the quorum commit is
enabled only between 'a' and 'b'.

- s_s_name = 'a, 1(b,c), d'
The priority of 'a' is 1, 'b' and 'c' are 2, and 'd' is 3.
So the quorum commit with 'b' and 'c' will be enabled after 'a' goes down.

With this idea, I think that we could keep the conventional syntax, as in
the past. Thoughts?

Regards,

--
Sawada Masahiko
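[Editor's note: the priority rule above is mechanical enough to sketch. The following is purely a hypothetical illustration — priority_of is an invented helper, assuming a single level of parentheses and names that do not start with a digit: each top-level element bumps the priority, and every name inside one "k(...)" group shares that group's priority.]

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical sketch of the priority rule described above (invented
 * helper, illustration only): each top-level element of s_s_names gets
 * the next priority, and all names inside one "k(...)" group share that
 * group's priority.  Assumes a single level of parentheses and names
 * that do not start with a digit.
 */
static int
priority_of(const char *spec, const char *target)
{
    int         prio = 0;
    int         depth = 0;
    const char *p = spec;

    while (*p)
    {
        char c = *p;

        if (c == ' ' || c == ',')
            p++;
        else if (c >= '0' && c <= '9')
        {
            while (*p >= '0' && *p <= '9')
                p++;            /* skip the group's count k */
        }
        else if (c == '(')
        {
            if (depth == 0)
                prio++;         /* a group is one top-level element */
            depth++;
            p++;
        }
        else if (c == ')')
        {
            depth--;
            p++;
        }
        else
        {
            const char *start = p;

            while (*p && *p != ',' && *p != ')' && *p != ' ')
                p++;
            if (depth == 0)
                prio++;         /* a bare name is one top-level element */
            if ((size_t) (p - start) == strlen(target) &&
                strncmp(start, target, (size_t) (p - start)) == 0)
                return prio;
        }
    }
    return -1;                  /* target is not listed */
}
```

This reproduces both of Sawada-san's examples: in '1(a, b), c, d' the names a and b share priority 1 while c and d get 2 and 3; in 'a, 1(b,c), d' the group (b,c) shares priority 2.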
On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs wrote:
> Let's start with a complex, fully described use case, then work out how
> to specify what we want.

Well, one of the simplest cases where quorum commit and this feature
would be useful is, with 2 data centers:
- on center 1, master A and standby B
- on center 2, standby C and standby D

With the current synchronous_standby_names, what we can do now is ensure
that one node has acknowledged the commit of the master, for example
synchronous_standby_names = 'B,C,D'. But you know that :)

What this feature would allow us to do is, for example, ensure that a
node on data center 2 has acknowledged the commit of the master, meaning
that even if data center 1 is completely lost for one reason or another,
we have at least one node on center 2 that has lost no data at
transaction commit.

Now, regarding the way to express that, we need to use a concept of node
group for each element of synchronous_standby_names. A group contains a
set of elements, each element being a group or a single node. And for
each group we need to know three things when a commit needs to be
acknowledged:
- Does my group need to acknowledge the commit?
- If yes, how many elements in my group need to acknowledge it?
- Does the order of my elements matter?

That's where the micro-language idea makes sense. For example, we can
define a group using separators like (elt1,...,eltN) or [elt1,...,eltN].
Prepending a number to a group is essential as well for quorum commits.
Hence, for example, assuming that '()' is used for a group whose element
order does not matter:
- k(elt1,elt2,...,eltN) means that we need k elements of the set to
return true (aka commit confirmation).
- k[elt1,elt2,...,eltN] means that we need the first k elements of the
set to return true.

When k is not defined for a group, k = 1. Using only elements separated
by commas for the upper group means that we wait for the first element in
the set (for backward compatibility), hence:

    1(elt1,elt2,eltN) <=> elt1,elt2,eltN

We could as well mix both behaviors, i.e. be able to define for a group
to wait for the first k elements and a total of j elements in the whole
set, but I don't think that we need to go that far. I suspect that in
most cases users will be satisfied with a group of data centers where
they want to be sure that one or two nodes in each center have
acknowledged a commit of the master (performance is not the matter here
if centers are not close).

Hence, in the case above, you could get the wanted behavior with this
definition:

    2(B,(C,D))

With more data centers, like 3 (wait for two nodes in the 3rd set):

    3(B,(C,D),2(E,F,G))

Users could define more levels of groups, like this:

    2(A,(B,(C,D)))

But that's actually something few people would do in real cases.

> I'm nervous of "it would be good ifs" because we do a ton of work only
> to find a design flaw.

That makes sense. Let's continue arguing about it then.
--
Michael
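[Editor's note: to sanity-check the semantics above, here is a hypothetical evaluator sketch — not from any patch in this thread; spec_satisfied and friends are invented names. It decides whether the set of standbys that have acknowledged a commit satisfies a single group expression such as 2(B,(C,D)); for simplicity it ignores the ordered '[]' form and assumes no whitespace and names that do not start with a digit.]

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical sketch: evaluate whether the standbys that have
 * acknowledged a commit satisfy one group expression of the proposed
 * grammar, e.g. "2(B,(C,D))".  Treats every group as unordered and
 * assumes no whitespace in the spec.
 */
static const char *cur;         /* cursor into the spec string */
static const char **acked;      /* names that have acknowledged */
static int  nacked;

static bool
name_acked(const char *name, size_t len)
{
    int i;

    for (i = 0; i < nacked; i++)
        if (strlen(acked[i]) == len && strncmp(acked[i], name, len) == 0)
            return true;
    return false;
}

static bool eval_element(void);

/* group := [k] '(' element { ',' element } ')'   with k defaulting to 1 */
static bool
eval_group(void)
{
    int k = 0;
    int satisfied = 0;

    while (isdigit((unsigned char) *cur))
        k = k * 10 + (*cur++ - '0');
    if (k == 0)
        k = 1;                  /* k defaults to 1 */

    assert(*cur == '(');
    cur++;
    for (;;)
    {
        if (eval_element())
            satisfied++;        /* keep parsing to advance the cursor */
        if (*cur == ',')
        {
            cur++;
            continue;
        }
        break;
    }
    assert(*cur == ')');
    cur++;
    return satisfied >= k;
}

/* element := group | name */
static bool
eval_element(void)
{
    const char *start;

    if (isdigit((unsigned char) *cur) || *cur == '(')
        return eval_group();

    start = cur;
    while (*cur && *cur != ',' && *cur != ')')
        cur++;
    return name_acked(start, (size_t) (cur - start));
}

static bool
spec_satisfied(const char *spec, const char **acks, int n)
{
    cur = spec;
    acked = acks;
    nacked = n;
    return eval_element();
}
```

Under this reading, 2(B,(C,D)) is satisfied by {B, C} or {B, D} but not by {C, D} alone, which matches the data-center use case described above.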
Hi,

On 2015-06-26 AM 12:49, Sawada Masahiko wrote:
> For example:
> - s_s_name = '1(a, b), c, d'
> The priority of both 'a' and 'b' is 1, 'c' is 2, and 'd' is 3.
> I.e., 'b' and 'c' are potential sync nodes, and the quorum commit is
> enabled only between 'a' and 'b'.
>
> - s_s_name = 'a, 1(b,c), d'
> The priority of 'a' is 1, 'b' and 'c' are 2, and 'd' is 3.
> So the quorum commit with 'b' and 'c' will be enabled after 'a' goes
> down.

Do we really need to add a number like '1' in '1(a, b), c, d'?

The order of writing names already implies priorities like 2 & 3 for c &
d, respectively, as in your example. Having to write '1' for the group
'(a, b)' seems unnecessary, IMHO. Sorry if I have missed any previous
discussion where its necessity was discussed.

So, the order of writing standby names in the list should declare their
relative priorities, and parentheses (possibly nested) should help inform
about the grouping (for quorum?).

Thanks,
Amit
On Fri, Jun 26, 2015 at 2:59 PM, Amit Langote wrote:
> Do we really need to add a number like '1' in '1(a, b), c, d'?
> The order of writing names already implies priorities like 2 & 3 for c &
> d, respectively, as in your example. Having to write '1' for the group
> '(a, b)' seems unnecessary, IMHO. Sorry if I have missed any previous
> discussion where its necessity was discussed.

'1' is implied if no number is specified. That's the idea as written
here, not something decided, of course :)

> So, the order of writing standby names in the list should declare their
> relative priorities, and parentheses (possibly nested) should help
> inform about the grouping (for quorum?).

Yes.
--
Michael
On 2015-06-26 PM 02:59, Amit Langote wrote:
> Do we really need to add a number like '1' in '1(a, b), c, d'?

Oh, I missed Michael's latest message that describes its necessity. So,
the number is essentially the quorum for a group. Sorry about the noise.

Thanks,
Amit
Hi,

On 2015-06-25 PM 01:01, Michael Paquier wrote:
> Ah. Right. I missed your point, that's a bad day... We could have
> multiple separators to define group types then:
> - "()" where the order of acknowledgement does not matter
> - "[]" where it does not.

For '[]', I guess you meant "where it does."

> You would find the old grammar with:
> 1[AAA,BBB,CCC]

Thanks,
Amit
On Fri, Jun 26, 2015 at 5:04 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2015-06-25 PM 01:01, Michael Paquier wrote:
>> We could have multiple separators to define group types then:
>> - "()" where the order of acknowledgement does not matter
>> - "[]" where it does not.
>
> For '[]', I guess you meant "where it does."

Yes, thanks :p
--
Michael
On Fri, Jun 26, 2015 at 1:46 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs wrote: >> Let's start with a complex, fully described use case then work out how to >> specify what we want. > > Well, one of the simplest cases where quorum commit and this > feature would be useful is that of 2 data centers: > - on center 1, master A and standby B > - on center 2, standby C and standby D > With the current synchronous_standby_names, what we can do now is > ensure that one node has acknowledged the commit of the master. For > example synchronous_standby_names = 'B,C,D'. But you know that :) > What this feature would allow us to do is, for example, ensure > that a node on data center 2 has acknowledged the commit of the > master, meaning that even if data center 1 is completely lost for one > reason or another, we have at least one node on center 2 that has lost > no data at transaction commit. > > Now, regarding the way to express that, we need to use a concept of > node group for each element of synchronous_standby_names. A group > contains a set of elements, each element being a group or a single > node. And for each group we need to know three things when a commit > needs to be acknowledged: > - Does my group need to acknowledge the commit? > - If yes, how many elements in my group need to acknowledge it? > - Does the order of my elements matter? > > That's where the micro-language idea makes sense. For example, > we can define a group using separators like (elt1,...,eltN) or > [elt1,elt2,eltN]. Appending a number in front of a group is essential > as well for quorum commits. Hence for example, assuming that '()' is > used for a group whose element order does not matter: > - k(elt1,elt2,eltN) means that we need k elements in the set > to return true (aka commit confirmation). 
> - k[elt1,elt2,eltN] means that we need the first k elements in the > set to return true. > > When k is not defined for a group, k = 1. Using only elements > separated by commas for the upper group means that we wait for the > first element in the set (for backward compatibility), hence: > 1(elt1,elt2,eltN) <=> elt1,elt2,eltN Nice design. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
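To make the proposed semantics concrete, here is a minimal sketch of how a single group under this micro-language could be evaluated. This is purely illustrative, not PostgreSQL code; the function name and representation are invented, and it assumes '()' means any k acknowledgements suffice while '[]' means the first k listed elements must acknowledge:

```python
# Hypothetical sketch of the proposed group semantics; not PostgreSQL code.

def group_satisfied(k, elements, acked, ordered):
    """k(...) -> any k elements acked; k[...] -> the first k listed acked."""
    if ordered:
        # k[elt1,...]: the k highest-priority (listed-first) elements must ack
        return all(e in acked for e in elements[:k])
    # k(elt1,...): a quorum, any k elements suffice
    return sum(1 for e in elements if e in acked) >= k

# '2(B,C,D)': any two of B, C, D acknowledging is enough
print(group_satisfied(2, ["B", "C", "D"], {"C", "D"}, ordered=False))  # True
# '2[B,C,D]': specifically B and C must acknowledge, so C+D is not enough
print(group_satisfied(2, ["B", "C", "D"], {"C", "D"}, ordered=True))   # False
```

This also shows the distinction Fujii-san raised: '2(B,C,D)' can complete on any two acks, while '2[B,C,D]' must hear from the two highest-priority members.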
On 06/26/2015 09:42 AM, Robert Haas wrote: > On Fri, Jun 26, 2015 at 1:46 AM, Michael Paquier >> That's where the micro-language idea makes sense to use. For example, >> we can define a group using separators and like (elt1,...eltN) or >> [elt1,elt2,eltN]. Appending a number in front of a group is essential >> as well for quorum commits. Hence for example, assuming that '()' is >> used for a group whose element order does not matter, if we use that: >> - k(elt1,elt2,eltN) means that we need for the k elements in the set >> to return true (aka commit confirmation). >> - k[elt1,elt2,eltN] means that we need for the first k elements in the >> set to return true. >> >> When k is not defined for a group, k = 1. Using only elements >> separated by commas for the upper group means that we wait for the >> first element in the set (for backward compatibility), hence: >> 1(elt1,elt2,eltN) <=> elt1,elt2,eltN This really feels like we're going way beyond what we want in a single string GUC. I feel that this feature, as outlined, is a terrible hack which we will regret supporting in the future. You're taking something which was already a fast hack because we weren't sure if anyone would use it, and building two levels on top of that. If we're going to do quorum, multi-set synchrep, then we need to have a real management interface. Like, we really ought to have a system catalog and some built-in functions to manage this instead, e.g. pg_add_synch_set(set_name NAME, quorum INT, set_members VARIADIC) pg_add_synch_set('bolivia', 1, 'bsrv-2','bsrv-3','bsrv-5') pg_modify_sync_set(quorum INT, set_members VARIADIC) pg_drop_synch_set(set_name NAME) For users who want the new functionality, they just set synchronous_standby_names='catalog' in pg.conf. Having a function interface for this would make it worlds easier for the DBA to reconfigure in order to accommodate network changes as well. 
Let's face it, a DBA with three synch sets in different geos is NOT going to want to edit pg.conf by hand and reload when the link to Brazil goes down. That's a really sucky workflow, and near-impossible to automate. We'll also want a new system view, pg_stat_synchrep, with columns: standby_name, client_addr, replication_status, synch_set, synch_quorum, synch_status. Alternately, we could overload those columns onto pg_stat_replication, but that seems messy. Finally, while I'm raining on everyone's parade: the mechanism of identifying synchronous replicas by setting the application_name on the replica is confusing and error-prone; if we're building out synchronous replication into a sophisticated system, we ought to think about replacing it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Fri, Jun 26, 2015 at 1:12 PM, Josh Berkus <josh@agliodbs.com> wrote: > This really feels like we're going way beyond what we want a single > string GUC. I feel that this feature, as outlined, is a terrible hack > which we will regret supporting in the future. You're taking something > which was already a fast hack because we weren't sure if anyone would > use it, and building two levels on top of that. > > If we're going to do quorum, multi-set synchrep, then we need to have a > real management interface. Like, we really ought to have a system > catalog and some built in functions to manage this instead, e.g. > > pg_add_synch_set(set_name NAME, quorum INT, set_members VARIADIC) > > pg_add_synch_set('bolivia', 1, 'bsrv-2,'bsrv-3','bsrv-5') > > pg_modify_sync_set(quorum INT, set_members VARIADIC) > > pg_drop_synch_set(set_name NAME) > > For users who want the new functionality, they just set > synchronous_standby_names='catalog' in pg.conf. > > Having a function interface for this would make it worlds easier for the > DBA to reconfigure in order to accomodate network changes as well. > Let's face it, a DBA with three synch sets in different geos is NOT > going to want to edit pg.conf by hand and reload when the link to Brazil > goes down. That's a really sucky workflow, and near-impossible to automate. I think your proposal is worth considering, but you would need to fill in a lot more details and explain how it works in detail, rather than just via a set of example function calls. The GUC-based syntax proposal covers cases like multi-level rules and, now, prioritization, and it's not clear how those would be reflected in what you propose. > Finally, while I'm raining on everyone's parade: the mechanism of > identifying synchronous replicas by setting the application_name on the > replica is confusing and error-prone; if we're building out synchronous > replication into a sophisticated system, we ought to think about > replacing it. 
I'm not averse to replacing it with something we all agree is better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 06/26/2015 11:32 AM, Robert Haas wrote: > I think your proposal is worth considering, but you would need to fill > in a lot more details and explain how it works in detail, rather than > just via a set of example function calls. The GUC-based syntax > proposal covers cases like multi-level rules and, now, prioritization, > and it's not clear how those would be reflected in what you propose. So what I'm seeing from the current proposal is: 1. we have several defined synchronous sets 2. each set requires a quorum of k (defined per set) 3. within each set, replicas are arranged in priority order. One thing which the proposal does not implement is *names* for synchronous sets. I would also suggest that if I lose this battle and we decide to go with a single stringy GUC, that we at least use JSON instead of defining our own proprietary syntax? Point 3. also seems kind of vaguely defined. Are we still relying on the idea that multiple servers have the same application_name to make them equal, and that anything else is a prioritization? That is, if we have: replica1: appname=group1 replica2: appname=group2 replica3: appname=group1 replica4: appname=group2 replica5: appname=group1 replica6: appname=group2 And the definition: synchset: A, quorum: 2, members: [ group1, group2 ] Then the desired behavior would be: we must get acks from at least 2 servers in group1, but if group1 isn't responding, then from group2? What if *one* server in group1 responds? What do we do? Do we fail the whole group and try for 2 out of 3 in group2? Or do we only need one in group2? In which case, what prioritization is there? Who could possibly use anything so complex? I'm personally not convinced that quorum and prioritization are compatible. I suggest instead that quorum and prioritization should be exclusive alternatives, that is that a synch set should be either a quorum set (with all members as equals) or a prioritization set (if rep1 fails, try rep2). 
I can imagine use cases for either mode, but not one which would involve doing both together. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
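Josh's "exclusive alternatives" idea above can be sketched as two separate evaluation modes. This is a hypothetical illustration with invented names, not a proposed implementation:

```python
# Hypothetical sketch of a synch set as either a quorum set or a priority set.

def quorum_ok(members, acked, k):
    """Quorum set: all members are equals; any k acknowledgements satisfy it."""
    return len(set(acked) & set(members)) >= k

def priority_standby(members, connected):
    """Prioritization set: the synchronous standby is the first listed
    member that is currently connected (if rep1 fails, try rep2)."""
    for m in members:
        if m in connected:
            return m
    return None

print(quorum_ok(["rep1", "rep2", "rep3"], {"rep2", "rep3"}, k=2))  # True
print(priority_standby(["rep1", "rep2"], {"rep2"}))                # rep2
```

The point of keeping the modes separate is that each has an unambiguous answer to "which standby must ack?", whereas mixing them raises exactly the "what if *one* server in group1 responds?" questions above.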
On Fri, Jun 26, 2015 at 2:46 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs wrote: >> Let's start with a complex, fully described use case then work out how to >> specify what we want. > > Well, one of the most simple cases where quorum commit and this > feature would be useful for is that, with 2 data centers: > - on center 1, master A and standby B > - on center 2, standby C and standby D > With the current synchronous_standby_names, what we can do now is > ensuring that one node has acknowledged the commit of master. For > example synchronous_standby_names = 'B,C,D'. But you know that :) > What this feature would allow use to do is for example being able to > ensure that a node on the data center 2 has acknowledged the commit of > master, meaning that even if data center 1 completely lost for a > reason or another we have at least one node on center 2 that has lost > no data at transaction commit. > > Now, regarding the way to express that, we need to use a concept of > node group for each element of synchronous_standby_names. A group > contains a set of elements, each element being a group or a single > node. And for each group we need to know three things when a commit > needs to be acknowledged: > - Does my group need to acknowledge the commit? > - If yes, how many elements in my group need to acknowledge it? > - Does the order of my elements matter? > > That's where the micro-language idea makes sense to use. For example, > we can define a group using separators and like (elt1,...eltN) or > [elt1,elt2,eltN]. Appending a number in front of a group is essential > as well for quorum commits. Hence for example, assuming that '()' is > used for a group whose element order does not matter, if we use that: > - k(elt1,elt2,eltN) means that we need for the k elements in the set > to return true (aka commit confirmation). 
> - k[elt1,elt2,eltN] means that we need for the first k elements in the > set to return true. > > When k is not defined for a group, k = 1. Using only elements > separated by commas for the upper group means that we wait for the > first element in the set (for backward compatibility), hence: > 1(elt1,elt2,eltN) <=> elt1,elt2,eltN > I think that you meant "1[elt1,elt2,eltN] <=> elt1,elt2,eltN" in this case (for backward compatibility), right? Regards, -- Sawada Masahiko
On Sun, Jun 28, 2015 at 5:52 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Fri, Jun 26, 2015 at 2:46 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs wrote: >>> Let's start with a complex, fully described use case then work out how to >>> specify what we want. >> >> Well, one of the most simple cases where quorum commit and this >> feature would be useful for is that, with 2 data centers: >> - on center 1, master A and standby B >> - on center 2, standby C and standby D >> With the current synchronous_standby_names, what we can do now is >> ensuring that one node has acknowledged the commit of master. For >> example synchronous_standby_names = 'B,C,D'. But you know that :) >> What this feature would allow use to do is for example being able to >> ensure that a node on the data center 2 has acknowledged the commit of >> master, meaning that even if data center 1 completely lost for a >> reason or another we have at least one node on center 2 that has lost >> no data at transaction commit. >> >> Now, regarding the way to express that, we need to use a concept of >> node group for each element of synchronous_standby_names. A group >> contains a set of elements, each element being a group or a single >> node. And for each group we need to know three things when a commit >> needs to be acknowledged: >> - Does my group need to acknowledge the commit? >> - If yes, how many elements in my group need to acknowledge it? >> - Does the order of my elements matter? >> >> That's where the micro-language idea makes sense to use. For example, >> we can define a group using separators and like (elt1,...eltN) or >> [elt1,elt2,eltN]. Appending a number in front of a group is essential >> as well for quorum commits. 
>> Hence for example, assuming that '()' is >> used for a group whose element order does not matter, if we use that: >> - k(elt1,elt2,eltN) means that we need for the k elements in the set >> to return true (aka commit confirmation). >> - k[elt1,elt2,eltN] means that we need for the first k elements in the >> set to return true. >> >> When k is not defined for a group, k = 1. Using only elements >> separated by commas for the upper group means that we wait for the >> first element in the set (for backward compatibility), hence: >> 1(elt1,elt2,eltN) <=> elt1,elt2,eltN >> > > I think that you meant "1[elt1,elt2,eltN] <=> elt1,elt2,eltN" in this > case (for backward compatibility), right? Yes, [] is where the order of items matters. Thanks for the correction. Still, we could do the opposite; nothing is decided here. -- Michael
On Sat, Jun 27, 2015 at 3:53 AM, Josh Berkus <josh@agliodbs.com> wrote: > On 06/26/2015 11:32 AM, Robert Haas wrote: >> I think your proposal is worth considering, but you would need to fill >> in a lot more details and explain how it works in detail, rather than >> just via a set of example function calls. The GUC-based syntax >> proposal covers cases like multi-level rules and, now, prioritization, >> and it's not clear how those would be reflected in what you propose. > > So what I'm seeing from the current proposal is: > > 1. we have several defined synchronous sets > 2. each set requires a quorum of k (defined per set) > 3. within each set, replicas are arranged in priority order. > > One thing which the proposal does not implement is *names* for > synchronous sets. I would also suggest that if I lose this battle and > we decide to go with a single stringy GUC, that we at least use JSON > instead of defining our out, proprietary, syntax? JSON would be more flexible for defining synchronous sets, but it would require changing how the configuration file is parsed so that a value can contain newlines. > Point 3. also seems kind of vaguely defined. Are we still relying on > the idea that multiple servers have the same application_name to make > them equal, and that anything else is a proritization? That is, if we have: Yep, I guess that servers with the same application name have the same priority, and servers in the same set have the same priority. (Here "set" means a bunch of application names in the GUC.) > replica1: appname=group1 > replica2: appname=group2 > replica3: appname=group1 > replica4: appname=group2 > replica5: appname=group1 > replica6: appname=group2 > > And the definition: > > synchset: A > quorum: 2 > members: [ group1, group2 ] > > Then the desired behavior would be: we must get acks from at least 2 > servers in group1, but if group1 isn't responding, then from group2? 
In this case, if we want to use quorum commit (i.e., all replicas have the same priority), I guess that we must get acks from 2 *elements* in the list (both group1 and group2). If quorum = 1, we must get an ack from either group1 or group2. > What if *one* server in group1 responds? What do we do? Do we fail the > whole group and try for 2 out of 3 in group2? Or do we only need one in > group2? In which case, what prioritization is there? Who could > possibly use anything so complex? If some servers have the same application name, the master server will get different acks (write and flush LSNs) from those servers. We can use the lowest LSN of them to release backend waiters, for more safety. But if only one server in group1 returns an ack to the master server, and the other two servers are not working, I guess the master server can use it because the other servers are invalid. That is, we must get at least one ack from each of group1 and group2. > I'm personally not convinced that quorum and prioritization are > compatible. I suggest instead that quorum and prioritization should be > exclusive alternatives, that is that a synch set should be either a > quorum set (with all members as equals) or a prioritization set (if rep1 > fails, try rep2). I can imagine use cases for either mode, but not one > which would involve doing both together. > Yep, separating the GUC parameter between prioritization and quorum could also be a good idea. Also I think that we must make it possible to decide which server should be promoted when the master server is down. Regards, -- Sawada Masahiko
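Sawada-san's point about releasing waiters at the lowest LSN within a group of same-named standbys can be sketched as follows (a hypothetical illustration; LSNs are shown as plain integers and the function name is invented):

```python
# Hypothetical sketch: with several standbys sharing one application_name,
# the conservative point up to which backend waiters may be released is the
# minimum flush LSN acknowledged within the group.

def group_release_lsn(flush_lsns):
    """flush_lsns: flush positions acked by members of one group (ints here)."""
    return min(flush_lsns) if flush_lsns else None

# Three same-named standbys have flushed up to 980, 1000 and 1020;
# only commits up to 980 are known to be on all of them:
print(group_release_lsn([1000, 980, 1020]))  # 980
print(group_release_lsn([]))                 # None
```

Taking the minimum rather than the maximum is the "for more safety" choice: a commit is only reported durable once every acking member of the group has flushed it.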
On Sat, Jun 27, 2015 at 2:12 AM, Josh Berkus <josh@agliodbs.com> wrote: > Finally, while I'm raining on everyone's parade: the mechanism of > identifying synchronous replicas by setting the application_name on the > replica is confusing and error-prone; if we're building out synchronous > replication into a sophisticated system, we ought to think about > replacing it. I assume that you do not refer to a new parameter in the connection string like node_name, no? Are you referring to an extension of START_REPLICATION in the replication protocol to pass an ID? -- Michael
On 06/28/2015 04:36 AM, Sawada Masahiko wrote: > On Sat, Jun 27, 2015 at 3:53 AM, Josh Berkus <josh@agliodbs.com> wrote: >> On 06/26/2015 11:32 AM, Robert Haas wrote: >>> I think your proposal is worth considering, but you would need to fill >>> in a lot more details and explain how it works in detail, rather than >>> just via a set of example function calls. The GUC-based syntax >>> proposal covers cases like multi-level rules and, now, prioritization, >>> and it's not clear how those would be reflected in what you propose. >> >> So what I'm seeing from the current proposal is: >> >> 1. we have several defined synchronous sets >> 2. each set requires a quorum of k (defined per set) >> 3. within each set, replicas are arranged in priority order. >> >> One thing which the proposal does not implement is *names* for >> synchronous sets. I would also suggest that if I lose this battle and >> we decide to go with a single stringy GUC, that we at least use JSON >> instead of defining our out, proprietary, syntax? > > JSON would be more flexible for making synchronous set, but it will > make us to change how to parse configuration file to enable a value > contains newline. Right. Well, another reason we should be using a system catalog and not a single GUC ... > In this case, If we want to use quorum commit (i.g., all replica have > same priority), > I guess that we must get ack from 2 *elements* in listed (both group1 > and group2). > If quorumm = 1, we must get ack from either group1 or group2. In that case, then priority among quorum groups is pretty meaningless, isn't it? >> I'm personally not convinced that quorum and prioritization are >> compatible. I suggest instead that quorum and prioritization should be >> exclusive alternatives, that is that a synch set should be either a >> quorum set (with all members as equals) or a prioritization set (if rep1 >> fails, try rep2). 
>> I can imagine use cases for either mode, but not one >> which would involve doing both together. >> > > Yep, separating the GUC parameter between prioritization and quorum > could be also good idea. We're agreed, then ... > Also I think that we must enable us to decide which server we should > promote when the master server is down. Yes, and probably my biggest issue with this patch is that it makes deciding which server to fail over to *more* difficult (by adding more synchronous options) without giving the DBA any more tools to decide how to fail over. Aside from "because we said we'd eventually do it", what real-world problem are we solving with this patch? I'm serious. Only if we define the real reliability/availability problem we want to solve can we decide if the new feature solves it. I've seen a lot of technical discussion about the syntax for the proposed GUC, and zilch about what's going to happen when the master fails, or who the target audience for this feature is. On 06/28/2015 05:11 AM, Michael Paquier wrote: > On Sat, Jun 27, 2015 at 2:12 AM, Josh Berkus <josh@agliodbs.com> wrote: >> Finally, while I'm raining on everyone's parade: the mechanism of >> identifying synchronous replicas by setting the application_name on the >> replica is confusing and error-prone; if we're building out synchronous >> replication into a sophisticated system, we ought to think about >> replacing it. > > I assume that you do not refer to a new parameter in the connection > string like node_name, no? Are you referring to an extension of > START_REPLICATION in the replication protocol to pass an ID? Well, if I had my druthers, we'd have a way to map client_addr (or replica IDs, which would be better, in case of network proxying) *on the master* to synchronous standby roles. Synch roles should be defined on the master, not on the replica, because it's the master which is going to stop accepting writes if they've been defined incorrectly. 
It's always been a problem that one can accomplish a de facto denial-of-service by joining a cluster using the same application_name as the synch standby, more so because it's far too easy to do that accidentally. One needs to simply make the mistake of copying recovery.conf from the synch replica instead of the async replica, and you've created a reliability problem. Also, the fact that we use application_name for synch_standby groups prevents us from giving the standbys in the group their own names for identification purposes. It's only the fact that synchronous groups are relatively useless in the current feature set that's prevented this from being a real operational problem; if we implement quorum commit, then users are going to want to use groups more often and will want to identify the members of the group, and not just by IP address. We *really* should have discussed this feature at PGCon. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus <josh@agliodbs.com> wrote: > On 06/28/2015 04:36 AM, Sawada Masahiko wrote: >> On Sat, Jun 27, 2015 at 3:53 AM, Josh Berkus <josh@agliodbs.com> wrote: >>> On 06/26/2015 11:32 AM, Robert Haas wrote: >>>> I think your proposal is worth considering, but you would need to fill >>>> in a lot more details and explain how it works in detail, rather than >>>> just via a set of example function calls. The GUC-based syntax >>>> proposal covers cases like multi-level rules and, now, prioritization, >>>> and it's not clear how those would be reflected in what you propose. >>> >>> So what I'm seeing from the current proposal is: >>> >>> 1. we have several defined synchronous sets >>> 2. each set requires a quorum of k (defined per set) >>> 3. within each set, replicas are arranged in priority order. >>> >>> One thing which the proposal does not implement is *names* for >>> synchronous sets. I would also suggest that if I lose this battle and >>> we decide to go with a single stringy GUC, that we at least use JSON >>> instead of defining our out, proprietary, syntax? >> >> JSON would be more flexible for making synchronous set, but it will >> make us to change how to parse configuration file to enable a value >> contains newline. > > Right. Well, another reason we should be using a system catalog and not > a single GUC ... I assume that this takes into account the fact that you will still need a SIGHUP to reload properly the new node information from those catalogs and to track if some information has been modified or not. And the fact that a connection to those catalogs will be needed as well, something that we don't have now. Another barrier to the catalog approach is that catalogs get replicated to the standbys, and I think that we want to avoid that. But perhaps you simply meant having an SQL interface with some metadata, right? Perhaps I got confused by the word 'catalog'. 
>>> I'm personally not convinced that quorum and prioritization are >>> compatible. I suggest instead that quorum and prioritization should be >>> exclusive alternatives, that is that a synch set should be either a >>> quorum set (with all members as equals) or a prioritization set (if rep1 >>> fails, try rep2). I can imagine use cases for either mode, but not one >>> which would involve doing both together. >>> >> >> Yep, separating the GUC parameter between prioritization and quorum >> could be also good idea. > > We're agreed, then ... Er, I disagree here. Being able to get prioritization and quorum working together is a requirement of this feature in my opinion. Using again the example above with 2 data centers, being able to define a prioritization set on the nodes of data center 1 and a quorum set in data center 2 would reduce failure probability by preventing problems where, for example, one or more nodes lag behind (improving performance at the same time). >> Also I think that we must enable us to decide which server we should >> promote when the master server is down. > > Yes, and probably my biggest issue with this patch is that it makes > deciding which server to fail over to *more* difficult (by adding more > synchronous options) without giving the DBA any more tools to decide how > to fail over. Aside from "because we said we'd eventually do it", what > real-world problem are we solving with this patch? Hm. This patch needs to be coupled with improvements to pg_stat_replication to be able to represent a node tree, basically adding the group to which a node is assigned. I can draft that if needed, I am just a bit too lazy now... Honestly, this is not a matter of tooling. Even today, if a DBA wants to change s_s_names without touching postgresql.conf, they can just run ALTER SYSTEM and then reload parameters. 
> It's always been a problem that one can accomplish a de-facto > denial-of-service by joining a cluster using the same application_name > as the synch standby, moreso because it's far too easy to do that > accidentally. One needs to simply make the mistake of copying > recovery.conf from the synch replica instead of the async replica, and > you've created a reliability problem. That's a scripting problem then. There are many ways to make a mistake in this area when setting up a standby. The application_name value is one; you can do worse by pointing to an incorrect IP, missing a firewall filter, or pointing to an incorrect port. > Also, the fact that we use application_name for synch_standby groups > prevents us from giving the standbys in the group their own names for > identification purposes. It's only the fact that synchronous groups are > relatively useless in the current feature set that's prevented this from > being a real operational problem; if we implement quorum commit, then > users are going to want to use groups more often and will want to > identify the members of the group, and not just by IP address. Managing groups in the synchronous protocol adds one level of complexity for the operator, while what I had in mind first was to allow a user to pass the server a formula that decides whether synchronous_commit is validated or not. In any case, thinking of it now, this feels like a different feature. > We *really* should have discussed this feature at PGCon. What is done is done. Sawada-san and I met last weekend, and we agreed to get a clear image of a spec for this feature on this thread before doing any coding. So let's continue the discussion. -- Michael
On 06/29/2015 01:01 AM, Michael Paquier wrote: > On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus <josh@agliodbs.com> wrote: >> Right. Well, another reason we should be using a system catalog and not >> a single GUC ... > > I assume that this takes into account the fact that you will still > need a SIGHUP to reload properly the new node information from those > catalogs and to track if some information has been modified or not. Well, my hope was NOT to need a sighup, which is something I see as a failing of the current system. > And the fact that a connection to those catalogs will be needed as > well, something that we don't have now. Hmmm? I was envisioning the catalog being used as one on the master. Why do we need an additional connection for that? Don't we already need a connection in order to update pg_stat_replication? > Another barrier to the catalog > approach is that catalogs get replicated to the standbys, and I think > that we want to avoid that. Yeah, it occurred to me that that approach has its downside as well as an upside. For example, you wouldn't want a failed-over new master to synchrep to itself. Mostly, I was looking for something reactive, relational, and validated, instead of passing an unvalidated string to pg.conf and hoping that it's accepted on reload. Also some kind of catalog approach would permit incremental changes to the config instead of wholesale replacement. > But perhaps you simply meant having an SQL > interface with some metadata, right? Perhaps I got confused by the > word 'catalog'. No, that doesn't make any sense. >>>> I'm personally not convinced that quorum and prioritization are >>>> compatible. I suggest instead that quorum and prioritization should be >>>> exclusive alternatives, that is that a synch set should be either a >>>> quorum set (with all members as equals) or a prioritization set (if rep1 >>>> fails, try rep2). I can imagine use cases for either mode, but not one >>>> which would involve doing both together. 
>>>> >>> >>> Yep, separating the GUC parameter between prioritization and quorum >>> could be also good idea. >> >> We're agreed, then ... > > Er, I disagree here. Being able to get prioritization and quorum > working together is a requirement of this feature in my opinion. Using > again the example above with 2 data centers, being able to define a > prioritization set on the set of nodes of data center 1, and a quorum > set in data center 2 would reduce failure probability by being able to > prevent problems where for example one or more nodes lag behind > (improving performance at the same time). Well, then *someone* needs to define the desired behavior for all permutations of prioritized synch sets. If it's undefined, then we're far worse off than we are now. >>> Also I think that we must enable us to decide which server we should >>> promote when the master server is down. >> >> Yes, and probably my biggest issue with this patch is that it makes >> deciding which server to fail over to *more* difficult (by adding more >> synchronous options) without giving the DBA any more tools to decide how >> to fail over. Aside from "because we said we'd eventually do it", what >> real-world problem are we solving with this patch? > > Hm. This patch needs to be coupled with improvements to > pg_stat_replication to be able to represent a node tree by basically > adding to which group a node is assigned. I can draft that if needed, > I am just a bit too lazy now... > > Honestly, this is not a matter of tooling. Even today if a DBA wants > to change s_s_names without touching postgresql.conf you could just > run ALTER SYSTEM and then reload parameters. You're confusing two separate things. The primary manageability problem has nothing to do with altering the parameter. The main problem is: if there is more than one synch candidate, how do we determine *after the master dies* which candidate replica was in synch at the time of failure? Currently there is no way to do that. 
This proposal plans to, effectively, add more synch candidate configurations without addressing that core design failure *at all*. That's why I say that this patch decreases overall reliability of the system instead of increasing it. When I set up synch rep today, I never use more than two candidate synch servers because of that very problem. And even with two I have to check replay point because I have no way to tell which replica was in-sync at the time of failure. Even in the current limited feature, this significantly reduces the utility of synch rep. In your proposal, where I could have multiple synch rep groups in multiple geos, how on Earth could I figure out what to do when the master datacenter dies? BTW, ALTER SYSTEM is a strong reason to use JSON for the synch rep GUC (assuming it's one parameter) instead of some custom syntax. If it's JSON, we can validate it in psql, whereas if it's some custom syntax we have to wait for the db to reload and fail to figure out that we forgot a comma. Using JSON would also permit us to use jsonb_set and jsonb_delete to incrementally change the configuration. Question: what happens *today* if we have two different synch rep strings in two different *.conf files? I wouldn't assume that anyone has tested this ... >> It's always been a problem that one can accomplish a de-facto >> denial-of-service by joining a cluster using the same application_name >> as the synch standby, moreso because it's far too easy to do that >> accidentally. One needs to simply make the mistake of copying >> recovery.conf from the synch replica instead of the async replica, and >> you've created a reliability problem. > > That's a scripting problem then. There are many ways to do a false > manipulation in this area when setting up a standby. application_name > value is one, you can do worse by pointing to an incorrect IP as well, > miss a firewall filter or point to an incorrect port. You're missing the point. 
We've created something unmanageable because we piggy-backed it onto features intended for something else entirely. Now you're proposing to piggy-back additional features on top of the already teetering Beijing-acrobat-stack of piggy-backs we already have. I'm saying that if you want synch rep to actually be a sophisticated, high-availability system, you need it to actually be high-availability, not just pile on additional configuration options. I'm in favor of a more robust and sophisticated synch rep. But not if nobody not on this mailing list can configure it, and not if even we don't know what it will do in an actual failure situation. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh@agliodbs.com> wrote: > On 06/29/2015 01:01 AM, Michael Paquier wrote: >> On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus <josh@agliodbs.com> wrote: > >>> Right. Well, another reason we should be using a system catalog and not >>> a single GUC ... The problem with using a system catalog to configure synchronous replication is that even a configuration change needs to wait for its WAL record (i.e., the one caused by the change of the system catalog) to be replicated. Imagine the case where you have one synchronous standby but it goes down. To keep the system up, you'd like to switch the replication mode to asynchronous by changing the corresponding system catalog. But that change may need to wait until the synchronous standby starts up again and its WAL record is successfully replicated. This means that you may need to wait forever... One approach to address this problem is to introduce something like an unlogged system catalog. I'm not sure if that causes another big problem, though... > You're confusing two separate things. The primary manageability problem > has nothing to do with altering the parameter. The main problem is: if > there is more than one synch candidate, how do we determine *after the > master dies* which candidate replica was in synch at the time of > failure? Currently there is no way to do that. This proposal plans to, > effectively, add more synch candidate configurations without addressing > that core design failure *at all*. That's why I say that this patch > decreases overall reliability of the system instead of increasing it. I agree this is a problem even today, but it's basically independent from the proposed feature *itself*. So I think that it's better to discuss and work on the problem separately. If so, we might be able to provide a good way to find the new master even if the proposed feature finally fails to be adopted. Regards, -- Fujii Masao
On 6/26/15 1:46 AM, Michael Paquier wrote: > - k(elt1,elt2,eltN) means that we need for the k elements in the set > to return true (aka commit confirmation). > - k[elt1,elt2,eltN] means that we need for the first k elements in the > set to return true. I think the difference between (...) and [...] is not intuitive. To me, {...} would be more intuitive to indicate order does not matter. > When k is not defined for a group, k = 1. How about putting it at the end? Like [foo,bar,baz](2)
On 6/26/15 2:53 PM, Josh Berkus wrote: > I would also suggest that if I lose this battle and > we decide to go with a single stringy GUC, that we at least use JSON > instead of defining our own, proprietary, syntax? Does JSON have a natural syntax for a set without order?
On 7/1/15 10:15 AM, Fujii Masao wrote: > One approach to address this problem is to introduce something like unlogged > system catalog. I'm not sure if that causes another big problem, though... Yeah, like the data disappearing after a crash. ;-)
On 6/26/15 1:12 PM, Josh Berkus wrote: > If we're going to do quorum, multi-set synchrep, then we need to have a > real management interface. Like, we really ought to have a system > catalog and some built in functions to manage this instead, e.g. > > pg_add_synch_set(set_name NAME, quorum INT, set_members VARIADIC) > > pg_add_synch_set('bolivia', 1, 'bsrv-2,'bsrv-3','bsrv-5') > > pg_modify_sync_set(quorum INT, set_members VARIADIC) > > pg_drop_synch_set(set_name NAME) I respect that some people might like this, but I don't really see this as an improvement. It's much easier for an administration person or program to type out a list of standbys in a text file than having to go through these interfaces that are non-idempotent, verbose, and only available when the database server is up. The nice thing about a plain and simple system is that you can build a complicated system on top of it, if desired.
On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh@agliodbs.com> wrote: > On 06/29/2015 01:01 AM, Michael Paquier wrote: > > You're confusing two separate things. The primary manageability problem > has nothing to do with altering the parameter. The main problem is: if > there is more than one synch candidate, how do we determine *after the > master dies* which candidate replica was in synch at the time of > failure? Currently there is no way to do that. This proposal plans to, > effectively, add more synch candidate configurations without addressing > that core design failure *at all*. That's why I say that this patch > decreases overall reliability of the system instead of increasing it. > > When I set up synch rep today, I never use more than two candidate synch > servers because of that very problem. And even with two I have to check > replay point because I have no way to tell which replica was in-sync at > the time of failure. Even in the current limited feature, this > significantly reduces the utility of synch rep. In your proposal, where > I could have multiple synch rep groups in multiple geos, how on Earth > could I figure out what to do when the master datacenter dies? We can have servers with the same application name today; it's like a group. So there are two problems regarding fail-over: 1. How can we know which group (set) we should use? (group means application_name here) 2. And how can we decide which server of that group we should promote to the next master server? #1 is one of the big problems, I think. I haven't come up with a correct solution yet, but we would need a way to know which server (group) is the best candidate for promotion without the old master server running. For example, by improving the pg_stat_replication view, or by a mediation process that always checks the progress of each standby. #2, I guess the best solution is that the DBA can promote any server of the group. That is, the DBA can always promote a server without considering the state of the other servers of that group. 
It's not difficult if we always use the lowest LSN of a group as the group LSN. > > BTW, ALTER SYSTEM is a strong reason to use JSON for the synch rep GUC > (assuming it's one parameter) instead of some custom syntax. If it's > JSON, we can validate it in psql, whereas if it's some custom syntax we > have to wait for the db to reload and fail to figure out that we forgot > a comma. Using JSON would also permit us to use jsonb_set and > jsonb_delete to incrementally change the configuration. Sounds like convenience and flexibility. I agree with this JSON format parameter only if we don't combine both quorum and prioritization, because of backward compatibility. I tend toward using a JSON format value in a new, separate GUC parameter. Anyway, if we use JSON, I'm imagining parameter values like below.

{
  "group1": {
    "quorum": 1,
    "standbys": [
      { "a": { "quorum": 2, "standbys": ["c", "d"] } },
      "b"
    ]
  }
}

> Question: what happens *today* if we have two different synch rep > strings in two different *.conf files? I wouldn't assume that anyone > has tested this ... We use the last defined parameter even if sync rep strings appear in several files, right? Regards, -- Sawada Masahiko
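[Editorial aside: one practical upside of a JSON value, as argued above, is that it can be checked before ever touching the server configuration. A minimal sketch of such client-side validation for a nested structure like the one shown; the schema, where every group is an object carrying "quorum" and "standbys", is an assumption for illustration only, not an agreed-upon format.]

```python
import json

# Toy validator for the hypothetical nested quorum config shown above.
# Schema assumed: group name -> {"quorum": N, "standbys": [...]}.
def valid_group(node):
    if isinstance(node, str):                 # leaf: a standby name
        return True
    if isinstance(node, dict) and len(node) == 1:
        ((_, spec),) = node.items()
        if not isinstance(spec, dict):
            return False
        quorum = spec.get("quorum", 1)
        standbys = spec.get("standbys", [])
        if not (isinstance(quorum, int) and 1 <= quorum <= len(standbys)):
            return False
        return all(valid_group(s) for s in standbys)
    return False

def valid_config(raw):
    try:
        return valid_group(json.loads(raw))
    except ValueError:                        # not JSON at all
        return False
```

A tool could run such a check before issuing ALTER SYSTEM, which is the kind of validation a custom syntax would not get for free.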
All: Replying to multiple people below. On 07/01/2015 07:15 AM, Fujii Masao wrote: > On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh@agliodbs.com> wrote: >> You're confusing two separate things. The primary manageability problem >> has nothing to do with altering the parameter. The main problem is: if >> there is more than one synch candidate, how do we determine *after the >> master dies* which candidate replica was in synch at the time of >> failure? Currently there is no way to do that. This proposal plans to, >> effectively, add more synch candidate configurations without addressing >> that core design failure *at all*. That's why I say that this patch >> decreases overall reliability of the system instead of increasing it. > > I agree this is a problem even today, but it's basically independent from > the proposed feature *itself*. So I think that it's better to discuss and > work on the problem separately. If so, we might be able to provide > good way to find new master even if the proposed feature finally fails > to be adopted. I agree that they're separate features. My argument is that the quorum synch feature isn't materially useful if we don't create some feature to identify which server(s) were in synch at the time the master died. The main reason I'm arguing on this thread is that discussion of this feature went straight into GUC syntax, without ever discussing: * what use cases are we serving? * what features do those use cases need? I'm saying that we need to have that discussion first before we go into syntax. We gave up on quorum commit in 9.1 partly because nobody was convinced that it was actually useful; that case still needs to be established, and if we can determine *under what circumstances* it's useful, then we can know if the proposed feature we have is what we want or not. Myself, I have two use cases for changes to sync rep: 1. 
the ability to specify a group of three replicas in the same data center, and have commit succeed if it succeeds on two of them. The purpose of this is to avoid data loss even if we lose the master and one replica. 2. the ability to specify that synch needs to succeed on two replicas in two different data centers. The idea here is to be able to ensure consistency between all data centers. Speaking of which: how does the proposed patch roll back the commit on one replica if it fails to get quorum? On 07/01/2015 07:55 AM, Peter Eisentraut wrote: > I respect that some people might like this, but I don't really see this > as an improvement. It's much easier for an administration person or > program to type out a list of standbys in a text file than having to go > through these interfaces that are non-idempotent, verbose, and only > available when the database server is up. The nice thing about a plain > and simple system is that you can build a complicated system on top of > it, if desired. I'm disagreeing that the proposed system is "plain and simple". What we have now is simple; anything we try to add on top of it is going to be much less so. Frankly, given the proposed feature, I'm not sure that a "plain and simple" implementation is *possible*; it's not a simple problem. On 07/01/2015 07:58 AM, Sawada Masahiko wrote: > On Tue, Jun 30, 2015 at > We can have same application name servers today, it's like group. > So there are two problems regarding fail-over: > 1. How can we know which group(set) we should use? (group means > application_name here) > 2. And how can we decide which a server of that group we should > promote to the next master server? Well, one possibility is to have each replica keep a flag which indicates whether it thinks it's in sync or not. This flag would be updated every time the replica sends a sync-ack to the master. 
There's a couple issues with that though: Synch Flag: the flag would need to be WAL-logged or written to disk somehow on the replica, in case of the situation where the whole data center shuts down, comes back up, and the master fails on restart. In order for the replica to WAL-log this, we'd need to add special .sync files to pg_xlog, like we currently have .history. Such a file could be getting updated thousands of times per second, which is potentially an issue. We could reduce writes by either synching to disk periodically, or having the master write the sync state to a catalog, and replicate it, but ... Race Condition: there's a bit of a race condition during adverse shutdown situations which could result in uncertainty, especially in general data center failures and network failures which might not hit all servers at the same time. If the master is wal-logging sync state, this race condition is much worse, because it's pretty much certain that one message updating sync state would be lost in the event of a master crash. Likewise, if we don't log every synch state change, we've widened the opportunity for a race condition. > #1, it's one of the big problem, I think. > I haven't came up with correct solution yet, but we would need to know > which server(group) is the best for promoting > without the running old master server. > For example, improving pg_stat_replication view. or the mediation > process always check each progress of standby. Well, pg_stat_replication is useless for promotion, because if you need to do an emergency promotion, you don't have access to that view. Mind you, adding additional synch configurations will require either extra columns in pg_stat_replication, or a new system view, but that doesn't help us for the failover issue. > #2, I guess the best solution is that the DBA can promote any server of group. > That is, DBA always can promote server without considering state of > server of that group. 
> It's not difficult, always using lowest LSN of a group as group LSN. Sure, but if we're going to do that, why use synch rep at all? Let alone quorum commit? > Sounds convenience and flexibility. I agree with this json format > parameter only if we don't combine both quorum and prioritization. > Because of backward compatibility. > I tend to use json format value and it's new separated GUC parameter. Well, we could just detect if the parameter begins with { or not. ;-) We could also do an end-run around the current GUC code by not permitting line breaks in the JSON. >> Question: what happens *today* if we have two different synch rep >> strings in two different *.conf files? I wouldn't assume that anyone >> has tested this ... > > We use last defied parameter even if sync rep strings in several file, right? Yeah, I was just wondering if anyone had tested that. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Wed, Jul 1, 2015 at 11:45 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 6/26/15 1:46 AM, Michael Paquier wrote: >> - k(elt1,elt2,eltN) means that we need for the k elements in the set >> to return true (aka commit confirmation). >> - k[elt1,elt2,eltN] means that we need for the first k elements in the >> set to return true. > > I think the difference between (...) and [...] is not intuitive. To me, > {...} would be more intuitive to indicate order does not matter. When defining a set of elements, {} defines elements one by one, while () and [] are used for ranges. Perhaps the difference is better this way. >> When k is not defined for a group, k = 1. > > How about putting it at the end? Like > > [foo,bar,baz](2) I am less convinced by that, but I won't argue against it either. -- Michael
On Wed, Jul 1, 2015 at 11:58 PM, Sawada Masahiko wrote: > On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus wrote: >> >> BTW, ALTER SYSTEM is a strong reason to use JSON for the synch rep GUC >> (assuming it's one parameter) instead of some custom syntax. If it's >> JSON, we can validate it in psql, whereas if it's some custom syntax we >> have to wait for the db to reload and fail to figure out that we forgot >> a comma. Using JSON would also permit us to use jsonb_set and >> jsonb_delete to incrementally change the configuration. > > Sounds convenience and flexibility. I agree with this json format > parameter only if we don't combine both quorum and prioritization. > Because of backward compatibility. > I tend to use json format value and it's new separated GUC parameter. This is going to make postgresql.conf unreadable. That does not look very user-friendly, and a JSON object is actually longer in characters than the formula spec proposed upthread. > Anyway, if we use json, I'm imaging parameter values like below. > [JSON] >> Question: what happens *today* if we have two different synch rep >> strings in two different *.conf files? I wouldn't assume that anyone >> has tested this ... > We use last defied parameter even if sync rep strings in several file, right? The last one wins, that's the rule in GUCs. Note that postgresql.auto.conf has the top priority over the rest, and that files included in postgresql.conf have their values considered when they are opened by the parser. Well, the JSON format has merit, if stored as metadata in PGDATA such that it is independent of WAL, in something like pg_syncdata/, and if it can be modified with a useful interface, which is where Josh's first idea could prove to be useful. We just need a clear representation of the JSON schema we would use and what kind of functions we could manipulate it with, on top of a get/set that can be used to retrieve and update the metadata as wanted. 
In order to preserve backward compatibility, s_s_names could be set to a special value that switches to the old interface. We could consider dropping it after a couple of releases, once we are sure that the new system is stable. Also, I think that we should rely on SIGHUP as a first step of the implementation to update the status of sync nodes in backend processes. As a future improvement we could perhaps get rid of it. Still, it seems safer to me to rely on a signal to update the in-memory status as a first step, as this is what we have now. -- Michael
On Thu, Jul 2, 2015 at 3:21 AM, Josh Berkus <josh@agliodbs.com> wrote: > All: > > Replying to multiple people below. > > On 07/01/2015 07:15 AM, Fujii Masao wrote: >> On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh@agliodbs.com> wrote: >>> You're confusing two separate things. The primary manageability problem >>> has nothing to do with altering the parameter. The main problem is: if >>> there is more than one synch candidate, how do we determine *after the >>> master dies* which candidate replica was in synch at the time of >>> failure? Currently there is no way to do that. This proposal plans to, >>> effectively, add more synch candidate configurations without addressing >>> that core design failure *at all*. That's why I say that this patch >>> decreases overall reliability of the system instead of increasing it. >> >> I agree this is a problem even today, but it's basically independent from >> the proposed feature *itself*. So I think that it's better to discuss and >> work on the problem separately. If so, we might be able to provide >> good way to find new master even if the proposed feature finally fails >> to be adopted. > > I agree that they're separate features. My argument is that the quorum > synch feature isn't materially useful if we don't create some feature to > identify which server(s) were in synch at the time the master died. > > The main reason I'm arguing on this thread is that discussion of this > feature went straight into GUC syntax, without ever discussing: > > * what use cases are we serving? > * what features do those use cases need? > > I'm saying that we need to have that discussion first before we go into > syntax. We gave up on quorum commit in 9.1 partly because nobody was > convinced that it was actually useful; that case still needs to be > established, and if we can determine *under what circumstances* it's > useful, then we can know if the proposed feature we have is what we want > or not. 
> > Myself, I have two use case for changes to sync rep: > > 1. the ability to specify a group of three replicas in the same data > center, and have commit succeed if it succeeds on two of them. The > purpose of this is to avoid data loss even if we lose the master and one > replica. > > 2. the ability to specify that synch needs to succeed on two replicas in > two different data centers. The idea here is to be able to ensure > consistency between all data centers. Yeah, I'm also thinking about those *simple* use cases. I'm not sure how many people really want to have a very complicated quorum commit setting. > Speaking of which: how does the proposed patch roll back the commit on > one replica if it fails to get quorum? You mean the case where there are two sync replicas and the master needs to wait until both send the ACK, and then one replica goes down? In this case, the master receives the ACK from only one replica and it must keep waiting until a new sync replica appears and sends back the ACK. So the committed transaction (written WAL record) would not be rolled back. > Well, one possibility is to have each replica keep a flag which > indicates whether it thinks it's in sync or not. This flag would be > updated every time the replica sends a sync-ack to the master. There's a > couple issues with that though: I don't think this is a good approach because there can be a case where you need to promote even a standby server not having the sync flag. Please imagine the case where you have sync and async standby servers. When the master goes down, the async standby might be ahead of the sync one. This is possible in practice. In this case, it might be better to promote the async standby instead of the sync one, because the remaining sync standby, which is behind, can easily catch up with the new master. We could promote the sync standby in this case instead. But since the remaining async standby is ahead, it's not easy for it to follow the new master. 
Probably a new base backup needs to be taken onto the async standby from the new master, or pg_rewind needs to be executed. That is, the async standby basically needs to be set up again. So I'm thinking that we basically need to check the progress on each standby to choose the new master. Regards, -- Fujii Masao
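[Editorial aside: the progress check described here amounts to comparing WAL positions across the surviving standbys. A rough Python illustration; the standby names and LSN values are invented, and on 9.x each standby would report its replay position via pg_last_xlog_replay_location().]

```python
# PostgreSQL reports WAL positions as 'hi/lo' in hexadecimal; converting
# them to integers makes the "who is furthest ahead" comparison trivial.
def lsn_to_int(lsn):
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def best_candidate(replay_positions):
    # replay_positions: {standby_name: last replayed LSN on that node}
    return max(replay_positions, key=lambda n: lsn_to_int(replay_positions[n]))

# The async standby being ahead of the sync one, as in the scenario above:
positions = {'sync1': '0/3000060', 'async1': '0/3000158'}
```

best_candidate(positions) picks 'async1' here, which is exactly why sync state alone is not enough to choose the promotion target.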
On 2015-07-02 PM 03:12, Fujii Masao wrote: > > So I'm thinking that we basically need to check the progress on each > standby to choose new master. > Does HA software determine a standby to promote based on replication progress or would things be reliable enough for it to infer one from the quorum setting specified in GUC (or wherever)? Is part of the job of this patch to make the latter possible? Just wondering or perhaps I am completely missing the point. Thanks, Amit
Amit wrote: > Does HA software determine a standby to promote based on replication > progress > or would things be reliable enough for it to infer one from the quorum > setting > specified in GUC (or wherever)? Is part of the job of this patch to make > the > latter possible? Just wondering or perhaps I am completely missing the > point. Deciding the failover standby is not exactly part of this patch, but we should be able to set up a mechanism to decide which is the best standby to be promoted. We might not be able to conclude this from the sync parameter alone. As mentioned before, in some cases an async standby could also be the most eligible for promotion. ----- -- Beena Emerson -- View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5856201.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
On Thu, Jul 2, 2015 at 3:29 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2015-07-02 PM 03:12, Fujii Masao wrote: >> >> So I'm thinking that we basically need to check the progress on each >> standby to choose new master. >> > > Does HA software determine a standby to promote based on replication progress > or would things be reliable enough for it to infer one from the quorum setting > specified in GUC (or wherever)? Is part of the job of this patch to make the > latter possible? Just wondering or perhaps I am completely missing the point. Replication progress is a factor of choice, but not the only one. The sole role of this patch is just to allow us to have more advanced policy in defining how synchronous replication works, aka how we want to let the master acknowledge a commit synchronously from a set of N standbys. In any case, this is something unrelated to the discussion happening here. -- Michael
On 2015-07-02 PM 03:52, Michael Paquier wrote: > On Thu, Jul 2, 2015 at 3:29 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2015-07-02 PM 03:12, Fujii Masao wrote: >>> >>> So I'm thinking that we basically need to check the progress on each >>> standby to choose new master. >>> >> >> Does HA software determine a standby to promote based on replication progress >> or would things be reliable enough for it to infer one from the quorum setting >> specified in GUC (or wherever)? Is part of the job of this patch to make the >> latter possible? Just wondering or perhaps I am completely missing the point. > > Replication progress is a factor of choice, but not the only one. The > sole role of this patch is just to allow us to have more advanced > policy in defining how synchronous replication works, aka how we want > to let the master acknowledge a commit synchronously from a set of N > standbys. In any case, this is something unrelated to the discussion > happening here. > Got it, thanks! Regards, Amit
On 2015-07-02 PM 03:43, Beena Emerson wrote: > Amit wrote: > >> Does HA software determine a standby to promote based on replication >> progress >> or would things be reliable enough for it to infer one from the quorum >> setting >> specified in GUC (or wherever)? Is part of the job of this patch to make >> the >> latter possible? Just wondering or perhaps I am completely missing the >> point. > > Deciding the failover standby is not exactly part of this patch but we > should be able to set up a mechanism to decide which is the best standby to > be promoted. > > We might not be able to conclude this from the sync parameter alone. > > As specified before in some cases an async standby could also be most > eligible for the promotion. > Thanks for the explanation. Regards, Amit
Hello, There has been a lot of discussion and it has become a bit confusing, so I am summarizing my understanding of it till now. Kindly let me know if I missed anything important.

Backward compatibility: We have to provide support for the current format and behavior of synchronous replication (the first running standby from the list in s_s_names). In case the new format does not use this GUC, then a special value is to be specified for s_s_names to indicate that.

Priority and quorum: Quorum treats all the standbys with the same priority, while in priority behavior each one has a different priority and ACK must be received from the specified k lowest-priority servers. I am not sure how combining both will work out. Mostly we would like to have some standbys from each data center in sync. Can that not be achieved by quorum only?

GUC parameter: There are some arguments over the text format. However, if we continue using it, specifying the number before the group is a more readable option than specifying it later. s_s_names = 3(A, (P,Q), 2(X,Y,Z)) is better compared to s_s_names = (A, (P,Q), (X,Y,Z) (2)) (3)

Catalog Method: Is it safe to assume we are not going ahead with the catalog approach? A system catalog and some built-in functions to set the sync parameters are not viable because they can cause: - the promoted master to sync rep to itself - changes to the catalog to continuously wait for ACK from a down server The main problem of an unlogged system catalog is data loss during a crash.

JSON: I agree it would make the GUC very complex and unreadable. We can consider using it as metadata. I think the only point in favor of JSON is to be able to set it using functions instead of having to edit and reload, right?

Identifying standby: The main concern with the current use of application_name seems to be that multiple standbys with the same name would form an unintentional group (maybe across data clusters too?). 
I agree it would be better to have a mechanism to uniquely identify a standby, and groups can be made using whatever method we use to set the sync requirements. The main concern mentioned, deciding which standby is to be promoted, is a separate issue altogether. ----- -- Beena Emerson
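[Editorial aside: the "number before the group" form preferred above is straightforward to parse mechanically. A toy recursive-descent sketch in Python; the (count, members) tuple output is purely illustrative, not a proposed internal representation.]

```python
import re

# Tokenizer and parser for strings like '3(A, (P,Q), 2(X,Y,Z))'.
TOKEN = re.compile(r'\s*(\d+|[A-Za-z_][A-Za-z0-9_]*|[(),])')

def tokenize(s):
    pos, out = 0, []
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if not m:
            raise ValueError('bad character at position %d' % pos)
        out.append(m.group(1))
        pos = m.end()
    return out

def parse(tokens, i=0):
    # group := [count] '(' element (',' element)* ')' | standby_name
    count = 1                      # "when k is not defined, k = 1"
    if tokens[i].isdigit():
        count, i = int(tokens[i]), i + 1
    if i < len(tokens) and tokens[i] == '(':
        i += 1
        members = []
        while tokens[i] != ')':
            node, i = parse(tokens, i)
            members.append(node)
            if tokens[i] == ',':
                i += 1
        return (count, members), i + 1
    return tokens[i], i + 1

node, _ = parse(tokenize('3(A, (P,Q), 2(X,Y,Z))'))
```

Here node comes out as (3, ['A', (1, ['P', 'Q']), (2, ['X', 'Y', 'Z'])]), showing that the nesting and default count are unambiguous in this form.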
On Thu, Jul 2, 2015 at 5:44 PM, Beena Emerson <memissemerson@gmail.com> wrote: > Hello, > There has been a lot of discussion. It has become a bit confusing. > I am summarizing my understanding of the discussion till now. > Kindly let me know if I missed anything important. > > Backward compatibility: > We have to provide support for the current format and behavior for > synchronous replication (The first running standby from list s_s_names) > In case the new format does not include GUC, then a special value to be > specified for s_s_names to indicate that. > > Priority and quorum: > Quorum treats all the standby with same priority while in priority behavior, > each one has a different priority and ACK must be received from the > specified k lowest priority servers. > I am not sure how combining both will work out. > Mostly we would like to have some standbys from each data center to be in > sync. Can it not be achieved by quorum only? So you're wondering if there is a use case where both quorum and priority are used together? For example, please imagine the case where you have two standby servers (say A and B) in the local site, and one standby server (say C) in a remote disaster recovery site. You want to set up sync replication so that the master waits for ACK from either A or B, i.e., the setting of 1(A, B). Also, only when either A or B crashes, you want to make the master wait for ACK from either the remaining local standby or C. On the other hand, you don't want to use a setting like 1(A, B, C). Because in this setting, C can be the sync standby when the master crashes, and both A and B might be far behind C. In this case, you would need to promote the remote standby server C to new master,,, this is what you'd like to avoid. The setting that you need is 1(1[A, C], 1[B, C]) in Michael's proposed grammar. Regards, -- Fujii Masao
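[Editorial aside: the 1(1[A, C], 1[B, C]) setting above can be checked mechanically against a set of ACKs. A toy Python evaluator for the proposed semantics — k(...) means any k members must ACK, k[...] means the first k listed members must ACK. The tuple encoding is assumed, and connected/disconnected state is deliberately ignored to keep the sketch short.]

```python
# Nodes are either a standby name or (kind, k, members), where kind is
# 'quorum' for k(...) and 'prio' for k[...] -- an assumed encoding.
def satisfied(node, acked):
    if isinstance(node, str):                  # leaf standby name
        return node in acked
    kind, k, members = node
    results = [satisfied(m, acked) for m in members]
    if kind == 'quorum':                       # any k of the set suffice
        return sum(results) >= k
    return all(results[:k])                    # the first k listed must ACK

# 1(1[A, C], 1[B, C]): an ACK from A or B is enough; C alone is not.
cfg = ('quorum', 1, [('prio', 1, ['A', 'C']), ('prio', 1, ['B', 'C'])])
```

With this simplification, satisfied(cfg, {'A'}) holds while satisfied(cfg, {'C'}) does not; a real implementation would let C stand in once A or B is known to be disconnected, which is the behavior Fujii describes.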
Catalog Method:
Is it safe to assume we are not going ahead with the Catalog approach?
On 07/01/2015 11:12 PM, Fujii Masao wrote: > I don't think this is good approach because there can be the case where > you need to promote even the standby server not having sync flag. > Please imagine the case where you have sync and async standby servers. > When the master goes down, the async standby might be ahead of the > sync one. This is possible in practice. In this case, it might be better to > promote the async standby instead of sync one. Because the remaining > sync standby which is behind can easily follow up with new master. If we're always going to be polling the replicas for furthest ahead, then why bother implementing quorum synch at all? That's the basic question I'm asking. What does it buy us that we don't already have? I'm serious, here. Without any additional information on synch state at failure time, I would never use quorum synch. If there's someone on this thread who *would*, let's speak to their use case and then we can actually get the feature right. Anyone? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2015-07-02 11:10:27 -0700, Josh Berkus wrote: > If we're always going to be polling the replicas for furthest ahead, > then why bother implementing quorum synch at all? That's the basic > question I'm asking. What does it buy us that we don't already have? What do those topics have to do with each other? A standby fundamentally can be further ahead than what the primary knows about, so you can't do very much with that knowledge on the master anyway. > I'm serious, here. Without any additional information on synch state at > failure time, I would never use quorum synch. If there's someone on > this thread who *would*, let's speak to their use case and then we can > actually get the feature right. Anyone? How would you otherwise ensure that your data is both on a second server in the same DC and in another DC? Which is a pretty darn common desire? Greetings, Andres Freund
On 07/02/2015 11:31 AM, Andres Freund wrote: > On 2015-07-02 11:10:27 -0700, Josh Berkus wrote: >> If we're always going to be polling the replicas for furthest ahead, >> then why bother implementing quorum synch at all? That's the basic >> question I'm asking. What does it buy us that we don't already have? > > What do those topic have to do with each other? A standby fundamentally > can be further ahead than what the primary knows about. So you can't do > very much with that knowledge on the master anyway? > >> I'm serious, here. Without any additional information on synch state at >> failure time, I would never use quorum synch. If there's someone on >> this thread who *would*, let's speak to their use case and then we can >> actually get the feature right. Anyone? > > How would you otherwise ensure that your data is both on a second server > in the same DC and in another DC? Which is a pretty darn common desire? So there's two parts to this: 1. I need to ensure that data is replicated to X places. 2. I need to *know* which places data was synchronously replicated to when the master goes down. My entire point is that (1) alone is useless unless you also have (2). And do note that I'm talking about information on the replica, not on the master, since in any failure situation we don't have the old master around to check. Say you take this case: "2" : { "local_replica", "london_server", "nyc_server" } ... which should ensure that any data which is replicated is replicated to at least two places, so that even if you lose the entire local datacenter, you have the data on at least one remote data center. EXCEPT: say you lose both the local datacenter and communication with the london server at the same time (due to transatlantic cable issues, a huge DDOS, or whatever). You'd like to promote the NYC server to be the new master, but only if it was in sync at the time its communication with the original master was lost ... except that you have no way of knowing that. 
Given that, we haven't really reduced our data loss potential or improved availability from the current 1-redundant synch rep. We still need to wait to get the London server back to figure out if we want to promote or not. Now, this configuration would reduce the data loss window: "3" : { "local_replica", "london_server", "nyc_server" } As would this one: "2" : { "local_replica", "nyc_server" } ... because we would know definitively which servers were in sync. So maybe that's the use case we should be supporting? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
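[Editor's note] Josh's three configurations can be checked with back-of-envelope arithmetic. Assuming simple k-of-n quorum semantics ("k" : { servers } means every committed transaction was ACKed by at least k members), a surviving member is guaranteed to hold every committed transaction exactly when k exceeds the number of lost members:

```python
# Sketch of the guarantee Josh is reasoning about: with a k-of-n quorum,
# every committed xact was ACKed by >= k group members, so after losing
# some servers at least k - |lost-in-group| survivors must have it.
def guaranteed_synced_survivor(k, group, lost):
    """True if some surviving group member must hold every committed xact."""
    lost_in_group = len(set(group) & set(lost))
    return k - lost_in_group >= 1

group = ["local_replica", "london_server", "nyc_server"]

# "2" : {...}: lose the local DC *and* the London link -> NYC not guaranteed in sync
print(guaranteed_synced_survivor(2, group, ["local_replica", "london_server"]))  # False

# "3" : {...}: same failure, but now NYC must have every committed xact
print(guaranteed_synced_survivor(3, group, ["local_replica", "london_server"]))  # True

# "2" : { local_replica, nyc_server }: losing local still leaves NYC guaranteed
print(guaranteed_synced_survivor(2, ["local_replica", "nyc_server"], ["local_replica"]))  # True
```

This is why the "3"-of-three and "2"-of-two variants close the data loss window while "2"-of-three does not survive a double failure.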
On 2015-07-02 11:50:44 -0700, Josh Berkus wrote: > So there's two parts to this: > > 1. I need to ensure that data is replicated to X places. > > 2. I need to *know* which places data was synchronously replicated to > when the master goes down. > > My entire point is that (1) alone is useless unless you also have (2). I think there's a good set of usecases where that's really not the case. > And do note that I'm talking about information on the replica, not on > the master, since in any failure situation we don't have the old > master around to check. How would you, even theoretically, synchronize that knowledge to all the replicas? Even when they're temporarily disconnected? > Say you take this case: > > "2" : { "local_replica", "london_server", "nyc_server" } > > ... which should ensure that any data which is replicated is replicated > to at least two places, so that even if you lose the entire local > datacenter, you have the data on at least one remote data center. > EXCEPT: say you lose both the local datacenter and communication with > the london server at the same time (due to transatlantic cable issues, a > huge DDOS, or whatever). You'd like to promote the NYC server to be the > new master, but only if it was in sync at the time its communication > with the original master was lost ... except that you have no way of > knowing that. Pick up the phone, compare the lsns, done. > Given that, we haven't really reduced our data loss potential or > improved availabilty from the current 1-redundant synch rep. We still > need to wait to get the London server back to figure out if we want to > promote or not. > > Now, this configuration would reduce the data loss window: > > "3" : { "local_replica", "london_server", "nyc_server" } > > As would this one: > > "2" : { "local_replica", "nyc_server" } > > ... because we would know definitively which servers were in sync. So > maybe that's the use case we should be supporting? 
If you want automated failover you need a leader election amongst the surviving nodes. The replay position is all they need to elect the node that's furthest ahead, and that information exists today. Greetings, Andres Freund
On 07/02/2015 12:44 PM, Andres Freund wrote: > On 2015-07-02 11:50:44 -0700, Josh Berkus wrote: >> So there's two parts to this: >> >> 1. I need to ensure that data is replicated to X places. >> >> 2. I need to *know* which places data was synchronously replicated to >> when the master goes down. >> >> My entire point is that (1) alone is useless unless you also have (2). > > I think there's a good set of usecases where that's really not the case. Please share! My plea for usecases was sincere. I can't think of any. >> And do note that I'm talking about information on the replica, not on >> the master, since in any failure situation we don't have the old >> master around to check. > > How would you, even theoretically, synchronize that knowledge to all the > replicas? Even when they're temporarily disconnected? You can't, which is why what we need to know is when the replica thinks it was last synced from the replica side. That is, a sync timestamp and lsn from the last time the replica ack'd a sync commit back to the master successfully. Based on that information, I can make an informed decision, even if I'm down to one replica. >> ... because we would know definitively which servers were in sync. So >> maybe that's the use case we should be supporting? > > If you want automated failover you need a leader election amongst the > surviving nodes. The replay position is all they need to elect the node > that's furthest ahead, and that information exists today. I can do that already. If quorum synch commit doesn't help us minimize data loss any better than async replication or the current 1-redundant, why would we want it? If it does help us minimize data loss, how? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Fri, Jul 3, 2015 at 6:54 AM, Josh Berkus <josh@agliodbs.com> wrote: > On 07/02/2015 12:44 PM, Andres Freund wrote: >> On 2015-07-02 11:50:44 -0700, Josh Berkus wrote: >>> So there's two parts to this: >>> >>> 1. I need to ensure that data is replicated to X places. >>> >>> 2. I need to *know* which places data was synchronously replicated to >>> when the master goes down. >>> >>> My entire point is that (1) alone is useless unless you also have (2). >> >> I think there's a good set of usecases where that's really not the case. > > Please share! My plea for usecases was sincere. I can't think of any. > >>> And do note that I'm talking about information on the replica, not on >>> the master, since in any failure situation we don't have the old >>> master around to check. >> >> How would you, even theoretically, synchronize that knowledge to all the >> replicas? Even when they're temporarily disconnected? > > You can't, which is why what we need to know is when the replica thinks > it was last synced from the replica side. That is, a sync timestamp and > lsn from the last time the replica ack'd a sync commit back to the > master successfully. Based on that information, I can make an informed > decision, even if I'm down to one replica. > >>> ... because we would know definitively which servers were in sync. So >>> maybe that's the use case we should be supporting? >> >> If you want automated failover you need a leader election amongst the >> surviving nodes. The replay position is all they need to elect the node >> that's furthest ahead, and that information exists today. > > I can do that already. If quorum synch commit doesn't help us minimize > data loss any better than async replication or the current 1-redundant, > why would we want it? If it does help us minimize data loss, how? In your example of "2" : { "local_replica", "london_server", "nyc_server" }, if there is not something like quorum commit, only local_replica is synch and the other two are async. 
In this case, if the local data center gets destroyed, you need to promote either london_server or nyc_server. But since they are async, they might not have the data which have been already committed in the master. So data loss! Of course, as I said yesterday, they might have all the data and no data loss happens at the promotion. But the point is that there is no guarantee that no data loss happens. OTOH, if we use quorum commit, we can guarantee that either london_server or nyc_server has all the data which have been committed in the master. So I think that quorum commit is helpful for minimizing the data loss. Regards, -- Fujii Masao
Josh Berkus wrote: > > Say you take this case: > > "2" : { "local_replica", "london_server", "nyc_server" } > > ... which should ensure that any data which is replicated is replicated > to at least two places, so that even if you lose the entire local > datacenter, you have the data on at least one remote data center. > EXCEPT: say you lose both the local datacenter and communication with > the london server at the same time (due to transatlantic cable issues, a > huge DDOS, or whatever). You'd like to promote the NYC server to be the > new master, but only if it was in sync at the time its communication > with the original master was lost ... except that you have no way of > knowing that. Please consider the following: if we have multiple replicas in each DC, we can use the following: 3(local1, 1(london1, london2), 1(nyc1, nyc2)) In this case at least one standby from each DC is a sync rep. When the local and London data centers are lost, NYC promotion can be done by comparing the LSNs. Quorum would also ensure that even if one of the standbys in a data center goes down, another can take over, preventing data loss. In the case 3(local1, london1, nyc1), if nyc1 is down, the transaction would wait indefinitely. This can be avoided. ----- -- Beena Emerson
Hello, This has been registered in the next 2015-09 CF since the majority are in favor of adding this multiple sync replication feature (with quorum/priority). A new patch will be submitted once we have reached a consensus on the design. -- Beena Emerson
Re: Synch failover WAS: Support for N synchronous standby servers - take 2
On Fri, Jul 3, 2015 at 12:18 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Jul 3, 2015 at 6:54 AM, Josh Berkus <josh@agliodbs.com> wrote: >> On 07/02/2015 12:44 PM, Andres Freund wrote: >>> On 2015-07-02 11:50:44 -0700, Josh Berkus wrote: >>>> So there's two parts to this: >>>> >>>> 1. I need to ensure that data is replicated to X places. >>>> >>>> 2. I need to *know* which places data was synchronously replicated to >>>> when the master goes down. >>>> >>>> My entire point is that (1) alone is useless unless you also have (2). >>> >>> I think there's a good set of usecases where that's really not the case. >> >> Please share! My plea for usecases was sincere. I can't think of any. >> >>>> And do note that I'm talking about information on the replica, not on >>>> the master, since in any failure situation we don't have the old >>>> master around to check. >>> >>> How would you, even theoretically, synchronize that knowledge to all the >>> replicas? Even when they're temporarily disconnected? >> >> You can't, which is why what we need to know is when the replica thinks >> it was last synced from the replica side. That is, a sync timestamp and >> lsn from the last time the replica ack'd a sync commit back to the >> master successfully. Based on that information, I can make an informed >> decision, even if I'm down to one replica. >> >>>> ... because we would know definitively which servers were in sync. So >>>> maybe that's the use case we should be supporting? >>> >>> If you want automated failover you need a leader election amongst the >>> surviving nodes. The replay position is all they need to elect the node >>> that's furthest ahead, and that information exists today. >> >> I can do that already. If quorum synch commit doesn't help us minimize >> data loss any better than async replication or the current 1-redundant, >> why would we want it? If it does help us minimize data loss, how? 
> > In your example of "2" : { "local_replica", "london_server", "nyc_server" }, > if there is not something like quorum commit, only local_replica is synch > and the other two are async. In this case, if the local data center gets > destroyed, you need to promote either london_server or nyc_server. But > since they are async, they might not have the data which have been already > committed in the master. So data loss! Of course, as I said yesterday, > they might have all the data and no data loss happens at the promotion. > But the point is that there is no guarantee that no data loss happens. > OTOH, if we use quorum commit, we can guarantee that either london_server > or nyc_server has all the data which have been committed in the master. > > So I think that quorum commit is helpful for minimizing the data loss. > Yeah, quorum commit is helpful for minimizing data loss in comparison with today's replication. But in this case, how can we know which server we should use as the next master after the local data center goes down? If we choose the wrong one, we would get data loss. Regards, -- Sawada Masahiko
On Fri, Jul 3, 2015 at 5:59 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Fri, Jul 3, 2015 at 12:18 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Fri, Jul 3, 2015 at 6:54 AM, Josh Berkus <josh@agliodbs.com> wrote: >>> On 07/02/2015 12:44 PM, Andres Freund wrote: >>>> On 2015-07-02 11:50:44 -0700, Josh Berkus wrote: >>>>> So there's two parts to this: >>>>> >>>>> 1. I need to ensure that data is replicated to X places. >>>>> >>>>> 2. I need to *know* which places data was synchronously replicated to >>>>> when the master goes down. >>>>> >>>>> My entire point is that (1) alone is useless unless you also have (2). >>>> >>>> I think there's a good set of usecases where that's really not the case. >>> >>> Please share! My plea for usecases was sincere. I can't think of any. >>> >>>>> And do note that I'm talking about information on the replica, not on >>>>> the master, since in any failure situation we don't have the old >>>>> master around to check. >>>> >>>> How would you, even theoretically, synchronize that knowledge to all the >>>> replicas? Even when they're temporarily disconnected? >>> >>> You can't, which is why what we need to know is when the replica thinks >>> it was last synced from the replica side. That is, a sync timestamp and >>> lsn from the last time the replica ack'd a sync commit back to the >>> master successfully. Based on that information, I can make an informed >>> decision, even if I'm down to one replica. >>> >>>>> ... because we would know definitively which servers were in sync. So >>>>> maybe that's the use case we should be supporting? >>>> >>>> If you want automated failover you need a leader election amongst the >>>> surviving nodes. The replay position is all they need to elect the node >>>> that's furthest ahead, and that information exists today. >>> >>> I can do that already. 
If quorum synch commit doesn't help us minimize >>> data loss any better than async replication or the current 1-redundant, >>> why would we want it? If it does help us minimize data loss, how? >> >> In your example of "2" : { "local_replica", "london_server", "nyc_server" }, >> if there is not something like quorum commit, only local_replica is synch >> and the other two are async. In this case, if the local data center gets >> destroyed, you need to promote either london_server or nyc_server. But >> since they are async, they might not have the data which have been already >> committed in the master. So data loss! Of course, as I said yesterday, >> they might have all the data and no data loss happens at the promotion. >> But the point is that there is no guarantee that no data loss happens. >> OTOH, if we use quorum commit, we can guarantee that either london_server >> or nyc_server has all the data which have been committed in the master. >> >> So I think that quorum commit is helpful for minimizing the data loss. >> > > Yeah, quorum commit is helpful for minimizing data loss in comparison > with today's replication. > But in this case, how can we know which server we should use as > the next master after the local data center goes down? > If we choose the wrong one, we would get data loss. Check the progress of each server, e.g., by using pg_last_xlog_replay_location(), and choose the server which is furthest ahead as the new master. Regards, -- Fujii Masao
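[Editor's note] Fujii's suggestion can be mechanized: poll each surviving standby for `pg_last_xlog_replay_location()` and promote the one whose LSN is highest. A minimal sketch of the comparison step; the hostnames and LSN values below are invented for illustration, and in practice each value would come from running that function on the standby:

```python
# Sketch: compare textual LSNs ("XXXXXXXX/XXXXXXXX", hex high/low 32-bit
# words) reported by each surviving standby and pick the furthest-ahead
# one as the promotion candidate.
def lsn_to_int(lsn):
    """Convert a textual LSN like '16/B374D848' to a comparable integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# In practice these would come from e.g.
#   psql -h <standby> -Atc "SELECT pg_last_xlog_replay_location()"
replay_positions = {
    "london_server": "16/B374D848",
    "nyc_server": "16/B374E120",
}

new_master = max(replay_positions, key=lambda s: lsn_to_int(replay_positions[s]))
print(new_master)  # nyc_server, since its replay LSN is higher
```

Note that plain string comparison of LSNs is not safe (the segment parts vary in width), which is why the values are decoded to integers first.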
On Fri, Jul 3, 2015 at 6:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Jul 3, 2015 at 5:59 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >> Yeah, quorum commit is helpful for minimizing data loss in comparison >> with today's replication. >> But in this case, how can we know which server we should use as >> the next master after the local data center goes down? >> If we choose the wrong one, we would get data loss. > > Check the progress of each server, e.g., by using > pg_last_xlog_replay_location(), > and choose the server which is furthest ahead as the new master. > Thanks. So we can choose the next master server by checking the progress of each server, if hot standby is enabled. And such a procedure is needed even with today's replication. I think that the #2 problem which Josh pointed out seems to be solved: 1. I need to ensure that data is replicated to X places. 2. I need to *know* which places data was synchronously replicated to when the master goes down. And we can address the #1 problem using quorum commit. Thoughts? Regards, -- Sawada Masahiko
Sawada Masahiko wrote: > > I think that the #2 problem which Josh pointed out seems to be solved: > 1. I need to ensure that data is replicated to X places. > 2. I need to *know* which places data was synchronously replicated > to when the master goes down. > And we can address the #1 problem using quorum commit. > > Thoughts? I agree. The knowledge of which servers were in sync (#2) would not actually help us determine the new master, and quorum solves #1. ----- Beena Emerson
On 2015-07-02 14:54:19 -0700, Josh Berkus wrote: > On 07/02/2015 12:44 PM, Andres Freund wrote: > > On 2015-07-02 11:50:44 -0700, Josh Berkus wrote: > >> So there's two parts to this: > >> > >> 1. I need to ensure that data is replicated to X places. > >> > >> 2. I need to *know* which places data was synchronously replicated to > >> when the master goes down. > >> > >> My entire point is that (1) alone is useless unless you also have (2). > > > > I think there's a good set of usecases where that's really not the case. > > Please share! My plea for usecases was sincere. I can't think of any. "I have important data. I want to survive both a local hardware failure (it's faster to continue using the local standby) and I want to protect myself against actual disaster striking the primary datacenter". Pretty common. > >> And do note that I'm talking about information on the replica, not on > >> the master, since in any failure situation we don't have the old > >> master around to check. > > > > How would you, even theoretically, synchronize that knowledge to all the > > replicas? Even when they're temporarily disconnected? > > You can't, which is why what we need to know is when the replica thinks > it was last synced from the replica side. That is, a sync timestamp and > lsn from the last time the replica ack'd a sync commit back to the > master successfully. Based on that information, I can make an informed > decision, even if I'm down to one replica. I think you're mashing together nearly unrelated topics. Note that we already have the last replayed lsn, and we have the timestamp of the last replayed transaction. > > If you want automated failover you need a leader election amongst the > > surviving nodes. The replay position is all they need to elect the node > > that's furthest ahead, and that information exists today. > > I can do that already. 
If quorum synch commit doesn't help us minimize > data loss any better than async replication or the current 1-redundant, > why would we want it? If it does help us minimize data loss, how? But it does make us safer against data loss? If your app gets back the commit you know that the data has made it both to the local replica and one other datacenter. And you're now safe against both the loss of the master's hardware (most likely scenario) and the loss of the entire primary datacenter. That you need additional logic to know to which other datacenter to fail over is just yet another piece (which you *can* build today).
On 07/03/2015 03:12 AM, Sawada Masahiko wrote: > Thanks. So we can choose the next master server by checking the > progress of each server, if hot standby is enabled. > And such a procedure is needed even with today's replication. > > I think that the #2 problem which Josh pointed out seems to be solved: > 1. I need to ensure that data is replicated to X places. > 2. I need to *know* which places data was synchronously replicated > to when the master goes down. > And we can address the #1 problem using quorum commit. It's not solved. I still have zero ways of knowing if a replica was in sync or not at the time the master went down. Now, you and others have argued persuasively that there are valuable use cases for quorum commit even without solving that particular issue, but there's a big difference between "we can work around this problem" and "the problem is solved". I forked the subject line because I think that the inability to identify synch replicas under failover conditions is a serious problem with synch rep *today*, and pretending that it doesn't exist doesn't help us even if we don't fix it in 9.6. Let me give you three cases where our lack of information on the replica side about whether it thinks it's in sync or not causes synch rep to fail to protect data. The first case is one I've actually seen in production, and the other two are hypothetical but entirely plausible. Case #1: two synchronous replica servers have the application name "synchreplica". An admin uses the wrong Chef template, and deploys a server which was supposed to be an async replica with the same recovery.conf template, and it ends up in the "synchreplica" group as well. Due to restarts (pushing out an update release), the new server ends up seizing and keeping sync. Then the master dies.
Because the new server wasn't supposed to be a sync replica in the first place, it is not checked; they just fail over to the furthest ahead of the two original synch replicas, neither of which was actually in synch. Case #2: "2 { local, london, nyc }" setup. At 2am, the links between data centers become unreliable, such that the on-call sysadmin disables synch rep because commits on the master are intolerably slow. Then, at 10am, the links between data centers fail entirely. The day shift, not knowing that the night shift disabled sync, fail over to London thinking that they can do so with zero data loss. Case #3: "1 { london, frankfurt }, 1 { sydney, tokyo }" multi-group priority setup. We lose communication with everything but Europe. How can we decide whether to wait to get sydney back, or to promote London immediately? I could come up with numerous other situations, but all of the three above completely reasonable cases show how having the knowledge of what time a replica thought it was last in sync is vital to preventing bad failovers and data loss, and to knowing the quantity of data loss when it can't be prevented. It's an issue *now* that the only data we have about the state of sync rep is on the master, and dies with the master. And it severely limits the actual utility of our synch rep. People implement synch rep in the first place because the "best effort" of asynch rep isn't good enough for them, and yet when it comes to failover we're just telling them "give it your best effort". -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2015-07-03 10:27:05 -0700, Josh Berkus wrote: > On 07/03/2015 03:12 AM, Sawada Masahiko wrote: > > Thanks. So we can choose the next master server by checking the > > progress of each server, if hot standby is enabled. > > And such a procedure is needed even with today's replication. > > > > I think that the #2 problem which Josh pointed out seems to be solved: > > 1. I need to ensure that data is replicated to X places. > > 2. I need to *know* which places data was synchronously replicated > > to when the master goes down. > > And we can address the #1 problem using quorum commit. > > It's not solved. I still have zero ways of knowing if a replica was in > sync or not at the time the master went down. What? You pick the standby that's furthest ahead. And you use a high enough quorum so that given your tolerance for failures you'll always be able to reach at least one of the synchronous replicas. Then you promote the one with the highest LSN. Done. This is something that gets *easier* with quorum, not harder. > I forked the subject line because I think that the inability to > identify synch replicas under failover conditions is a serious problem > with synch rep *today*, and pretending that it doesn't exist doesn't > help us even if we don't fix it in 9.6. That's just not how failovers can sanely work. And again, you *have* the information you can have on the standbys already. You *know* what/from when the last replayed xact is. > Let me give you three cases where our lack of information on the replica > side about whether it thinks it's in sync or not causes synch rep to > fail to protect data. The first case is one I've actually seen in > production, and the other two are hypothetical but entirely plausible. > > Case #1: two synchronous replica servers have the application name > "synchreplica".
An admin uses the wrong Chef template, and deploys a > server which was supposed to be an async replica with the same > recovery.conf template, and it ends up in the "synchreplica" group as > well. Due to restarts (pushing out an update release), the new server > ends up seizing and keeping sync. Then the master dies. Because the new > server wasn't supposed to be a sync replica in the first place, it is > not checked; they just fail over to the furthest ahead of the two > original synch replicas, neither of which was actually in synch. Nobody can protect you against such configuration errors. We can make it harder to misconfigure, sure, but it doesn't have anything to do with the topic at hand. > Case #2: "2 { local, london, nyc }" setup. At 2am, the links between > data centers become unreliable, such that the on-call sysadmin disables > synch rep because commits on the master are intolerably slow. Then, at > 10am, the links between data centers fail entirely. The day shift, not > knowing that the night shift disabled sync, fail over to London thinking > that they can do so with zero data loss. As I said earlier, you can check against that today by checking the last replayed timestamp. SELECT pg_last_xact_replay_timestamp(); You don't have to pick the one that used to be a sync replica. You pick the one with the most data received. If the day shift doesn't bother to check the standbys now, they'd not check either if they had some way to check whether a node was the chosen sync replica. > Case #3 "1 { london, frankfurt }, 1 { sydney, tokyo }" multi-group > priority setup. We lose communication with everything but Europe. How > can we decide whether to wait to get sydney back, or to promote London > immedately? You normally don't continue automatically at all in that situation. To avoid/minimize data loss you want to have a majority election system to select the new primary. That requires reaching the majority of the nodes. 
This isn't something specific to postgres; if you look at any solution out there, they're also doing it that way. Statically choosing which of the replicas in a group is the current sync one is a *bad* idea. You want to ensure that at least one node in a group has received the data, and stop waiting as soon as that's the case. > It's an issue *now* that the only data we have about the state of sync > rep is on the master, and dies with the master. And it severely limits > the actual utility of our synch rep. People implement synch rep in the > first place because the "best effort" of asynch rep isn't good enough > for them, and yet when it comes to failover we're just telling them > "give it your best effort". We don't tell them that, but apparently you do. This subthread is getting absurd, stopping here.
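[Editor's note] The leader-election step Andres describes, "reach a majority of the nodes, then promote the reachable node with the highest replay LSN", can be sketched in a few lines. The node names, the witness node, and the LSN values here are invented for illustration; real systems would layer this on a consensus protocol rather than a single function call:

```python
# Toy sketch of majority-gated leader election: refuse to fail over
# automatically unless a strict majority of all nodes is reachable,
# then elect the reachable node that has replayed the most WAL.
def elect_new_master(all_nodes, reachable_lsns):
    """reachable_lsns: {node: lsn_as_int} for nodes we can still contact."""
    if len(reachable_lsns) <= len(all_nodes) // 2:
        return None  # no majority: do not promote, wait for more nodes
    return max(reachable_lsns, key=reachable_lsns.get)

nodes = ["london", "frankfurt", "sydney", "tokyo", "witness"]

# Only Europe reachable (2 of 5): no majority, so no automatic promotion
print(elect_new_master(nodes, {"london": 110, "frankfurt": 108}))  # None

# Three of five reachable: promote the furthest-ahead node
print(elect_new_master(nodes, {"london": 110, "frankfurt": 108, "witness": 90}))  # london
```

The majority requirement is what prevents two partitioned halves of the cluster from both promoting a master (split brain); the LSN comparison then minimizes data loss within the winning partition.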
On Sat, Jul 4, 2015 at 2:44 AM, Andres Freund wrote: > This subthread is getting absurd, stopping here. Yeah, I agree with Andres here, we are making a mountain of nothing (Frenglish?). I'll send to the other thread some additional ideas soon using a JSON structure. -- Michael
On Thu, Jul 2, 2015 at 9:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Thu, Jul 2, 2015 at 5:44 PM, Beena Emerson <memissemerson@gmail.com> wrote: >> Hello, >> There has been a lot of discussion. It has become a bit confusing. >> I am summarizing my understanding of the discussion till now. >> Kindly let me know if I missed anything important. >> >> Backward compatibility: >> We have to provide support for the current format and behavior for >> synchronous replication (The first running standby from list s_s_names) >> In case the new format does not include GUC, then a special value to be >> specified for s_s_names to indicate that. >> >> Priority and quorum: >> Quorum treats all the standbys with the same priority, while in priority behavior >> each one has a different priority and ACK must be received from the >> specified k lowest-priority servers. >> I am not sure how combining both will work out. >> Mostly we would like to have some standbys from each data center to be in >> sync. Can it not be achieved by quorum only? > > So you're wondering if there is a use case where both quorum and priority are > used together? > > For example, please imagine the case where you have two standby servers > (say A and B) in the local site, and one standby server (say C) in a remote disaster > recovery site. You want to set up sync replication so that the master waits for > ACK from either A or B, i.e., the setting of 1(A, B). Also, only when either A > or B crashes, you want to make the master wait for ACK from either the > remaining local standby or C. On the other hand, you don't want to use a > setting like 1(A, B, C), because in this setting C can be the sync standby when > the master crashes, and both A and B might be far behind C. In this case, > you would need to promote the remote standby server C to the new master... this is what > you'd like to avoid. > > The setting that you need is 1(1[A, C], 1[B, C]) in Michael's proposed grammar. 
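Fujii-san's example is easier to reason about with a toy evaluator. The sketch below is mine, not PostgreSQL code, and it deliberately simplifies: both levels are treated as plain quorums, ignoring the priority ordering that the square brackets are meant to convey in Michael's grammar.

```python
# Toy evaluator for nested sync specs such as 1(1[A, C], 1[B, C]).
# A spec is either a standby name, or a tuple (n, members) meaning
# "n of these members must be satisfied". Priority ordering is ignored;
# this is a simplification of the proposed semantics.

def satisfied(spec, acked):
    if isinstance(spec, str):
        return spec in acked          # a standby is satisfied once it has ACKed
    n, members = spec
    return sum(satisfied(m, acked) for m in members) >= n

# 1(1[A, C], 1[B, C]) from Fujii-san's mail
spec = (1, [(1, ["A", "C"]), (1, ["B", "C"])])

print(satisfied(spec, {"A"}))    # True: A alone satisfies the first subgroup
print(satisfied(spec, set()))    # False: nothing has ACKed yet
```

Under this simplified reading an ACK from C alone also releases the commit, which is exactly the behavior the priority brackets are there to discourage; the sketch only shows the shape of the evaluation.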
> If we set the remote disaster recovery site up as a synch replica, we would get some big latencies even though we use quorum commit. So I think this case Fujii-san suggested is a good configuration, and many users would want to use it. I tend to agree with combining quorum and prioritization into one GUC parameter while keeping backward compatibility. Regards, -- Sawada Masahiko
On 07/06/2015 10:03 AM, Sawada Masahiko wrote: >> > The setting that you need is 1(1[A, C], 1[B, C]) in Michael's proposed grammar. >> > > If we set the remote disaster recovery site up as a synch replica, we > would get some big latencies even though we use quorum commit. > So I think this case Fujii-san suggested is a good configuration, and > many users would want to use it. > I tend to agree with combining quorum and prioritization into one GUC > parameter while keeping backward compatibility. OK, so here are the arguments pro-JSON and anti-JSON:

pro-JSON:
* standard syntax which is recognizable to sysadmins and devops.
* can use JSON/JSONB functions with ALTER SYSTEM SET to easily make additions/deletions from the synch rep config.
* can add group labels (see below)

anti-JSON:
* more verbose
* syntax is not backwards-compatible, we'd need a switch
* people will want to use line breaks, which we can't support

Re: group labels: I see a lot of value in being able to add names to quorum groups. Think about how this will be represented in system views; it will be difficult to show the sync status of any quorum group in any meaningful way if the group has no label, and any system-assigned label would change unpredictably from the user's perspective. To give a JSON example, let's take the case of needing to sync to two of the servers in either London or NC:

'{ "remotes" : { "london_servers" : { "quorum" : 2, "servers" : [ "london1", "london2", "london3" ] }, "nc_servers" : { "quorum" : 1, "servers" : [ "nc1", "nc2" ] } } }'

This says: as the "remotes" group, synch with a quorum of 2 servers in london and a quorum of 1 server in NC. This assumes for backwards-compatibility reasons that we support a priority list of groups of quorums, and not some other combination (see below for more on this). The advantage of having these labels is that it becomes easy to represent statuses for them:

sync_group       state     definition
remotes          waiting   { "london_servers" : { "quorum" ...
london_servers   synced    { "quorum" : 2, "servers" : ...
nc_servers       waiting   { "quorum" : 1, "servers" : [ ...

Without labels, we force the DBA to track groups by raw definitions, which would be difficult. Also, there's the question of what we do on reload with any statuses of synch groups which are currently in process, if we don't have a stable key with which to identify groups.

The other grammar issue has to do with the nesting nature of quorums and priorities. A theoretical user could want:

* a priority list of quorum groups
* a quorum group of priority lists
* a quorum group of quorum groups
* a priority list of quorum groups of quorum groups
* a quorum group of quorum groups of priority lists
... etc.

I don't really see any possible end to the possible permutations, which is why it would be good to establish some real use cases now in order to figure out what we really want to support. Absent that, my inclination is that we should implement the simplest possible thing (i.e. no nesting) for 9.5. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
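To make the proposed status view concrete, here is a sketch of how the labeled JSON config could be walked to derive a per-group state. The function name and the rule that "remotes" requires all of its subgroups are my assumptions (the latter is precisely the ambiguity questioned later in the thread), and the JSON is Josh's example with its syntax errors fixed.

```python
import json

config = json.loads("""
{ "remotes" : { "london_servers" : { "quorum" : 2,
                                     "servers" : [ "london1", "london2", "london3" ] },
                "nc_servers"     : { "quorum" : 1,
                                     "servers" : [ "nc1", "nc2" ] } } }
""")

def group_states(config, acked):
    """Return 'synced'/'waiting' per labeled group, given the ACKed standbys."""
    states = {}
    for top_label, groups in config.items():
        synced_subgroups = 0
        for label, spec in groups.items():
            met = len(set(spec["servers"]) & acked) >= spec["quorum"]
            states[label] = "synced" if met else "waiting"
            synced_subgroups += met
        # Assumption: the top-level group is synced only when all subgroups are.
        states[top_label] = "synced" if synced_subgroups == len(groups) else "waiting"
    return states

print(group_states(config, {"london1", "london3"}))
```

With only london1 and london3 ACKed, london_servers reports synced, while nc_servers and therefore remotes remain waiting.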
On 2015-07-07 AM 02:56, Josh Berkus wrote: > > Re: group labels: I see a lot of value in being able to add names to > quorum groups. Think about how this will be represented in system > views; it will be difficult to show sync status of any quorum group in > any meaningful way if the group has no label, and any system-assigned > label would change unpredictably from the user's perspective. > > To give a JSON example, let's take the case of needing to sync to two of > the servers in either London or NC: > > '{ "remotes" : { "london_servers" : { "quorum" : 2, "servers" : [ > "london1", "london2", "london3" ] }, "nc_servers" : { "quorum" : 1, > "servers" : [ "nc1", "nc2" ] } } }' > What if we write the above as: remotes-1 (london_servers-2 [london1, london2, london3], nc_servers-1 [nc1, nc2]) That requires only slightly altering the proposed format, that is, prepending the sync group label to the quorum number. The monitoring view can be made to internally generate JSON output (if needed) from it. It does not seem very ALTER SYSTEM SET friendly, but there are trade-offs either way. Just my 2c. Thanks, Amit
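Amit's compact format is simple enough that a short recursive-descent parser can turn it into the JSON-ish structure discussed above. This is an illustrative sketch only: it assumes '()' means quorum and '[]' means priority, and, echoing Michael's earlier remark about SplitIdentifierString, it would misparse a standby whose application_name itself contains a hyphen.

```python
import re

# Tokens: brackets/commas, "label-count" group headers, bare standby names.
# NOTE: a standby name containing '-' would be misread as a group header,
# the same hyphen ambiguity raised earlier in the thread.
TOKEN = re.compile(r"\s*([()\[\],]|[A-Za-z_]\w*-\d+|\w+)")

def tokenize(s):
    s, pos, tokens = s.strip(), 0, []
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if not m:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_node(tokens):
    head, rest = tokens[0], tokens[1:]
    if "-" in head:                       # "label-count" opens a group
        label, count = head.rsplit("-", 1)
        kind = "quorum" if rest[0] == "(" else "priority"   # assumed meaning
        members, rest = parse_list(rest[1:], ")" if kind == "quorum" else "]")
        return {"label": label, "kind": kind, "count": int(count),
                "members": members}, rest
    return head, rest                     # a plain standby name

def parse_list(tokens, closer):
    members = []
    while True:
        node, tokens = parse_node(tokens)
        members.append(node)
        sep, tokens = tokens[0], tokens[1:]
        if sep == closer:
            return members, tokens
        if sep != ",":
            raise ValueError(f"expected ',' or '{closer}', got '{sep}'")

def parse(s):
    node, rest = parse_node(tokenize(s))
    if rest:
        raise ValueError("trailing tokens")
    return node

print(parse("remotes-1 (london_servers-2 [london1, london2, london3], "
            "nc_servers-1 [nc1, nc2])"))
```

The output is a nested dict with label, kind, count and members keys, which a monitoring view could serialize back to JSON.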
On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh@agliodbs.com> wrote: > pro-JSON: > > * standard syntax which is recognizable to sysadmins and devops. > * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make > additions/deletions from the synch rep config. > * can add group labels (see below) If we go this way, I think that managing a JSON blob with a GUC parameter is crazy, this is way longer in character size than a simple formula because of the key names. Hence, this JSON blob should live in a separate place from postgresql.conf, not within the catalog tables, be manageable using an SQL interface, and be reloaded in backends using SIGHUP. > anti-JSON: > * more verbose > * syntax is not backwards-compatible, we'd need a switch This point is valid for the pro-JSON side as well. > * people will want to use line breaks, which we can't support Yes, this is caused by the fact of using a GUC. For a simple formula this seems fine to me though, that's what we have today for s_s_names, and using a formula is not much longer in character size than what we have now. > Re: group labels: I see a lot of value in being able to add names to > quorum groups. Think about how this will be represented in system > views; it will be difficult to show sync status of any quorum group in > any meaningful way if the group has no label, and any system-assigned > label would change unpredictably from the user's perspective. > To give a JSON example, let's take the case of needing to sync to two of > the servers in either London or NC: > > '{ "remotes" : { "london_servers" : { "quorum" : 2, "servers" : [ > "london1", "london2", "london3" ] }, "nc_servers" : { "quorum" : 1, > "servers" : [ "nc1", "nc2" ] } } }' The JSON blob managing sync node information could contain additional JSON objects that register a set of nodes as a given group. 
More simply, you could use, let's say, the following structure to store the blobs: - pg_syncinfo/global, to store the root of the formula, which could use groups. - pg_syncinfo/groups/$GROUP_NAME, to store a set of JSON blobs representing a group. > The advantage of having these labels is that it becomes easy to > represent statuses for them: > > sync_group state definition > remotes waiting { "london_servers" : { "quorum" ... > london_servers synced { "quorum" : 2, "servers" : ... > nc_servers waiting { "quorum" : 1, "servers" : [ ... > Without labels, we force the DBA to track groups by raw definitions, > which would be difficult. Also, there's the question of what we do on > reload with any statuses of synch groups which are currently in-process, > if we don't have a stable key with which to identify groups. Well, yes. > The other grammar issue has to do with the nesting nature of quorums and > priorities. A theoretical user could want: > > * a priority list of quorum groups > * a quorum group of priority lists > * a quorum group of quorum groups > * a priority list of quorum groups of quorum groups > * a quorum group of quorum groups of priority lists > ... etc. > > I don't really see any possible end to the possible permutations, which > is why it would be good to establish some real use cases from now in > order to figure out what we really want to support. Absent that, my > inclination is that we should implement the simplest possible thing > (i.e. no nesting) for 9.5. I am not sure I agree that this will simplify the work. Currently s_s_names already has 1 level, and we want to append groups to each element of it as well, meaning that we'll need at least 2 levels of nesting. -- Michael
On 07/06/2015 06:40 PM, Michael Paquier wrote: > On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh@agliodbs.com> wrote: >> pro-JSON: >> >> * standard syntax which is recognizable to sysadmins and devops. >> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make >> additions/deletions from the synch rep config. >> * can add group labels (see below) > > If we go this way, I think that managing a JSON blob with a GUC > parameter is crazy, this is way longer in character size than a simple > formula because of the key names. Hence, this JSON blob should be in a > separate place than postgresql.conf not within the catalog tables, > manageable using an SQL interface, and reloaded in backends using > SIGHUP. I'm not following this at all. What are you saying here? >> I don't really see any possible end to the possible permutations, which >> is why it would be good to establish some real use cases from now in >> order to figure out what we really want to support. Absent that, my >> inclination is that we should implement the simplest possible thing >> (i.e. no nesting) for 9.5. > > I am not sure I agree that this will simplify the work. Currently > s_s_names has already 1 level, and we want to append groups to each > element of it as well, meaning that we'll need at least 2 level of > nesting. Well, we have to draw a line somewhere, unless we're going to support infinite recursion. And if we are going to support infinite recursion, any kind of compact syntax for a GUC isn't even worth talking about ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Jul 7, 2015 at 12:51 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 07/06/2015 06:40 PM, Michael Paquier wrote: >> On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh@agliodbs.com> wrote: >>> pro-JSON: >>> >>> * standard syntax which is recognizable to sysadmins and devops. >>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make >>> additions/deletions from the synch rep config. >>> * can add group labels (see below) >> >> If we go this way, I think that managing a JSON blob with a GUC >> parameter is crazy, this is way longer in character size than a simple >> formula because of the key names. Hence, this JSON blob should be in a >> separate place than postgresql.conf not within the catalog tables, >> manageable using an SQL interface, and reloaded in backends using >> SIGHUP. > > I'm not following this at all. What are you saying here? A JSON string is longer in terms of number of characters than a formula because it contains key names, and those key names are usually repeated several times, making it harder to read in a configuration file. So what I am saying is that we do not save it as a GUC, but as separate metadata that can be accessed with a set of SQL functions to manipulate it. -- Michael
On 07/06/2015 09:56 PM, Michael Paquier wrote: > On Tue, Jul 7, 2015 at 12:51 PM, Josh Berkus <josh@agliodbs.com> wrote: >> On 07/06/2015 06:40 PM, Michael Paquier wrote: >>> On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh@agliodbs.com> wrote: >>>> pro-JSON: >>>> >>>> * standard syntax which is recognizable to sysadmins and devops. >>>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make >>>> additions/deletions from the synch rep config. >>>> * can add group labels (see below) >>> >>> If we go this way, I think that managing a JSON blob with a GUC >>> parameter is crazy, this is way longer in character size than a simple >>> formula because of the key names. Hence, this JSON blob should be in a >>> separate place than postgresql.conf not within the catalog tables, >>> manageable using an SQL interface, and reloaded in backends using >>> SIGHUP. >> >> I'm not following this at all. What are you saying here? > > A JSON string is longer in terms of number of characters than a > formula because it contains key names, and those key names are usually > repeated several times, making it harder to read in a configuration > file. So what I am saying is that we do not save it as a GUC, but as > separate metadata that can be accessed with a set of SQL functions > to manipulate it. Where, though? Someone already pointed out the issues with storing it in a system catalog, and adding an additional .conf file with a different format is too horrible to contemplate. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus wrote: > '{ "remotes" : { "london_servers" : { "quorum" : 2, "servers" : [ > "london1", "london2", "london3" ] }, "nc_servers" : { "quorum" : 1, > "servers" : [ "nc1", "nc2" ] } } }' > > This says: as the "remotes" group, synch with a quorum of 2 servers in > london and a quorum of 1 server in NC. I wanted to clarify about the format. The remotes group does not specify any quorum, only its individual elements mention the quorum. "remotes" is said to sync in london_servers "and" NC. Would absence of a quorum number in a group mean "all" elements? Or would the above be represented as follows to imply "AND" between the 2 DCs? '{ "remotes" : { "quorum" : 2, "servers" : { "london_servers" : { "quorum" : 2, "servers" : [ "london1", "london2", "london3" ] }, "nc_servers" : { "quorum" : 1, "servers" : [ "nc1", "nc2" ] } } } }' ----- Beena Emerson -- View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5856868.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Amit wrote: > What if we write the above as: > > remotes-1 (london_servers-2 [london1, london2, london3], nc_servers-1 > [nc1, nc2]) Yes, this we can consider. Thanks, ----- Beena Emerson
On Tue, Jul 7, 2015 at 2:19 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 07/06/2015 09:56 PM, Michael Paquier wrote: >> On Tue, Jul 7, 2015 at 12:51 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> On 07/06/2015 06:40 PM, Michael Paquier wrote: >>>> On Tue, Jul 7, 2015 at 2:56 AM, Josh Berkus <josh@agliodbs.com> wrote: >>>>> pro-JSON: >>>>> >>>>> * standard syntax which is recognizable to sysadmins and devops. >>>>> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make >>>>> additions/deletions from the synch rep config. >>>>> * can add group labels (see below) >>>> >>>> If we go this way, I think that managing a JSON blob with a GUC >>>> parameter is crazy, this is way longer in character size than a simple >>>> formula because of the key names. Hence, this JSON blob should be in a >>>> separate place than postgresql.conf not within the catalog tables, >>>> manageable using an SQL interface, and reloaded in backends using >>>> SIGHUP. >>> >>> I'm not following this at all. What are you saying here? >> >> A JSON string is longer in terms of number of characters than a >> formula because it contains key names, and those key names are usually >> repeated several times, making it harder to read in a configuration >> file. So what I am saying is that we do not save it as a GUC, but as >> separate metadata that can be accessed with a set of SQL functions >> to manipulate it. > > Where, though? Someone already pointed out the issues with storing it > in a system catalog, and adding an additional .conf file with a > different format is too horrible to contemplate. Something like pg_syncinfo/ coupled with a LW lock, we already do something similar for replication slots with pg_replslot/. -- Michael
Hello, Tue, Jul 7, 2015 at 02:56 AM, Josh Berkus wrote: > pro-JSON: > > * standard syntax which is recognizable to sysadmins and devops. > * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make > additions/deletions from the synch rep config. > * can add group labels (see below) Adding group labels does have a lot of value but, as Amit has pointed out, with little modification they can be included in a GUC as well. It will not make it any more complex. On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote: > Something like pg_syncinfo/ coupled with a LW lock, we already do > something similar for replication slots with pg_replslot/. I was trying to figure out how the JSON metadata can be used. It would have to be set using a given set of functions. Right? I am sorry, this question is very basic. The functions could be something like:

1. pg_add_synch_set(set_name NAME, quorum INT, is_priority bool, set_members VARIADIC)
This will be used to add a sync set. The set_members can be individual elements or another set name. The parameter is_priority is used to decide whether the set is a priority set (true) or a quorum set (false). This function call will create a folder pg_syncinfo/groups/$NAME and store the JSON blob? The root group would be automatically set by finding the group which is not included in other groups? Or can it be set by another function?

2. pg_modify_sync_set(set_name NAME, quorum INT, is_priority bool, set_members VARIADIC)
This will update pg_syncinfo/groups/$NAME to store the new values.

3. pg_drop_synch_set(set_name NAME)
This will delete the pg_syncinfo/groups/$NAME folder. Also, all the groups which included this would be updated?

4. pg_show_synch_set()
This will display the current sync setting in JSON format.

Am I missing something? Is JSON being preferred because it would be ALTER SYSTEM friendly and in a format already known to users? In a real-life scenario, at most how many groups and nesting would be expected? 
----- Beena Emerson
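For what it is worth, the hypothetical functions Beena sketches could boil down to little more than writing and removing JSON blobs under pg_syncinfo/groups/. None of these function names exist in PostgreSQL; the following only mocks the proposed storage layout in Python to make the proposal tangible.

```python
import json
import tempfile
from pathlib import Path

# Mock of the proposed pg_syncinfo/groups/$NAME layout. The function names
# mirror Beena's hypothetical SQL API; nothing here is real PostgreSQL code.

def add_synch_set(syncdir, name, quorum, is_priority, *members):
    groups = Path(syncdir) / "groups"
    groups.mkdir(parents=True, exist_ok=True)
    blob = {"quorum": quorum, "priority": is_priority, "members": list(members)}
    (groups / name).write_text(json.dumps(blob))

def drop_synch_set(syncdir, name):
    (Path(syncdir) / "groups" / name).unlink()

def show_synch_set(syncdir, name):
    return json.loads((Path(syncdir) / "groups" / name).read_text())

with tempfile.TemporaryDirectory() as pgdata:
    add_synch_set(pgdata, "london_servers", 2, False,
                  "london1", "london2", "london3")
    print(show_synch_set(pgdata, "london_servers")["quorum"])   # 2
```

The open questions in the mail (who updates referencing groups on drop, how the root group is chosen) are exactly the parts this sketch does not answer.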
On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson <memissemerson@gmail.com> wrote: > Hello, > > Tue, Jul 7, 2015 at 02:56 AM, Josh Berkus wrote: >> pro-JSON: >> >> * standard syntax which is recognizable to sysadmins and devops. >> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make >> additions/deletions from the synch rep config. >> * can add group labels (see below) > > Adding group labels does have a lot of value but, as Amit has pointed out, > with little modification they can be included in a GUC as well. It will not > make it any more complex. > > On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote: > >> Something like pg_syncinfo/ coupled with a LW lock, we already do >> something similar for replication slots with pg_replslot/. > > I was trying to figure out how the JSON metadata can be used. > It would have to be set using a given set of functions. Right? > I am sorry, this question is very basic. > > The functions could be something like: > 1. pg_add_synch_set(set_name NAME, quorum INT, is_priority bool, set_members > VARIADIC) > > This will be used to add a sync set. The set_members can be individual > elements or another set name. The parameter is_priority is used to decide > whether the set is a priority set (true) or a quorum set (false). This function call > will create a folder pg_syncinfo/groups/$NAME and store the json blob? > > The root group would be automatically set by finding the group which is not > included in other groups? or can be set by another function? > > 2. pg_modify_sync_set(set_name NAME, quorum INT, is_priority bool, > set_members VARIADIC) > > This will update the pg_syncinfo/groups/$NAME to store the new values. > > 3. pg_drop_synch_set(set_name NAME) > > This will delete the pg_syncinfo/groups/$NAME folder. Also all the groups > which included this would be updated? > > 4. pg_show_synch_set() > > this will display the current sync setting in json format. > > Am I missing something? 
> > Is JSON being preferred because it would be ALTER SYSTEM friendly and in a > format already known to users? > > In a real-life scenario, at most how many groups and nesting would be > expected? > I might be missing something, but will these functions generate WAL? If they do, we will face the situation where we need to wait forever, which Fujii-san pointed out. Regards, -- Masahiko Sawada
On Mon, Jul 13, 2015 at 9:22 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > I might be missing something, but will these functions generate WAL? > If they do, we will face the situation where we need to wait > forever, which Fujii-san pointed out. No, those functions are here to manipulate the metadata defining the quorum/priority set. We definitely do not want something that generates WAL. -- Michael
On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson <memissemerson@gmail.com> wrote: > Hello, > > Tue, Jul 7, 2015 at 02:56 AM, Josh Berkus wrote: >> pro-JSON: >> >> * standard syntax which is recognizable to sysadmins and devops. >> * can use JSON/JSONB functions with ALTER SYSTEM SET to easily make >> additions/deletions from the synch rep config. >> * can add group labels (see below) > > Adding group labels does have a lot of value but, as Amit has pointed out, > with little modification they can be included in a GUC as well. Or you can extend the custom GUC mechanism so that we can specify the groups by using them, for example, quorum_commit.mygroup1 = 'london, nyc' quorum_commit.mygroup2 = 'tokyo, pune' synchronous_standby_names = '1(mygroup1),1(mygroup2)' > On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote: > >> Something like pg_syncinfo/ coupled with a LW lock, we already do >> something similar for replication slots with pg_replslot/. > > I was trying to figure out how the JSON metadata can be used. > It would have to be set using a given set of functions. So we can use only such a set of functions to configure synch rep? I don't like that idea. Because it prevents us from configuring that while the server is not running. > Is JSON being preferred because it would be ALTER SYSTEM friendly and in a > format already known to users? At least currently ALTER SYSTEM cannot accept JSON data (e.g., the return value of a JSON function like json_build_object()) as the setting value. So I'm not sure how friendly ALTER SYSTEM and the JSON format really are. If you want to argue that, probably you need to improve ALTER SYSTEM so that JSON can be specified. > In a real-life scenario, at most how many groups and nesting would be > expected? I don't think that many groups and nestings are common. Regards, -- Fujii Masao
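Fujii-san's custom-GUC idea amounts to a simple textual expansion of group names before the formula is parsed. A minimal sketch, assuming the quorum_commit.* names from his example (none of this is an existing PostgreSQL mechanism):

```python
import re

# Illustrative GUC values from Fujii-san's mail.
gucs = {
    "quorum_commit.mygroup1": "london, nyc",
    "quorum_commit.mygroup2": "tokyo, pune",
    "synchronous_standby_names": "1(mygroup1),1(mygroup2)",
}

def expand(s_s_names, gucs):
    """Replace each identifier that names a quorum_commit.* group
    with that group's member list; leave plain standby names alone."""
    def sub(m):
        return gucs.get("quorum_commit." + m.group(1), m.group(1))
    return re.sub(r"([A-Za-z_]\w*)", sub, s_s_names)

print(expand(gucs["synchronous_standby_names"], gucs))
# 1(london, nyc),1(tokyo, pune)
```

A real implementation would also have to reject recursive group definitions and handle quoting, which this sketch ignores.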
On Mon, Jul 13, 2015 at 10:34 PM, Fujii Masao wrote: > On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson wrote: >> On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote: >> >>> Something like pg_syncinfo/ coupled with a LW lock, we already do >>> something similar for replication slots with pg_replslot/. >> >> I was trying to figure out how the JSON metadata can be used. >> It would have to be set using a given set of functions. > > So we can use only such a set of functions to configure synch rep? > I don't like that idea. Because it prevents us from configuring that > while the server is not running. If you store a json blob in a set of files of PGDATA you could update them manually there as well. That's perhaps re-inventing the wheel with what is available with GUCs though. >> Is JSON being preferred because it would be ALTER SYSTEM friendly and in a >> format already known to users? > > At least currently ALTER SYSTEM cannot accept JSON data > (e.g., the return value of a JSON function like json_build_object()) > as the setting value. So I'm not sure how friendly ALTER SYSTEM > and the JSON format really are. If you want to argue that, probably you > need to improve ALTER SYSTEM so that JSON can be specified. > >> In a real-life scenario, at most how many groups and nesting would be >> expected? > > I don't think that many groups and nestings are common. Yeah, in most common configurations people are not going to have more than 3 groups with only one level of nodes. -- Michael
On Tue, Jul 14, 2015 at 9:00 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Mon, Jul 13, 2015 at 10:34 PM, Fujii Masao wrote: >> On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson wrote: >>> On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote: >>> >>>> Something like pg_syncinfo/ coupled with a LW lock, we already do >>>> something similar for replication slots with pg_replslot/. >>> >>> I was trying to figure out how the JSON metadata can be used. >>> It would have to be set using a given set of functions. >> >> So we can use only such a set of functions to configure synch rep? >> I don't like that idea. Because it prevents us from configuring that >> while the server is not running. > > If you store a json blob in a set of files of PGDATA you could update > them manually there as well. That's perhaps re-inventing the wheel > with what is available with GUCs though. Why don't we just use GUC? If the quorum setting is not so complicated in a real scenario, GUC seems enough for that. Regards, -- Fujii Masao
On Jul 14, 2015 7:15 AM, "Fujii Masao" <masao.fujii@gmail.com> wrote: > On Tue, Jul 14, 2015 at 9:00 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: > > On Mon, Jul 13, 2015 at 10:34 PM, Fujii Masao wrote: > >> On Fri, Jul 10, 2015 at 10:06 PM, Beena Emerson wrote: > >>> On Tue, Jul 7, 2015 at 2:19 PM, Michael Paquier wrote: > >>>> Something like pg_syncinfo/ coupled with a LW lock, we already do > >>>> something similar for replication slots with pg_replslot/. > >>> I was trying to figure out how the JSON metadata can be used. > >>> It would have to be set using a given set of functions. > >> So we can use only such a set of functions to configure synch rep? > >> I don't like that idea. Because it prevents us from configuring that > >> while the server is not running. > > If you store a json blob in a set of files of PGDATA you could update > > them manually there as well. That's perhaps re-inventing the wheel > > with what is available with GUCs though. > Why don't we just use GUC? If the quorum setting is not so complicated > in a real scenario, GUC seems enough for that. I agree GUC would be enough. We could also name groups in it. I am thinking of the following format similar to JSON: <group_name>: <count> (<list>), with square brackets used for priority. Ex: s_s_names = 'remotes: 2 (london: 1 [lndn1, lndn2], nyc: 1 [nyc1, nyc2])' Regards, Beena Emerson
>
> On Thu, Jun 25, 2015 at 8:32 PM, Simon Riggs wrote:
> > Let's start with a complex, fully described use case then work out how to
> > specify what we want.
>
> Well, one of the simplest cases where quorum commit and this
> feature would be useful is the following, with 2 data centers:
> - on center 1, master A and standby B
> - on center 2, standby C and standby D
> With the current synchronous_standby_names, what we can do now is
> ensuring that one node has acknowledged the commit of master. For
> example synchronous_standby_names = 'B,C,D'. But you know that :)
> What this feature would allow us to do is for example being able to
> ensure that a node on the data center 2 has acknowledged the commit of
> master, meaning that even if data center 1 is completely lost for one
> reason or another we have at least one node on center 2 that has lost
> no data at transaction commit.
>
> Now, regarding the way to express that, we need to use a concept of
> node group for each element of synchronous_standby_names. A group
> contains a set of elements, each element being a group or a single
> node. And for each group we need to know three things when a commit
> needs to be acknowledged:
> - Does my group need to acknowledge the commit?
> - If yes, how many elements in my group need to acknowledge it?
> - Does the order of my elements matter?
>
> That's where the micro-language idea makes sense to use.
So there are two parts to this:
1. I need to ensure that data is replicated to X places.
2. I need to *know* which places data was synchronously replicated to
when the master goes down.
My entire point is that (1) alone is useless unless you also have (2).
And do note that I'm talking about information on the replica, not on
the master, since in any failure situation we don't have the old master
around to check.
I'm in favor of a more robust and sophisticated synch rep. But not if
nobody not on this mailing list can configure it, and not if even we
don't know what it will do in an actual failure situation.
On Wed, Jul 15, 2015 at 3:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> pg_replslot has persistent state. We are discussing permanent configuration
> data for which I don't see the need to create an additional parallel
> infrastructure just to store a string, given the stated objection that the
> string is fairly long. AFAICS it's not even that long.
>
> ...
>
> JSON seems the most sensible format for the string. Inventing a new one
> doesn't make sense. Most important for me is the ability to programmatically
> manipulate/edit the config string, which would be harder with a new custom
> format.
>
> ...
>
> Group labels are essential.

OK, so this is leading us to the following points:
- Use a JSON object to define the quorum/priority groups for the sync state.
- Store it as a GUC, and use the check hook to validate its format,
which is what we have now with s_s_names.
- Rely on SIGHUP to maintain an in-memory image of the quorum/priority
sync state.
- Have the possibility to define group labels in this JSON blob, and
be able to use those labels in a quorum or priority sync definition.
- For backward compatibility, use for example s_s_names = 'json' to
switch to the new system.

Also, as a first step of the implementation, do we actually need a set
of functions to manipulate the JSON blob? I mean, we could perhaps
have them in contrib/, but they do not seem mandatory as long as we
document correctly how to define a label group and a quorum or
priority group, no?
-- 
Michael
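The check-hook idea in the list above can be mocked up quickly outside the server. A minimal sketch in Python of what such format validation might look like at reload time (the key names follow the examples in this thread; the function and its exact rules are hypothetical, not the actual GUC check hook):

```python
import json

def check_sync_json(raw):
    """Reject anything but a well-formed sync definition.

    Illustrative only: the accepted shape (a "sync_standby_names" object
    carrying a "quorum" or "priority" count plus a "nodes" list) follows
    the examples in this thread, not a settled format.
    """
    try:
        blob = json.loads(raw)
    except ValueError:
        # Not valid JSON at all.
        return False
    if not isinstance(blob, dict):
        return False
    expr = blob.get("sync_standby_names")
    if not isinstance(expr, dict):
        return False
    # The expression must carry exactly the kind of count we expect.
    if not (isinstance(expr.get("quorum"), int)
            or isinstance(expr.get("priority"), int)):
        return False
    return isinstance(expr.get("nodes"), list)
```

A real check hook would of course also have to validate the nested entries recursively; this only shows that the JSON parser gives us the first layer of validation for free.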
Simon Riggs wrote:
> JSON seems the most sensible format for the string. Inventing a new one
> doesn't make sense. Most important for me is the ability to
> programmatically manipulate/edit the config string, which would be harder
> with a new custom format.

Do we need to keep the value consistent across all the servers in the
flock? If not, is the behavior halfway sane upon failover?

If we need the DBA to keep the value in sync manually, that's going to
be a recipe for trouble. Which is going to bite particularly hard
during those stressing moments when disaster strikes and things have to
be done in emergency mode.
-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 15, 2015 at 5:03 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>> Group labels are essential.
>
> OK, so this is leading us to the following points:
> - Use a JSON object to define the quorum/priority groups for the sync state.
> - Store it as a GUC, and use the check hook to validate its format,
> which is what we have now with s_s_names.
> - Rely on SIGHUP to maintain an in-memory image of the quorum/priority
> sync state.
> - Have the possibility to define group labels in this JSON blob, and
> be able to use those labels in a quorum or priority sync definition.
> - For backward compatibility, use for example s_s_names = 'json' to
> switch to the new system.

Personally, I think we're going to find that using JSON for this
rather than a custom syntax makes the configuration strings two or
three times as long for no discernible benefit. But I just work here.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 16, 2015 at 1:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Personally, I think we're going to find that using JSON for this
>> rather than a custom syntax makes the configuration strings two or
>> three times as long for
>
> They may well be 2-3 times as long. Why is that a negative?

In my opinion, brevity makes things easier to read and understand. We
also don't support multi-line GUCs, so if your configuration takes 140
characters, you're going to have a very long line in your
postgresql.conf (and in your pg_settings output, etc.)

> * No additional code required in the server to support this syntax (so no
> bugs)

I think you'll find that this is far from true. Presumably not any
arbitrary JSON object will be acceptable. You'll have to parse it as
JSON, and then validate that it is of the expected form. It may not
be MORE code than implementing a mini-language from scratch, but I
wouldn't expect to save much.

> * Developers will immediately understand the format

I doubt it. I think any format that we pick will have to be carefully
documented. People may know what JSON looks like in general, but they
will not immediately know what bells and whistles are available in
this context.

> * Easy to programmatically manipulate in a range of languages

I agree that JSON has that advantage, but I doubt that it is important
here. I would expect that people might need to generate a new config
string and dump it into postgresql.conf, but that should be easy with
any reasonable format. I think it will be rare to need to parse the
postgresql.conf string, manipulate it programmatically, and then put it
back. As we've already said, most configurations are simple and
shouldn't change frequently. If they're not or they do, that's a
problem of itself. However, I'm not trying to ram my idea through; I'm
just telling you my opinion.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
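The two-or-three-times estimate is easy to sanity-check with a throwaway script. A Python sketch using a hypothetical pair of equivalent configurations in the two shapes discussed in this thread (neither string is a settled format):

```python
import json

# Hypothetical equivalent settings: a JSON blob vs. a mini-language formula,
# both modeled on the "remotes / london / nyc" examples from this thread.
json_form = json.dumps({
    "remotes": {
        "quorum": 2,
        "servers": [
            {"london": {"priority": 2, "servers": ["lndn1", "lndn2", "lndn3"]}},
            {"nyc": {"priority": 1, "servers": ["ny1", "ny2"]}},
        ],
    }
}, separators=(",", ":"))  # most compact rendering, no whitespace

mini_form = "remotes: 2(london: 2[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])"

# Even in its most compact form, the JSON blob lands in the
# "two or three times as long" range.
ratio = len(json_form) / len(mini_form)
```

Pretty-printing the JSON for readability would only widen the gap.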
Robert Haas wrote:
> On Thu, Jul 16, 2015 at 1:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> They may well be 2-3 times as long. Why is that a negative?
>
> In my opinion, brevity makes things easier to read and understand. We
> also don't support multi-line GUCs, so if your configuration takes 140
> characters, you're going to have a very long line in your
> postgresql.conf (and in your pg_settings output, etc.)
>
> [...]

All points here are valid and I would prefer a new language over JSON.
I agree the new validation code would have to be properly tested to
avoid bugs, but it won't be too difficult.

Also, I think methods that generate a WAL record should be avoided,
because any attempt to change the syncrep settings would go into an
indefinite wait when a mandatory sync candidate (as per the current
settings) goes down (explained in earlier post id:
CAHGQGwE_-HCzw687B4SdMWqAkkPcu-uxmF3MKyDB9mu38cJ7Jg@mail.gmail.com).

-----
Beena Emerson
On 7/16/15 12:40 PM, Robert Haas wrote:
>> They may well be 2-3 times as long. Why is that a negative?
>
> In my opinion, brevity makes things easier to read and understand. We
> also don't support multi-line GUCs, so if your configuration takes 140
> characters, you're going to have a very long line in your
> postgresql.conf (and in your pg_settings output, etc.)

Brevity goes both ways, but I don't think that's the real problem here;
it's the lack of multi-line support. The JSON that's been proposed makes
you work really hard to track what level of nesting you're at, while
every alternative format I've seen is terse enough to be very clear on a
single line.

I'm guessing it'd be really ugly/hard to support at least this GUC being
multi-line?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 07/17/2015 04:36 PM, Jim Nasby wrote:
> On 7/16/15 12:40 PM, Robert Haas wrote:
>>> They may well be 2-3 times as long. Why is that a negative?
>> In my opinion, brevity makes things easier to read and understand. We
>> also don't support multi-line GUCs, so if your configuration takes 140
>> characters, you're going to have a very long line in your
>> postgresql.conf (and in your pg_settings output, etc.)
>
> Brevity goes both ways, but I don't think that's the real problem here;
> it's the lack of multi-line support. The JSON that's been proposed makes
> you work really hard to track what level of nesting you're at, while
> every alternative format I've seen is terse enough to be very clear on a
> single line.

I will point out that the proposed non-JSON syntax does not offer any
ability to name consensus/priority groups. I believe that being able to
name groups is vital to managing any complex synch rep, but if we add
names it will make the non-JSON syntax less compact.

> I'm guessing it'd be really ugly/hard to support at least this GUC being
> multi-line?

Yes. Mind you, multi-line GUCs would be useful otherwise, but we don't
want to hinge this feature on making that work.
-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> On 07/17/2015 04:36 PM, Jim Nasby wrote:
>> I'm guessing it'd be really ugly/hard to support at least this GUC being
>> multi-line?

> Mind you, multi-line GUCs would be useful otherwise, but we don't want
> to hinge this feature on making that work.

I'm pretty sure that changing the GUC parser to allow quoted strings to
continue across lines would be trivial. The problem with it is not that
it's hard, it's that omitting a closing quote mark would then result in
the entire file being syntactically broken, with the error message(s)
almost certainly pointing somewhere else than where the actual mistake is.
Do we really want such a global reduction in friendliness to make this
feature easier?

			regards, tom lane
Simon Riggs wrote:
> synchronous_standby_name= is already 25 characters, so that leaves 115
> characters - are they always single byte chars?

I am sorry, I did not get why there is a 140-byte limit. Can you please
explain?

-----
Beena Emerson
Simon Riggs wrote:
> The choice between formats is not
> solely predicated on whether we have multi-line support.
>
> I still think writing down some actual use cases would help bring the
> discussion to a conclusion. Inventing a general facility is hard without
> some clear goals about what we need to support.

We need to at least support the following:
- Grouping: Specify a set of standbys along with the minimum number of
commits required from the group.
- Group Type: Groups can either be priority or quorum groups.
- Group names: to simplify status reporting.
- Nesting: At least 2 levels of nesting.

Using JSON, a sync rep parameter to replicate in 2 different clusters
could be written as:

{"remotes":
    {"quorum": 2,
     "servers": [{"london":
                     {"priority": 2,
                      "servers": ["lndn1", "lndn2", "lndn3"]
                 }}
                 ,
                 {"nyc":
                     {"priority": 1,
                      "servers": ["ny1", "ny2"]
                 }}
                ]
    }
}

The same parameter in the new language (as suggested above) could be
written as:
'remotes: 2(london: 1[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'

Also, I was thinking the name of the main group could be optional.
Internally, it can be given the name 'default group' or 'main group' for
status reporting. The above could also be written as:
'2(london: 2[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'

Backward compatible:
In JSON, while validating we may have to check if it starts with '{' to
go for JSON parsing, else proceed with the current method.
A,B,C => 1[A,B,C]. This can be added in the new parser code.

Thoughts?

-----
Beena Emerson
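The compact grammar above is simple enough that a recursive-descent parser fits in a few dozen lines. A Python sketch of one plausible reading of it, where `N(...)` is a quorum group, `N[...]` a priority group, and `name:` an optional label; this is a guess at the grammar for illustration, not the patch's actual parser, and error handling is kept minimal:

```python
import re

# One token per match: an identifier, a count, or a single punctuation mark.
TOKEN = re.compile(r'\s*([A-Za-z_][A-Za-z0-9_]*|\d+|[():,\[\]])')

def tokenize(s):
    tokens, pos = [], 0
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if not m:
            raise ValueError("unexpected character at offset %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse(s):
    node, rest = parse_node(tokenize(s))
    if rest:
        raise ValueError("trailing tokens: %r" % rest)
    return node

def parse_node(toks):
    """Parse one element: an optional 'label:', then either a counted
    group N(...) / N[...] or a bare standby name."""
    name = None
    if toks[0][0].isalpha() and len(toks) > 1 and toks[1] == ':':
        name, toks = toks[0], toks[2:]
    if toks and toks[0].isdigit():
        count, toks = int(toks[0]), toks[1:]
        kind = 'quorum' if toks[0] == '(' else 'priority'
        close = ')' if toks[0] == '(' else ']'
        toks = toks[1:]
        members = []
        while toks[0] != close:
            member, toks = parse_node(toks)
            members.append(member)
            if toks[0] == ',':
                toks = toks[1:]
        return {'name': name, kind: count, 'nodes': members}, toks[1:]
    return toks[0], toks[1:]   # a bare standby name

```

The result is a nested dict mirroring the JSON form, and the backward-compatible rewrite suggested above would turn the legacy 'A,B,C' into '1[A,B,C]' before parsing.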
On Mon, Jul 20, 2015 at 9:59 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Simon Riggs wrote:
>
>> The choice between formats is not
>> solely predicated on whether we have multi-line support.
>
>> I still think writing down some actual use cases would help bring the
>> discussion to a conclusion. Inventing a general facility is hard without
>> some clear goals about what we need to support.
>
> We need to at least support the following:
> - Grouping: Specify a set of standbys along with the minimum number of
> commits required from the group.
> - Group Type: Groups can either be priority or quorum groups.

As far as I understood, at the lowest level a group is just an alias
for a list of nodes; quorum or priority are properties that can be
applied to a group of nodes when this group is used in the expression
that defines what synchronous commit means.

> - Group names: to simplify status reporting.
> - Nesting: At least 2 levels of nesting.

If I am following correctly, at the first level there is the
definition of the top-level objects, like groups and the sync expression.

> Using JSON, a sync rep parameter to replicate in 2 different clusters
> could be written as:
>
> {"remotes":
>     {"quorum": 2,
>      "servers": [{"london":
>                      {"priority": 2,
>                       "servers": ["lndn1", "lndn2", "lndn3"]
>                  }}
>                  ,
>                  {"nyc":
>                      {"priority": 1,
>                       "servers": ["ny1", "ny2"]
>                  }}
>                 ]
>     }
> }
>
> The same parameter in the new language (as suggested above) could be
> written as:
> 'remotes: 2(london: 1[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'

OK, there is a typo. That's actually 2(london: 2[lndn1, lndn2, lndn3],
nyc: 1[ny1, ny2]) in your grammar. Honestly, if we want group aliases,
I think that JSON makes the most sense. One of the advantages of a
group is that you can use it in several places in the blob and set
different properties on it, hence we should be able to define a
group outside the sync expression. Hence I would think that something
like this makes more sense:

{
    "sync_standby_names":
    {
        "quorum":2,
        "nodes":
        [
            {"priority":1,"group":"cluster1"},
            {"quorum":2,"nodes":["node1","node2","node3"]}
        ]
    },
    "groups":
    {
        "cluster1":["node11","node12","node13"],
        "cluster2":["node21","node22","node23"]
    }
}

> Also, I was thinking the name of the main group could be optional.
> Internally, it can be given the name 'default group' or 'main group' for
> status reporting.
>
> The above could also be written as:
> '2(london: 2[lndn1, lndn2, lndn3], nyc: 1[ny1, ny2])'
>
> Backward compatible:
> In JSON, while validating we may have to check if it starts with '{' to
> go for JSON parsing, else proceed with the current method.

Something worth noticing: application_name can begin with "{".

> A,B,C => 1[A,B,C]. This can be added in the new parser code.

This makes sense. We could do the same for the JSON-based format as
well, by reusing the in-memory structure used to deparse the blob when
the former grammar is used.
-- 
Michael
On Tue, Jul 21, 2015 at 3:50 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Jul 20, 2015 at 9:59 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>> We need to at least support the following:
>> - Grouping: Specify a set of standbys along with the minimum number of
>> commits required from the group.
>> - Group Type: Groups can either be priority or quorum groups.
>
> As far as I understood, at the lowest level a group is just an alias
> for a list of nodes; quorum or priority are properties that can be
> applied to a group of nodes when this group is used in the expression
> that defines what synchronous commit means.
>
>> - Group names: to simplify status reporting.
>> - Nesting: At least 2 levels of nesting.
>
> If I am following correctly, at the first level there is the
> definition of the top-level objects, like groups and the sync expression.

The grouping and using the same application_name on different servers
are similar. How does the same application_name on different servers
work?

> [...]
>
> OK, there is a typo. That's actually 2(london: 2[lndn1, lndn2, lndn3],
> nyc: 1[ny1, ny2]) in your grammar. Honestly, if we want group aliases,
> I think that JSON makes the most sense. One of the advantages of a
> group is that you can use it in several places in the blob and set
> different properties on it, hence we should be able to define a
> group outside the sync expression. Hence I would think that something
> like this makes more sense:
>
> {
>     "sync_standby_names":
>     {
>         "quorum":2,
>         "nodes":
>         [
>             {"priority":1,"group":"cluster1"},
>             {"quorum":2,"nodes":["node1","node2","node3"]}
>         ]
>     },
>     "groups":
>     {
>         "cluster1":["node11","node12","node13"],
>         "cluster2":["node21","node22","node23"]
>     }
> }

If I validate the s_s_names JSON syntax, I will definitely use JSONB
rather than JSON, because JSONB already has some useful operator
functions for adding a node to or deleting a node from s_s_names.

But the downside of using JSONB for s_s_names is that it can switch
keys around (and remove duplicate keys). For example, with the syntax
Michael suggested, just casting to JSON keeps the blob as written, with
"sync_standby_names" first and "groups" second, while JSONB (shown here
with jsonb_pretty) reorders the keys:

{
    "groups": {
        "cluster1": [
            "node11",
            "node12",
            "node13"
        ],
        "cluster2": [
            "node21",
            "node22",
            "node23"
        ]
    },
    "sync_standby_names": {
        "nodes": [
            {
                "group": "cluster1",
                "priority": 1
            },
            {
                "nodes": [
                    "node1",
                    "node2",
                    "node3"
                ],
                "quorum": 2
            }
        ],
        "quorum": 2
    }
}

"groups" and "sync_standby_names" have switched places. I'm not sure
it's good for the users.

Regards,

--
Masahiko Sawada
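The reordering shown above is inherent to any store that normalizes keys, as jsonb does. A small Python illustration of the same effect (Python's `sort_keys` sorts keys purely lexicographically, while jsonb orders them by length then byte value, but for these two keys both rules move "groups" in front):

```python
import json

original = '{"sync_standby_names": {"quorum": 2}, "groups": {"cluster1": ["node11"]}}'

# A plain json round-trip keeps the author's key order...
as_written = list(json.loads(original))

# ...while a normalizing store hands the keys back reordered,
# the way jsonb_pretty does in the example above.
normalized = json.dumps(json.loads(original), sort_keys=True)
```

So the question is purely cosmetic: the blob means the same thing either way, it just no longer reads top-down in the order the DBA wrote it.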
On Wed, Jul 29, 2015 at 9:03 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Tue, Jul 21, 2015 at 3:50 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> As far as I understood, at the lowest level a group is just an alias
>> for a list of nodes; quorum or priority are properties that can be
>> applied to a group of nodes when this group is used in the expression
>> that defines what synchronous commit means.
>
> The grouping and using the same application_name on different servers
> are similar. How does the same application_name on different servers
> work?

In the case of a priority group, both nodes get the same priority.
Imagine for example that we need to wait for 2 nodes with lower
priority: with node1 at priority 1, node2 at priority 2 and again node2
at priority 2, we would wait for the first one, and then one of the
second. In a quorum group, any of them could be qualified for selection.

> If I validate the s_s_names JSON syntax, I will definitely use JSONB
> rather than JSON, because JSONB already has some useful operator
> functions for adding a node to or deleting a node from s_s_names.
>
> But the downside of using JSONB for s_s_names is that it can switch
> keys around (and remove duplicate keys).
>
> [...]
>
> "groups" and "sync_standby_names" have switched places. I'm not sure
> it's good for the users.

I think that's perfectly fine.
-- 
Michael
Hello,

Just looking at how the 2 different methods can be used to set the
s_s_names value.

1. For a simple case where quorum is required for a single group, the
JSON could be:

{
    "sync_standby_names":
    {
        "quorum":2,
        "nodes":
        [ "node1","node2","node3" ]
    }
}

or

{
    "sync_standby_names":
    {
        "quorum":2,
        "group": "cluster1"
    },
    "groups":
    {
        "cluster1":["node1","node2","node3"]
    }
}

Language:
2(node1, node2, node3)

2. For having quorum between different groups and a node:

{
    "sync_standby_names":
    {
        "quorum":2,
        "nodes":
        [
            {"priority":1,"nodes":["node0"]},
            {"quorum":2,"group": "cluster1"}
        ]
    },
    "groups":
    {
        "cluster1":["node1","node2","node3"]
    }
}

or

{
    "sync_standby_names":
    {
        "quorum":2,
        "nodes":
        [
            {"priority":1,"group": "cluster2"},
            {"quorum":2,"group": "cluster1"}
        ]
    },
    "groups":
    {
        "cluster1":["node1","node2","node3"],
        "cluster2":["node0"]
    }
}

Language:
2 (node0, cluster1: 2(node1, node2, node3))

Since there will not be much nesting and grouping, I still prefer the
new language to JSON. I understand one can easily modify/add groups in
JSON using built-in functions, but I think changes will not be done too
often.

-----
Beena Emerson
On Sun, Jul 19, 2015 at 4:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> On 07/17/2015 04:36 PM, Jim Nasby wrote:
>>> I'm guessing it'd be really ugly/hard to support at least this GUC being
>>> multi-line?
>
>> Mind you, multi-line GUCs would be useful otherwise, but we don't want
>> to hinge this feature on making that work.
>
> I'm pretty sure that changing the GUC parser to allow quoted strings to
> continue across lines would be trivial. The problem with it is not that
> it's hard, it's that omitting a closing quote mark would then result in
> the entire file being syntactically broken, with the error message(s)
> almost certainly pointing somewhere else than where the actual mistake is.
> Do we really want such a global reduction in friendliness to make this
> feature easier?

Maybe shoehorning this into the GUC mechanism is the wrong thing, and
what we really need is a new config file for this. The information
we're proposing to store seems complex enough to justify that.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 30, 2015 at 2:16 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Just looking at how the 2 different methods can be used to set the
> s_s_names value.
>
> [...]
>
> Language:
> 2 (node0, cluster1: 2(node1, node2, node3))
>
> Since there will not be much nesting and grouping, I still prefer the
> new language to JSON. I understand one can easily modify/add groups in
> JSON using built-in functions, but I think changes will not be done too
> often.

If we decide to use a dedicated language, a syntax checker for that
language is needed, via SQL or something. Otherwise we will not be able
to know whether the parsing of that value will be done correctly until
reloading or restarting the server.

Regards,

--
Masahiko Sawada
On Tue, Aug 4, 2015 at 2:57 PM, Masahiko Sawada wrote: > On Thu, Jul 30, 2015 at 2:16 PM, Beena Emerson wrote: >> Since there will not be much nesting and grouping, I still prefer the new >> language to JSON. >> I understand one can easily modify/add groups in JSON using built-in >> functions, but I think changes will not be done too often. >> > > If we decide to use a dedicated language, a syntax checker for that > language is needed, via SQL or something. Well, sure, both approaches have downsides. > Otherwise we will not be able to know whether that value will be parsed > correctly until reloading or restarting the server. And this is the case for any format as well. String format validation for a GUC occurs when the server is reloaded or restarted; one advantage of JSON is that the parser validator is already there, so we don't need to reinvent new machinery for that. -- Michael
Michael Paquier wrote: > And this is the case for any format as well. String format validation > for a GUC occurs when the server is reloaded or restarted; one advantage > of JSON is that the parser validator is already there, so we don't need > to reinvent new machinery for that. IIUC, we would also have to add additional code to check that the given JSON has the required keys and entries. For example, the "group" mentioned in the "s_s_names" should be defined in the "groups" section, etc. ----- Beena Emerson -- View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5860758.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
On Tue, Aug 4, 2015 at 3:27 PM, Beena Emerson <memissemerson@gmail.com> wrote: > Michael Paquier wrote: >> And this is the case for any format as well. String format validation >> for a GUC occurs when the server is reloaded or restarted; one advantage >> of JSON is that the parser validator is already there, so we don't need >> to reinvent new machinery for that. > > IIUC, we would also have to add additional code to check that the > given JSON has the required keys and entries. For example, the "group" mentioned > in the "s_s_names" should be defined in the "groups" section, etc. Yep, true as well. -- Michael
Robert Haas wrote: >Maybe shoehorning this into the GUC mechanism is the wrong thing, and >what we really need is a new config file for this. The information >we're proposing to store seems complex enough to justify that. > I think the consensus is that JSON is better. And using a new file with multi-line support would be good. Name of the file: how about pg_syncinfo.conf? Backward compatibility: synchronous_standby_names will be supported. synchronous_standby_names='pg_syncinfo' indicates use of the new file. JSON format: It would contain 2 main keys: "sync_info" and "groups". The "sync_info" would consist of "quorum"/"priority" with the count and "nodes"/"group" with the group name or node list. The optional "groups" key would list out all the "group" mentioned within "sync_info" along with the node list. Ex: 1. { "sync_info": { "quorum":2, "nodes": [ "node1","node2", "node3" ] } } 2. { "sync_info": { "quorum":2, "nodes": [ {"priority":1,"group":"cluster1"}, {"quorum":2,"group": "cluster2"}, "node99" ] }, "groups": { "cluster1":["node11","node12"], "cluster2":["node21","node22","node23"] } } Thoughts? ----- Beena Emerson -- View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5860791.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
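For what it's worth, the structural checks discussed in this subthread (each rule carrying a "quorum"/"priority" count plus either a "nodes" list or a "group" name, and every referenced group being defined under "groups") are short to express. Below is a hypothetical Python sketch of such a validator for the format proposed above; the function name and error strings are invented for illustration, and the real server-side check would of course be C code using PostgreSQL's JSON facilities:

```python
import json

def validate_syncinfo(text):
    """Structurally validate a pg_syncinfo-style JSON blob (sketch).

    Checks the two points raised in the thread: each rule must carry a
    "quorum" or "priority" count plus a "nodes" list or a "group" name,
    and every referenced group must be defined under "groups".
    Returns a list of error strings; an empty list means it looks sane.
    """
    errors = []
    conf = json.loads(text)
    groups = conf.get("groups", {})

    def check_rule(rule, path):
        if isinstance(rule, str):
            return  # a bare standby name
        if "quorum" not in rule and "priority" not in rule:
            errors.append("%s: need 'quorum' or 'priority'" % path)
        if "nodes" in rule:
            for i, sub in enumerate(rule["nodes"]):
                check_rule(sub, "%s.nodes[%d]" % (path, i))
        elif "group" in rule:
            if rule["group"] not in groups:
                errors.append("%s: undefined group '%s'"
                              % (path, rule["group"]))
        else:
            errors.append("%s: need 'nodes' or 'group'" % path)

    if "sync_info" not in conf:
        errors.append("missing 'sync_info'")
    else:
        check_rule(conf["sync_info"], "sync_info")
    return errors
```

A blob whose "sync_info" points at an unlisted group then produces an "undefined group" complaint at validation time instead of surprising the DBA at SIGHUP.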
On Tue, Aug 4, 2015 at 8:37 PM, Beena Emerson <memissemerson@gmail.com> wrote: > Robert Haas wrote: >>Maybe shoehorning this into the GUC mechanism is the wrong thing, and >>what we really need is a new config file for this. The information >>we're proposing to store seems complex enough to justify that. >> > > I think the consensus is that JSON is better. I guess so as well. Thanks for brainstorming the whole thread in a single post. > And using a new file with multi-line support would be good. This file just contains a JSON blob, hence we just need to fetch its content entirely and then let the server parse it using the existing facilities. > Name of the file: how about pg_syncinfo.conf? > Backward compatibility: synchronous_standby_names will be supported. > synchronous_standby_names='pg_syncinfo' indicates use of the new file. This strengthens the fact that parsing is done at SIGHUP, so that sounds fine to me. We may still find an application_name that uses pg_syncinfo, but well, that's unlikely to happen... > JSON format: > It would contain 2 main keys: "sync_info" and "groups" > The "sync_info" would consist of "quorum"/"priority" with the count and > "nodes"/"group" with the group name or node list. > The optional "groups" key would list out all the "group" mentioned within > "sync_info" along with the node list. > > [...] > > Thoughts? Yes, I think that's the idea. I would leave it a couple of days to give people time to voice their opinions and objections regarding this approach though. -- Michael
On Wed, Jul 1, 2015 at 11:21:47AM -0700, Josh Berkus wrote: > All: > > Replying to multiple people below. > > On 07/01/2015 07:15 AM, Fujii Masao wrote: > > On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh@agliodbs.com> wrote: > >> You're confusing two separate things. The primary manageability problem > >> has nothing to do with altering the parameter. The main problem is: if > >> there is more than one synch candidate, how do we determine *after the > >> master dies* which candidate replica was in synch at the time of > >> failure? Currently there is no way to do that. This proposal plans to, > >> effectively, add more synch candidate configurations without addressing > >> that core design failure *at all*. That's why I say that this patch > >> decreases overall reliability of the system instead of increasing it. > > > > I agree this is a problem even today, but it's basically independent from > > the proposed feature *itself*. So I think that it's better to discuss and > > work on the problem separately. If so, we might be able to provide > > good way to find new master even if the proposed feature finally fails > > to be adopted. > > I agree that they're separate features. My argument is that the quorum > synch feature isn't materially useful if we don't create some feature to > identify which server(s) were in synch at the time the master died. I am coming in here late, but I thought the last time we talked about this that the only reasonable way to communicate that we have changed to synchronize with a secondary server (different application_name) is to allow a GUC-configured command string to be run when a change like this happens. The command string would write a status on another server or send an email. Based on the new s_s_name API, this would mean whenever we switch to a different priority level, like 1 to 2, 2 to 3, or 2 to 1. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 8/4/15 9:18 PM, Michael Paquier wrote: >> >And using a new file with multi line support would be good. > This file just contains a JSON blob, hence we just need to fetch its > content entirely and then let the server parse it using the existing > facilities. It sounds like there's other places where multiline GUCs would be useful, so I think we should just support that instead of creating something that only works for SR configuration. I also don't see the problem with supporting multi-line GUCs that are wrapped in quotes. Yes, you miss a quote and things blow up, but so what? Anyone that's done any amount of programming has faced that problem. Heck, if we wanted to be fancy we could watch for the first line that could have been another GUC and stick that in a hint. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
Attachment
On Fri, Sep 11, 2015 at 3:41 AM, Beena Emerson <memissemerson@gmail.com> wrote: > Please find attached the WIP patch for the proposed feature. It is built > based on the already discussed design. > > Changes made: > - add new parameter "sync_file" to provide the location of the pg_syncinfo > file. The default is 'ConfigDir/pg_syncinfo.conf', same as for the pg_hba and > pg_ident files. I am not sure that's really necessary. We could just hardcode its location. > - pg_syncinfo file will hold the sync rep information in the approved JSON > format. OK. Have you considered as well the approach of adding support for multi-line GUC parameters? This has been mentioned a couple of times above as well, with something like this I imagine: param = 'value1,' \ 'value2,' \ 'value3' and this reads as 'value1,value2,value3'. Other parameters would benefit from this as well. > - synchronous_standby_names can be set to 'pg_syncinfo.conf' to read the > JSON value stored in the file. Check. > - All the standbys mentioned in the s_s_names or the pg_syncinfo file > currently get the priority as 1 and all others as 0 (async) > - Various functions in syncrep.c to read the JSON file and store the values > in a struct to be used in checking the quorum status of syncrep standbys > (SyncRepGetQuorumRecPtr function). > It does not support the current behavior for synchronous_standby_names = '*'. > I am yet to thoroughly test the patch. As this patch adds a whole new infrastructure, it is going to need complex test setups with many configurations, which will require bash-ing a bunch of new scripts, and we are not protected from bugs in those scripts or from manual manipulation mistakes during the tests. What I think looks really necessary with this patch is to include a set of tests to prove that the patch actually does what it should with complex scenarios and that it does it correctly.
So we had better perhaps move on with this patch first: https://commitfest.postgresql.org/6/197/ And it would be really nice to get the tests of this patch integrated with it as well. We are not protected from bugs in this patch either, but if we have a centralized infrastructure, that will add a level of confidence that we are doing things the right way. Your patch also offers a good occasion to see if there are some generic routines that would be helpful in this recovery test suite. Regards, -- Michael
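The backslash-continuation idea floated above is easy to prototype outside the server. Below is a minimal, hypothetical Python sketch (not the real guc-file.l lexer, which works differently) of how continued physical lines and adjacent quoted strings could be combined; the nice property is that a lone missing backslash only breaks one logical line, not the rest of the file:

```python
import re

def read_logical_lines(text):
    """Join physical lines ending in a backslash into one logical line,
    mimicking the continuation syntax suggested above (sketch only)."""
    logical, pending = [], ""
    for raw in text.splitlines():
        line = pending + raw.rstrip()
        if line.endswith("\\"):
            pending = line[:-1]  # drop the backslash, keep accumulating
        else:
            logical.append(line)
            pending = ""
    if pending:  # dangling continuation at EOF
        logical.append(pending)
    return logical

def parse_guc(line):
    """Split "name = 'v1,' 'v2'" into (name, value), concatenating
    adjacent single-quoted strings like C string literals."""
    name, _, rhs = line.partition("=")
    return name.strip(), "".join(re.findall(r"'([^']*)'", rhs))
```

With this, `param = 'value1,' \` over three lines reads back as a single `value1,value2,value3` setting.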
Hello, I did apply the patch to HEAD and tried to set up basic async replication. But I got an error. Turned on logging for details below. Unpatched Primary Log LOG: database system was shut down at 2015-09-12 13:41:40 IST LOG: MultiXact member wraparound protections are now enabled LOG: database system is ready to accept connections LOG: autovacuum launcher started Unpatched Standby log LOG: entering standby mode LOG: redo starts at 0/2000028 LOG: invalid record length at 0/20000D0 LOG: started streaming WAL from primary at 0/2000000 on timeline 1 LOG: consistent recovery state reached at 0/20000F8 LOG: database system is ready to accept read only connections Patched Primary log LOG: database system was shut down at 2015-09-12 13:50:17 IST LOG: MultiXact member wraparound protections are now enabled LOG: database system is ready to accept connections LOG: autovacuum launcher started LOG: server process (PID 17317) was terminated by signal 11: Segmentation fault LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. 
LOG: all server processes terminated; reinitializing LOG: database system was interrupted; last known up at 2015-09-12 13:50:18 IST FATAL: the database system is in recovery mode LOG: database system was not properly shut down; automatic recovery in progress LOG: invalid record length at 0/3000098 LOG: redo is not required LOG: MultiXact member wraparound protections are now enabled LOG: database system is ready to accept connections LOG: autovacuum launcher started LOG: server process (PID 17343) was terminated by signal 11: Segmentation fault LOG: terminating any other active server processes Patched Standby log LOG: database system was interrupted; last known up at 2015-09-12 13:50:16 IST FATAL: the database system is starting up FATAL: the database system is starting up FATAL: the database system is starting up FATAL: the database system is starting up LOG: entering standby mode LOG: redo starts at 0/2000028 LOG: invalid record length at 0/20000D0 LOG: started streaming WAL from primary at 0/2000000 on timeline 1 FATAL: could not receive data from WAL stream: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. FATAL: could not connect to the primary server: FATAL: the database system is in recovery mode Not sure if there is something I am missing which causes this. regards Sameer -- View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5865685.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Hello,

Please find attached the WIP patch for the proposed feature. It is built based on the already discussed design.

Changes made:
- add new parameter "sync_file" to provide the location of the pg_syncinfo file. The default is 'ConfigDir/pg_syncinfo.conf', same as for the pg_hba and pg_ident files.
- pg_syncinfo file will hold the sync rep information in the approved JSON format.
- synchronous_standby_names can be set to 'pg_syncinfo.conf' to read the JSON value stored in the file.
- All the standbys mentioned in the s_s_names or the pg_syncinfo file currently get the priority as 1 and all others as 0 (async).
- Various functions in syncrep.c to read the JSON file and store the values in a struct to be used in checking the quorum status of syncrep standbys (SyncRepGetQuorumRecPtr function).

It does not support the current behavior for synchronous_standby_names = '*'. I am yet to thoroughly test the patch.

Thoughts?

http://www.enterprisedb.com
Hello, Continuing testing: for the pg_syncinfo.conf below, an error is thrown. { "sync_info": { "quorum": 3, "nodes": [ {"priority":1,"group":"cluster1"}, "A" ] }, "groups": { "cluster1":["B","C"] } } LOG: database system is ready to accept connections LOG: autovacuum launcher started TRAP: FailedAssertion("!(n < list->length)", File: "list.c", Line: 392) LOG: server process (PID 17764) was terminated by signal 6: Aborted LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. LOG: all server processes terminated; reinitializing LOG: database system was interrupted; last known up at 2015-09-15 17:15:35 IST In the scenario here the quorum specified is 3 but there are just 2 nodes; what should the expected behaviour be? I feel the JSON parsing should throw an appropriate error with an explanation, as the sync rule does not make sense. 
The behaviour where the master keeps waiting for the non-existent 3rd quorum node will not be helpful anyway. regards Sameer -- View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5865954.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
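Sameer's case (a quorum of 3 over only two listed members) could indeed be rejected while parsing pg_syncinfo rather than discovered at commit time. Here is a hypothetical Python sketch of that sanity check, treating each listed member (bare node or nested group) as able to contribute at most one ACK toward its parent's count; names and semantics are illustrative only:

```python
def check_counts(rule, groups, path="sync_info"):
    """Flag rules whose quorum/priority count exceeds the number of
    members available to acknowledge (illustrative only; key names
    follow the JSON format discussed in this thread)."""
    errors = []
    count = rule.get("quorum", rule.get("priority", 1))
    if "nodes" in rule:
        members = rule["nodes"]
    else:
        # a rule referring to a named group draws on that group's list
        members = groups.get(rule.get("group"), [])
    if count > len(members):
        errors.append("%s: wants %d ACKs but has only %d members"
                      % (path, count, len(members)))
    for i, sub in enumerate(members):
        if isinstance(sub, dict):  # recurse into nested rules
            errors += check_counts(sub, groups, "%s[%d]" % (path, i))
    return errors
```

Run against the pg_syncinfo.conf above, this flags the top-level rule (3 ACKs wanted, 2 members listed) instead of letting the master wait forever.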
Hello,

Thank you Thomas and Sameer for checking the patch and giving your comments!

I will post an updated patch soon.

Regards,
Beena Emerson
On Tue, Sep 15, 2015 at 3:19 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I got the following error from clang-602.0.53 on my Mac: > > walsender.c:1955:11: error: passing 'char volatile[8192]' to parameter of > type 'void *' discards qualifiers > [-Werror,-Wincompatible-pointer-types-discards-qualifiers] > memcpy(walsnd->name, application_name, > strlen(application_name)); > ^~~~~~~~~~~~ > > I think your memcpy and explicit null termination could be replaced with > strcpy, or maybe something to limit buffer overrun damage in case of sizing > bugs elsewhere. But to get rid of that warning you'd still need to cast > away volatile... I note that you do that in SyncRepGetQuorumRecPtr when you > read the string with strcmp. But is that actually safe, with respect to > load/store reordering around spinlock operations? Do we actually need > volatile-preserving cstring copy and compare functions for this type of > thing? Maybe volatile isn't even needed here at all. I have asked that question separately here: http://www.postgresql.org/message-id/CAEepm=2f-N5MD+xYYyO=yBpC9SoOdCdrdiKia9_oLTSiu1uBtA@mail.gmail.com In SyncRepGetQuorumRecPtr you have strcmp(node->name, (char *) walsnd->name): that might be more problematic. I'm not sure about casting away volatile (it's probably fine, at least in practice), but it's accessing walsnd without the spinlock. The existing syncrep.c code already did that sort of thing (and I haven't had time to grok the thinking behind it yet), but I think you may be upping the ante here by doing non-atomic reads with strcmp (whereas the code in master always read single-word values). Imagine if you hit a slot that was being set up by InitWalSenderSlot concurrently, and memcpy was in the process of writing the name. strcmp would read garbage, maybe even off the end of the buffer because there is no terminator yet. That may be incredibly unlikely, but it seems fishy. 
Or I may have misunderstood the synchronisation at work here completely :-) -- Thomas Munro http://www.enterprisedb.com
>On 07/16/15, Robert Haas wrote:
>
>>> * Developers will immediately understand the format
>>
>> I doubt it. I think any format that we pick will have to be carefully
>> documented. People may know what JSON looks like in general, but they
>> will not immediately know what bells and whistles are available in
>> this context.
>>
>>> * Easy to programmatically manipulate in a range of languages
>>
>> <...> I think it will be rare to need to parse the postgresql.conf string,
>> manipulate it programmatically, and then put it back.
>
>On Sun, Jul 19, 2015 at 4:16 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>>> On 07/17/2015 04:36 PM, Jim Nasby wrote:
>>>> I'm guessing it'd be really ugly/hard to support at least this GUC being
>>>> multi-line?
>>
>>> Mind you, multi-line GUCs would be useful otherwise, but we don't want
>>> to hinge this feature on making that work.
>>
>> Do we really want such a global reduction in friendliness to make this
>> feature easier?
>
>Maybe shoehorning this into the GUC mechanism is the wrong thing, and
>what we really need is a new config file for this. The information
>we're proposing to store seems complex enough to justify that.

It seems like:
1) There's a need to support structured data in configuration for future needs as well; it isn't specific to this feature.
2) There should/must be a better way to validate configuration than restarting the server in search of syntax errors.

Creating a whole new configuration file for just one feature *and* in a different format seems suboptimal. What happens when the next 20 features need structured config data, where does that go? Will there be additional JSON config files *and* perhaps new mini-language values in .conf as development continues? How many dedicated configuration files is too many?

Now, about JSON... (earlier upthread):

On 07/01/15, Peter Eisentraut wrote:
> On 6/26/15 2:53 PM, Josh Berkus wrote:
> > I would also suggest that if I lose this battle and
> > we decide to go with a single stringy GUC, that we at least use JSON
> > instead of defining our own, proprietary, syntax?
>
> Does JSON have a natural syntax for a set without order?

No. Nor timestamps. It doesn't even distinguish integer from float (though parsers do it for you in dynamic languages). It's all because of its unsightly javascript roots.

The current patch is now forced by JSON to conflate sets and lists, so un/ordered semantics are no longer tied to type but to the specific configuration keys. So, if a feature ever needs a key where the difference between set and list matters and needs to support both, you'll need separate keys (both with lists, but meaning different things) or a separate "mode" key or something. Not terrible, just iffy.

Others have found JSON unsatisfactory before. For example, the clojure community has made (at least) two attempts at alternatives, complete with the meh adoption rates you'd expect despite being more capable formats:

http://blog.cognitect.com/blog/2014/7/22/transit
https://github.com/edn-format/edn

There's also YAML, TOML, etc., none as universal as JSON. But to reiterate, JSON itself has lackluster type support (no sets, no timestamps), is verbose, is easy to malform when editing (missed a curly brace, shouldn't use a single quote), isn't extensible, and my personal pet peeve is that it doesn't allow non-string or bare-string keys in maps (a.k.a. "death by double-quotes"). Python has the very natural {1,2,3} syntax for sets, but of course that's not part of JSON.

If JSON wins out despite all this, one alternative not discussed is to extend the .conf parser to accept JSON dicts as a fundamental type, e.g.:

###
data_directory = 'ConfigDir'
port = 5432
work_mem = 4MB
hot_standby = off
client_min_messages = notice
log_error_verbosity = default
autovacuum_analyze_scale_factor = 0.1
synch_standby_config = {
    "sync_info": {
        "nodes": [
            {
                "priority": 1,
                "group": "cluster1"
            },
            "A"
        ],
        "quorum": 3
    },
    "groups": {
        "cluster1": ["B", "C"]
    }
}

This *will* break someone's perl, I would guess. Ironically, those scripts wouldn't have broken if some structured format were in use for the configuration data when they were written... `postgres --describe-config` is also pretty much tied to a line-oriented configuration.

Amir

p.s.

The MIA configuration validation tool/switch should probably get a thread too.
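The "JSON dict as a fundamental .conf value type" idea sketched above is mechanically simple, brace-counting aside. A toy Python sketch of such a reader (all names hypothetical; it deliberately ignores the corner case of braces inside quoted strings, which a real lexer would have to handle):

```python
import json

def parse_conf(text):
    """Parse a toy postgresql.conf-style file in which a value may be a
    multi-line JSON object, per the syntax sketched above (assumption:
    braces never appear inside quoted JSON strings)."""
    settings = {}
    lines = iter(text.splitlines())
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if value.startswith("{"):
            # consume physical lines until the braces balance out
            depth = value.count("{") - value.count("}")
            chunks = [value]
            while depth > 0:
                nxt = next(lines)
                depth += nxt.count("{") - nxt.count("}")
                chunks.append(nxt)
            settings[name] = json.loads("\n".join(chunks))
        else:
            settings[name] = value.strip("'")
    return settings
```

Plain `name = value` lines keep working, while a brace-opened value becomes a parsed JSON object; line-oriented tooling would only choke on the new multi-line values, which matches the "break someone's perl" concern above.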
On Wed, Sep 23, 2015 at 12:11 AM, Amir Rohan <amir.rohan@mail.com> wrote: > It seems like: > 1) There's a need to support structured data in configuration for future > needs as well, it isn't specific to this feature. > 2) There should/must be a better way to validate configuration then > to restarting the server in search of syntax errors. > > Creating a whole new configuration file for just one feature *and* in a > different > format seems suboptimal. What happens when the next 20 features need > structured > config data, where does that go? will there be additional JSON config files > *and* perhaps > new mini-language values in .conf as development continues? How many > dedicated > configuration files is too many? Well, I think that if we create our own mini-language, it may well be possible to make the configuration for this compact enough to fit on one line. If we use JSON, I think there's zap chance of that. But... that's just what *I* think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Well, I think that if we create our own mini-language, it may well be > possible to make the configuration for this compact enough to fit on > one line. If we use JSON, I think there's zap chance of that. But... > that's just what *I* think. Well, that depends on what you think the typical-case complexity is and on how long a line will fit in your editor window ;-). I think that we can't make much progress on this argument without a pretty concrete idea of what typical and worst-case configurations would look like. Would someone like to put forward examples? Then we could try them in any specific syntax that's suggested and see how verbose it gets. FWIW, I tend to agree that if we think common cases can be held to, say, a hundred or two hundred characters, that we're best off avoiding the challenges of dealing with multi-line postgresql.conf entries. And I'm really not much in favor of a separate file; if we go that way then we're going to have to reinvent a huge amount of infrastructure that already exists for GUCs. regards, tom lane
> Sent: Thursday, September 24, 2015 at 3:11 AM > > From: "Tom Lane" <tgl@sss.pgh.pa.us> > Robert Haas <robertmhaas@gmail.com> writes: > > Well, I think that if we create our own mini-language, it may well be > > possible to make the configuration for this compact enough to fit on > > one line. If we use JSON, I think there's zap chance of that. But... > > that's just what *I* think. >> I've implemented a parser that reads your mini-language and dumps a JSON equivalent. Once you start naming groups the line fills up quite quickly, and on the other hand the JSON is verbose and fiddly. But implementing a mechanism that can be used by other features in the future seems the deciding factor here, rather than the brevity of a bespoke mini-language. > > <...> we're best off avoiding the challenges of dealing with multi-line > postgresql.conf entries. > > And I'm really not much in favor of a separate file; if we go that way > then we're going to have to reinvent a huge amount of infrastructure > that already exists for GUCs. > > regards, tom lane Adding support for JSON objects (or some other kind of composite data type) to the .conf parser would negate the need for one, and would also solve the problem being discussed for future cases. I don't know whether that would break some tooling you care about, but if there's interest, I can probably do some of that work.
On Fri, Sep 11, 2015 at 10:15 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Sep 11, 2015 at 3:41 AM, Beena Emerson <memissemerson@gmail.com> wrote: >> Please find attached the WIP patch for the proposed feature. It is built >> based on the already discussed design. >> >> Changes made: >> - add new parameter "sync_file" to provide the location of the pg_syncinfo >> file. The default is 'ConfigDir/pg_syncinfo.conf', same as for the pg_hba and >> pg_ident files. > > I am not sure that's really necessary. We could just hardcode its location. > >> - pg_syncinfo file will hold the sync rep information in the approved JSON >> format. > > OK. Have you considered as well the approach of adding support for > multi-line GUC parameters? This has been mentioned a couple of times > above as well, with something like this I imagine: > param = 'value1,' \ > 'value2,' \ > 'value3' > and this reads as 'value1,value2,value3'. Other parameters would benefit > from this as well. > I agree with adding support for multi-line GUC parameters. But I thought it was: param = 'param1, param2, param3' This reads as 'value1,value2,value3'. Regards, -- Masahiko Sawada
Amir Rohan wrote: > But implementing a mechanism that can be used by other features in > the future seems the deciding factor here, rather then the brevity of a > bespoke mini-language. One decision to be taken is which among JSON or mini-language is better for the SR setting. Mini language can fit into the postgresql.conf single line. For JSON currently a different file is used. But as said earlier, in case composite types are required in future for other parameters then having multiple .conf files does not make sense. To avoid this we can: - support multi-line GUC which would be helpful for other comma-separated conf values along with s_s_names. (This can make mini-language more readable as well) - Allow JSON support in postgresql.conf. So that other parameters in future can use JSON as well within postgresql.conf. What are the chances of future data requiring JSON? I think rare. > > And I'm really not much in favor of a separate file; if we go that way > > then we're going to have to reinvent a huge amount of infrastructure > > that already exists for GUCs. > > Adding support for JSON objects (or some other kind of composite data > type) > to the .conf parser would negate the need for one, and would also solve > the > problem being discussed for future cases. With the current pg_syncinfo file, the only code added was to check the pg_syncinfo file in the specified path and read the entire content of the file into a variable which was used for further parsing which could have been avoided with multi-line GUC. ----- Beena Emerson -- View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5869285.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Hello, The JSON method was used in the patch because it seemed to be the group consensus. Requirement: - Grouping : Specify a list of node names with the required number of ACK for the group. We could have priority or quorum group. Quorum treats all the standby in same level and ACK from any k can be considered. In priority behavior, ACK must be received from the specified k lowest priority servers for a successful transaction. - Group names to enable easier status reporting for group. The topmost group may not be named. It will be assigned a default name. All the sub groups are to be compulsorily named. - Not more than 3 groups with 1 level of nesting expected Behavior in submitted patch: - The name of the top most group is named ‘Default Group”. All the other standby_names or groups will have to be listed within this. - When more than 1 connected standby has the samename then the highest LSN among them is chosen. Example: 2 priority in (X,Y,Z). If there 2 nodes X connected, even though both X have returned ACK, the server will wait for ACK from Y. - There are no “potential” standbys. In quorum behavior, there are no fixed standbys which are to be in sync, all members are equal. ACK from any specified n nodes from a set is considered success. Further: - improvements to pg_stat_replication to give the node tree and status? - Manipulate/Edit conf setting using functions. - Regression tests Mini-lang: [] - to specify prioirty () - to specify quorum Format - <name> : <count> [<list>] Not specifying count defaults to 1. Ex: s_s_names = '2(cluster1: 1(A,B), cluster2: 2[X,Y,Z], U)' JSON It would contain 2 main keys: "sync_info" and "groups" The "sync_info" would consist of "quorum"/"priority" with the count and "nodes"/"group" with the group name or node list. 
The optional "groups" key would list out all the groups mentioned within "sync_info" along with their node lists. Ex:

{
    "sync_info":
    {
        "quorum": 2,
        "nodes":
        [
            {"quorum": 1, "group": "cluster1"},
            {"priority": 2, "group": "cluster2"},
            "U"
        ]
    },
    "groups":
    {
        "cluster1": ["A", "B"],
        "cluster2": ["X", "Y", "Z"]
    }
}

JSON vs. mini-language:
- JSON is more verbose.
- With JSON you can define a group once and use it multiple times in the sync settings, but since not many levels of nesting are expected I am not sure how useful this will be.
- Though a JSON parser is built in, additional code is required to check for the required format of the JSON. For the mini-language, a new parser will have to be written.

Despite all this, I feel the mini-language is better, mainly for its brevity. Also, it will not require additional GUC parser support (multi-line values).

-----
Beena Emerson
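Beena notes that although a JSON parser is built in, additional code would still be needed to check the required shape of the value. A minimal sketch of such a shape check, using the key names from the proposal above (the function name and return convention are illustrative, not from any patch):

```python
import json

def validate_sync_json(text):
    """Check the basic shape of the proposed s_s_names JSON (sketch only)."""
    cfg = json.loads(text)
    if "sync_info" not in cfg:
        return False
    groups = cfg.get("groups", {})

    def check(node):
        if isinstance(node, str):          # a plain standby name
            return True
        if not isinstance(node, dict):
            return False
        # exactly one of "quorum"/"priority", with an integer count
        methods = [k for k in ("quorum", "priority") if k in node]
        if len(methods) != 1 or not isinstance(node[methods[0]], int):
            return False
        if "group" in node:                # must refer to a declared group
            return node["group"] in groups
        return all(check(child) for child in node.get("nodes", []))

    return check(cfg["sync_info"])
```

For example, the JSON value from the mail above passes this check, while a value missing "sync_info" or referring to an undeclared group does not.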
Sawada Masahiko wrote:
> I agree with adding support for multi-line GUC parameters.
> But I thought it is:
> param = 'param1,
> param2,
> param3'
>
> This reads as 'value1,value2,value3'.

Use of '\' ensures that omission of the closing quote does not break the entire file.

-----
Beena Emerson
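As an illustration of the backslash-continuation idea (hypothetical syntax; postgresql.conf does not currently support multi-line values):

```
# Hypothetical multi-line GUC value using '\' continuation; if the
# closing quote were forgotten, only this entry would be broken,
# not the rest of the file:
synchronous_standby_names = 'cluster1: 1(A,B), \
                             cluster2: 2[X,Y,Z], \
                             U'
```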
>
>
> Mini-lang:
> [] - to specify priority
> () - to specify quorum
> Format - <name> : <count> [<list>]
> Not specifying count defaults to 1.
> Ex: s_s_names = '2(cluster1: 1(A,B), cluster2: 2[X,Y,Z], U)'
>
> JSON
> It would contain 2 main keys: "sync_info" and "groups"
> The "sync_info" would consist of "quorum"/"priority" with the count and
> "nodes"/"group" with the group name or node list.
> The optional "groups" key would list out all the "group" mentioned within
> "sync_info" along with the node list.
> Ex: {
> "sync_info":
> {
> "quorum":2,
> "nodes":
> [
> {"quorum":1,"group":"cluster1"},
> {"priority":2,"group": "cluster2"},
> "U"
> ]
> },
> "groups":
> {
> "cluster1":["A","B"],
> "cluster2":["X","Y","Z"]
> }
> }
>
> JSON and mini-language:
> - JSON is more verbose
> - You can define a group and use it multiple times in sync settings
> but since not many levels of nesting are expected I am not sure how useful
> this will be.
> - Though JSON parser is inbuilt, additional code is required to check
> for the required format of JSON. For mini-language, new parser will have to
> be written.
>
On Fri, Oct 9, 2015 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Sounds like both the approaches have some pros and cons, also there are
> some people who prefer mini-language and others who prefer JSON. I think
> one thing that might help, is to check how other databases support this
> feature or somewhat similar to this feature (mainly with respect to User
> Interface), as that can help us in knowing what users are already familiar
> with.

+1!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Oct 10, 2015 at 4:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 9, 2015 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Sounds like both the approaches have some pros and cons, also there are
>> some people who prefer mini-language and others who prefer JSON. I think
>> one thing that might help, is to check how other databases support this
>> feature or somewhat similar to this feature (mainly with respect to User
>> Interface), as that can help us in knowing what users are already familiar
>> with.
>
> +1!

For example, MySQL 5.7 has a similar feature, but it doesn't support quorum commit and is simpler than the feature PostgreSQL is attempting. There is one configuration parameter in MySQL 5.7 which indicates the number of sync replication nodes. The primary server commits when it receives the specified number of ACKs from standby servers, regardless of the names of the standby servers.

And IIRC, the Oracle database also doesn't support quorum commit. Whether a standby server is sync or async is specified per standby server in a configuration parameter on the primary node.

I think that the JSON format approach and the dedicated language approach serve different needs. The dedicated language approach would be useful for simple configuration, such as one level of nesting without groups. This would allow us to configure replication more simply and easily. In contrast, the JSON format approach would be useful for complex configuration.

I thought that this feature for PostgreSQL should be simple in its first implementation. It would be acceptable even if there are some restrictions, such as on the nesting level or the group setting. Another new approach that I came up with is:
* Add a new parameter synchronous_replication_method (say s_r_method) which can have two values: 'priority', 'quorum'.
* If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3') is handled using priority. It's the same as '[n1,n2,n3]' in the dedicated language.
* If s_r_method = 'quorum', the value of s_s_names is handled using quorum commit. It's the same as '(n1,n2,n3)' in the dedicated language.
* The setting of synchronous_standby_names is the same as today. That is, storing nested values is not supported.
* If we want to support more complex syntax like what we are discussing, we can add new values to s_r_method, for example 'complex' or 'json'.

Thoughts?

Regards,

--
Masahiko Sawada
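A rough sketch of the proposed s_r_method semantics (all function and parameter names here are illustrative, not from any patch; the priority case is simplified to "the first-listed names", ignoring disconnected standbys):

```python
def commit_can_release(method, s_s_names, num_sync, acked):
    """Decide whether a commit may be released, given the set of standby
    application_names that have acknowledged the commit LSN."""
    names = [n.strip() for n in s_s_names.split(",")]
    if method == "priority":
        # ACKs must come from the num_sync highest-priority
        # (i.e. first-listed) names
        return all(n in acked for n in names[:num_sync])
    elif method == "quorum":
        # ACKs from any num_sync of the listed names are enough
        return len([n for n in names if n in acked]) >= num_sync
    raise ValueError(method)
```

Under 'priority' with num_sync = 2, an ACK from n1 and n3 is not enough (n2 is still awaited); under 'quorum' it is.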
On 10/13/2015 11:02 AM, Masahiko Sawada wrote:
> I thought that this feature for postgresql should be simple at first
> implementation.
> It would be good even if there are some restrictions such as the
> nesting level, the group setting.
> Another new approach that I came up with is:
> * Add a new parameter synchronous_replication_method (say s_r_method)
> which can have two values: 'priority', 'quorum'
> * If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3')
> is handled using priority. It's the same as '[n1,n2,n3]' in the dedicated
> language.
> * If s_r_method = 'quorum', the value of s_s_names is handled using
> quorum commit. It's the same as '(n1,n2,n3)' in the dedicated language.

Well, the first question is: can you implement both of these things for 9.6, realistically? If you can implement them, then we can argue about configuration format later. It's even possible that the nature of your implementation will enforce a particular syntax.

For example, if your implementation requires sync groups to be named, then we have to include group names in the syntax. If you can't implement nesting in the near future, there's no reason to have a syntax for it.

> * The setting of synchronous_standby_names is the same as today. That is,
> storing nested values is not supported.
> * If we want to support more complex syntax like what we are
> discussing, we can add new values to s_r_method, for example
> 'complex', 'json'.

I think having two different syntaxes is a bad idea. I'd rather have a wholly proprietary configuration markup than deal with two alternate ones.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Wed, Oct 14, 2015 at 3:16 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 10/13/2015 11:02 AM, Masahiko Sawada wrote:
>> I thought that this feature for postgresql should be simple at first
>> implementation.
>> It would be good even if there are some restrictions such as the
>> nesting level, the group setting.
>> Another new approach that I came up with is:
>> * Add a new parameter synchronous_replication_method (say s_r_method)
>> which can have two values: 'priority', 'quorum'
>> * If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3')
>> is handled using priority. It's the same as '[n1,n2,n3]' in the dedicated
>> language.
>> * If s_r_method = 'quorum', the value of s_s_names is handled using
>> quorum commit. It's the same as '(n1,n2,n3)' in the dedicated language.
>
> Well, the first question is: can you implement both of these things for
> 9.6, realistically?
> If you can implement them, then we can argue about
> configuration format later. It's even possible that the nature of your
> implementation will enforce a particular syntax.
>
> For example, if your implementation requires sync groups to be named,
> then we have to include group names in the syntax. If you can't
> implement nesting in the near future, there's no reason to have a syntax
> for it.

Yes, I can implement both without nesting. The draft patch for replication using priority is already implemented by Michael, so I need to implement the simple quorum commit logic and merge them.

>> * The setting of synchronous_standby_names is the same as today. That is,
>> storing nested values is not supported.
>> * If we want to support more complex syntax like what we are
>> discussing, we can add new values to s_r_method, for example
>> 'complex', 'json'.
>
> I think having two different syntaxes is a bad idea. I'd rather have a
> wholly proprietary configuration markup than deal with two alternate ones.

I agree; we should choose one of them.

Regards,

--
Masahiko Sawada
On Wed, Oct 14, 2015 at 3:28 AM, Masahiko Sawada wrote:
> The draft patch for replication using priority is already implemented
> by Michael, so I need to implement the simple quorum commit logic and
> merge them.

The latest patch I know of is this one:
http://www.postgresql.org/message-id/CAB7nPqRFSLmHbYonra0=p-X8MJ-XTL7oxjP_QXDJGsjpvWRXPA@mail.gmail.com
It would surely need a rebase.

--
Michael
On Wed, Oct 14, 2015 at 3:02 AM, Masahiko Sawada wrote:
> On Sat, Oct 10, 2015 at 4:35 AM, Robert Haas wrote:
>> On Fri, Oct 9, 2015 at 12:00 AM, Amit Kapila wrote:
>>> Sounds like both the approaches have some pros and cons, also there are
>>> some people who prefer mini-language and others who prefer JSON. I think
>>> one thing that might help, is to check how other databases support this
>>> feature or somewhat similar to this feature (mainly with respect to User
>>> Interface), as that can help us in knowing what users are already familiar
>>> with.
>>
>> +1!

Thanks for having a look at that!

> For example, MySQL 5.7 has a similar feature, but it doesn't support
> quorum commit and is simpler than the feature PostgreSQL is attempting.
> There is one configuration parameter in MySQL 5.7 which indicates the
> number of sync replication nodes.
> The primary server commits when it receives the specified number of
> ACKs from standby servers, regardless of the names of the standby servers.

Hm. This is not much help in the case we especially mentioned upthread at some point with 2 data centers: the first one has the master and a sync standby, and the second one has a set of standbys. We need to be sure that the standby in DC1 acknowledges all the time, and we would only need to wait for one or more of them in DC2. I still believe that this is the main use case for this feature: to ensure a proper failover without data loss if one data center blows away with a meteorite.

> And IIRC, the Oracle database also doesn't support quorum commit.
> Whether a standby server is sync or async is specified per standby
> server in a configuration parameter on the primary node.

And I guess that they manage standby nodes using a system catalog then, being able to change the state of a node from async to sync with something at SQL level? Is that right?

> I thought that this feature for postgresql should be simple at first
> implementation.

And extensible.

> It would be good even if there are some restrictions such as the
> nesting level, the group setting.
> Another new approach that I came up with is:
> * Add a new parameter synchronous_replication_method (say s_r_method)
> which can have two values: 'priority', 'quorum'
> * If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3')
> is handled using priority. It's the same as '[n1,n2,n3]' in the dedicated
> language.
> * If s_r_method = 'quorum', the value of s_s_names is handled using
> quorum commit. It's the same as '(n1,n2,n3)' in the dedicated language.
> * The setting of synchronous_standby_names is the same as today. That is,
> storing nested values is not supported.
> * If we want to support more complex syntax like what we are
> discussing, we can add new values to s_r_method, for example
> 'complex', 'json'.

If we go that path, I think that we would still need an extra parameter to control the number of nodes that need to be taken from the set defined in s_s_names, whichever of quorum or priority is used. Let's not forget that in the current configuration, the first node listed in s_s_names and *connected* to the master will be used to acknowledge the commit.

--
Michael
Replying to multiple members:

> Hm. This is not much help in the case we especially mentioned
> upthread at some point with 2 data centers: the first one has the master
> and a sync standby, and the second one has a set of standbys. We need to
> be sure that the standby in DC1 acknowledges all the time, and we
> would only need to wait for one or more of them in DC2. I still
> believe that this is the main use case for this feature: to ensure a
> proper failover without data loss if one data center blows away with a
> meteorite.

Yes, I think so too. In such a case, the idea I posted yesterday could handle it with the following settings:
* s_r_method = 'quorum'
* s_s_names = 'tokyo, seattle'
* s_s_nums = 2
* application_name of the first standby, which is in DC1, is 'tokyo', and application_name of the other standbys, which are in DC2, is 'seattle'.

> And I guess that they manage standby nodes using a system catalog
> then, being able to change the state of a node from async to sync with
> something at SQL level? Is that right?

I think that's right.

> If we go that path, I think that we would still need an extra
> parameter to control the number of nodes that need to be taken from
> the set defined in s_s_names, whichever of quorum or priority is used.
> Let's not forget that in the current configuration the first node
> listed in s_s_names and *connected* to the master will be used to
> acknowledge the commit.

Yeah, such a parameter is needed. I had forgotten to consider that.

> Would it be better to just use a simple language instead of 3 different
> parameters?
>
> s_s_names = 2[X,Y,Z] # 2 priority
> s_s_names = 1(A,B,C) # 1 quorum
> s_s_names = R,S,T # default behavior: 1 priority?

I think that this would mean choosing the dedicated language approach instead of the JSON format approach. If we want to set up multi sync replication in more complex ways, we would then have no choice other than improving the dedicated language.

Regards,

--
Masahiko Sawada
On Wed, Oct 14, 2015 at 5:58 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> On Wed, Oct 14, 2015 at 10:38 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Oct 14, 2015 at 3:02 AM, Masahiko Sawada wrote:
>>> It would be good even if there are some restrictions such as the
>>> nesting level, the group setting.
>>> Another new approach that I came up with is:
>>> * Add a new parameter synchronous_replication_method (say s_r_method)
>>> which can have two values: 'priority', 'quorum'
>>> * If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3')
>>> is handled using priority. It's the same as '[n1,n2,n3]' in the
>>> dedicated language.
>>> * If s_r_method = 'quorum', the value of s_s_names is handled using
>>> quorum commit. It's the same as '(n1,n2,n3)' in the dedicated language.
>>> * The setting of synchronous_standby_names is the same as today. That is,
>>> storing nested values is not supported.
>>> * If we want to support more complex syntax like what we are
>>> discussing, we can add new values to s_r_method, for example
>>> 'complex', 'json'.
>>
>> If we go that path, I think that we would still need an extra
>> parameter to control the number of nodes that need to be taken from
>> the set defined in s_s_names, whichever of quorum or priority is used.
>> Let's not forget that in the current configuration the first node
>> listed in s_s_names and *connected* to the master will be used to
>> acknowledge the commit.
>
> Would it be better to just use a simple language instead of 3 different
> parameters?
>
> s_s_names = 2[X,Y,Z] # 2 priority
> s_s_names = 1(A,B,C) # 1 quorum
> s_s_names = R,S,T # default behavior: 1 priority?

Yeah, the main use case for this feature would just be that for most users:

s_s_names = 2[dc1_standby, 1(dc2_standby1, dc2_standby2)]

meaning that we wait for dc1_standby, which is a standby in data center 1, and one of the dc2_standby* set, which are standbys in data center 2.

So the following minimal characteristics would be needed:
- support for priority selectivity for N nodes
- support for quorum selectivity for N nodes
- support for nested sets of nodes, at least 2 levels deep

The requirement to define named groups of nodes would also not be needed. If we have that, I would say that we already do better than OrXXXe and MyXXL, to cite two of them. And if we can get that for 9.6 or even 9.7, that would be really great.

Regards,
--
Michael
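Michael's nested example can be evaluated with a small recursive check. This sketch hand-encodes the requirement tree instead of parsing the mini-language, and it ignores the fall-back to lower-priority standbys when a higher-priority one is disconnected:

```python
def satisfied(node, acked):
    """Evaluate a nested sync-rep requirement tree against the set of
    standby names that have acknowledged.  Leaves are standby names;
    inner nodes are (kind, count, children) tuples."""
    if isinstance(node, str):
        return node in acked
    kind, count, children = node
    oks = [satisfied(c, acked) for c in children]
    if kind == "quorum":
        # any `count` members of the set suffice
        return sum(oks) >= count
    if kind == "priority":
        # the first `count` listed (highest-priority) members must all ACK
        return all(oks[:count])
    raise ValueError(kind)

# 2[dc1_standby, 1(dc2_standby1, dc2_standby2)]
req = ("priority", 2, ["dc1_standby",
                       ("quorum", 1, ["dc2_standby1", "dc2_standby2"])])
```

With this tree, an ACK from dc1_standby plus either DC2 standby releases the commit; ACKs from DC2 alone do not.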
On Wed, Oct 14, 2015 at 3:16 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 10/13/2015 11:02 AM, Masahiko Sawada wrote:
>> I thought that this feature for postgresql should be simple at first
>> implementation.
>> It would be good even if there are some restrictions such as the
>> nesting level, the group setting.
>> Another new approach that I came up with is:
>> * Add a new parameter synchronous_replication_method (say s_r_method)
>> which can have two values: 'priority', 'quorum'
>> * If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3')
>> is handled using priority. It's the same as '[n1,n2,n3]' in the dedicated
>> language.
>> * If s_r_method = 'quorum', the value of s_s_names is handled using
>> quorum commit. It's the same as '(n1,n2,n3)' in the dedicated language.
>
> Well, the first question is: can you implement both of these things for
> 9.6, realistically? If you can implement them, then we can argue about
> configuration format later. It's even possible that the nature of your
> implementation will enforce a particular syntax.

Hi,

Attached is a rough patch which supports multi sync replication by the other approach I sent before.

The new GUC parameters are:
* synchronous_standby_num, which specifies the number of standby servers using sync rep (default is 0).
* synchronous_replication_method, which specifies the replication method, priority or quorum (default is priority).

The behaviour of 'priority' and 'quorum' is the same as what we've been discussing, but I'll write an overview here again.

[Priority Method]
Each standby server has a different priority, and the active standby servers with the top N priorities become sync standby servers. If synchronous_standby_names = '*', all active standby servers would be sync standby servers. If you want to set up a standby like in 9.5 or before, you can set synchronous_standby_num = 1.

[Quorum Method]
All standby servers have the same priority, 1, and all the active standby servers are sync standby servers. The master server has to wait for ACKs from at least N sync standby servers before COMMIT. If synchronous_standby_names = '*', all active standby servers would be sync standby servers.

[Use case]
This patch can handle the main use case Michael described: there are 2 data centers, the first one has the master and a sync standby, and the second one has a set of standbys. We need to be sure that the standby in DC1 acknowledges all the time, and we would only need to wait for one or more of them in DC2. In order to handle this use case, you set these standbys and GUC parameters as follows:
* synchronous_standby_names = 'DC1, DC2'
* synchronous_standby_num = 2
* synchronous_replication_method = quorum
* The name of the standby server in DC1 is 'DC1', and the names of the two standby servers in DC2 are 'DC2'.

[Extensibility]
By setting the same application_name on different standbys, we can set up sync replication with grouped standbys. If we want to set up replication more complexly and flexibly, we could add new syntax for s_s_names (e.g., JSON format or a dedicated language) and add new values for synchronous_replication_method, e.g. s_r_method = 'complex'. And this patch doesn't need a new parser for the GUC parameter.

Regards,

--
Masahiko Sawada
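Collected as a postgresql.conf fragment, the use-case settings above would be (proposed GUCs from the patch under discussion, not released PostgreSQL):

```
# Two-data-center quorum setup from the mail above:
synchronous_standby_names      = 'DC1, DC2'
synchronous_standby_num        = 2
synchronous_replication_method = quorum

# Each standby's application_name is what s_s_names matches: the DC1
# standby uses application_name=DC1, both DC2 standbys use
# application_name=DC2 in their primary_conninfo.
```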
Attachment
On Tue, Oct 20, 2015 at 8:10 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> On Mon, Oct 19, 2015 at 8:47 PM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> Hi,
>>
>> Attached is a rough patch which supports multi sync replication by the
>> other approach I sent before.
>>
>> The new GUC parameters are:
>> * synchronous_standby_num, which specifies the number of standby
>> servers using sync rep (default is 0).
>> * synchronous_replication_method, which specifies the replication
>> method, priority or quorum (default is priority).
>>
>> The behaviour of 'priority' and 'quorum' is the same as what we've been
>> discussing, but I'll write an overview here again.
>>
>> [Priority Method]
>> Each standby server has a different priority, and the active standby
>> servers with the top N priorities become sync standby servers.
>> If synchronous_standby_names = '*', all active standby servers
>> would be sync standby servers.
>> If you want to set up a standby like in 9.5 or before, you can set
>> synchronous_standby_num = 1.
>
> I used the following setting with 2 servers A and D connected:
>
> synchronous_standby_names = 'A,B,C,D'
> synchronous_standby_num = 2
> synchronous_replication_method = 'priority'
>
> Though s_r_m = 'quorum' worked fine, changing it to 'priority' caused a
> segmentation fault.

Thank you for taking a look! This patch is a tool for discussion, so I'm not going to fix this bug until we reach consensus.

We are still discussing in order to find a solution that can get consensus. I felt that it's difficult to choose between the two approaches within this development cycle, and there would not be time to implement such a big feature even if we did choose. But this feature is obviously needed by many users. So I'm considering a simpler and more extensible solution; the idea I posted is one of them. Another approach worth considering is just specifying the number of sync standbys. It can also cover the main use cases.

Regards,

--
Masahiko Sawada
On Thu, Oct 22, 2015 at 12:47 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Oct 20, 2015 at 8:10 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>> On Mon, Oct 19, 2015 at 8:47 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> Attached is a rough patch which supports multi sync replication by the
>>> other approach I sent before.
>>>
>>> The new GUC parameters are:
>>> * synchronous_standby_num, which specifies the number of standby
>>> servers using sync rep (default is 0).
>>> * synchronous_replication_method, which specifies the replication
>>> method, priority or quorum (default is priority).
>>>
>>> [Priority Method]
>>> Each standby server has a different priority, and the active standby
>>> servers with the top N priorities become sync standby servers.
>>> If synchronous_standby_names = '*', all active standby servers
>>> would be sync standby servers.
>>> If you want to set up a standby like in 9.5 or before, you can set
>>> synchronous_standby_num = 1.
>>
>> I used the following setting with 2 servers A and D connected:
>>
>> synchronous_standby_names = 'A,B,C,D'
>> synchronous_standby_num = 2
>> synchronous_replication_method = 'priority'
>>
>> Though s_r_m = 'quorum' worked fine, changing it to 'priority' caused a
>> segmentation fault.
>
> Thank you for taking a look!
> This patch is a tool for discussion, so I'm not going to fix this bug
> until we reach consensus.
>
> We are still discussing in order to find a solution that can get consensus.
> I felt that it's difficult to choose between the two approaches within
> this development cycle, and there would not be time to implement such a
> big feature even if we did choose.
> But this feature is obviously needed by many users.
> So I'm considering a simpler and more extensible solution; the
> idea I posted is one of them.
> Another approach worth considering is just specifying the
> number of sync standbys. It can also cover the main use cases.

Yes, it covers the main and simple use case like "I want to have multiple synchronous replicas!". Even if we miss quorum commit in the first version, the feature is still very useful.

Regards,

--
Fujii Masao
On Thu, Oct 29, 2015 at 11:16 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Oct 22, 2015 at 12:47 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Tue, Oct 20, 2015 at 8:10 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>>> On Mon, Oct 19, 2015 at 8:47 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Attached is a rough patch which supports multi sync replication by the
>>>> other approach I sent before.
>>>>
>>>> The new GUC parameters are:
>>>> * synchronous_standby_num, which specifies the number of standby
>>>> servers using sync rep (default is 0).
>>>> * synchronous_replication_method, which specifies the replication
>>>> method, priority or quorum (default is priority).
>>>>
>>>> [Priority Method]
>>>> Each standby server has a different priority, and the active standby
>>>> servers with the top N priorities become sync standby servers.
>>>> If synchronous_standby_names = '*', all active standby servers
>>>> would be sync standby servers.
>>>> If you want to set up a standby like in 9.5 or before, you can set
>>>> synchronous_standby_num = 1.
>>>
>>> I used the following setting with 2 servers A and D connected:
>>>
>>> synchronous_standby_names = 'A,B,C,D'
>>> synchronous_standby_num = 2
>>> synchronous_replication_method = 'priority'
>>>
>>> Though s_r_m = 'quorum' worked fine, changing it to 'priority' caused a
>>> segmentation fault.
>>
>> Thank you for taking a look!
>> This patch is a tool for discussion, so I'm not going to fix this bug
>> until we reach consensus.
>>
>> We are still discussing in order to find a solution that can get consensus.
>> I felt that it's difficult to choose between the two approaches within
>> this development cycle, and there would not be time to implement such a
>> big feature even if we did choose.
>> But this feature is obviously needed by many users.
>> So I'm considering a simpler and more extensible solution; the
>> idea I posted is one of them.
>> Another approach worth considering is just specifying the
>> number of sync standbys. It can also cover the main use cases.
>
> Yes, it covers the main and simple use case like "I want to have multiple
> synchronous replicas!". Even if we miss quorum commit in the first
> version, the feature is still very useful.

It can cover not only the case you mentioned but also the main use case Michael mentioned, by setting the same application_name. And that first-version patch is almost implemented, so it just needs to be reviewed.

I think that it would be good to implement the simple feature in the first version, and then refine the design based on opinions and feedback from more users and use cases.

Regards,

--
Masahiko Sawada
Hello,

At Fri, 13 Nov 2015 09:07:01 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoC9Vi8wOGtXio3Z1NwoVfXBJPNFtt7+5jadVHKn17uHOg@mail.gmail.com>
> On Thu, Oct 29, 2015 at 11:16 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Thu, Oct 22, 2015 at 12:47 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
...
> >> This patch is a tool for discussion, so I'm not going to fix this bug
> >> until we reach consensus.
> >>
> >> We are still discussing in order to find a solution that can get consensus.
> >> I felt that it's difficult to choose between the two approaches within
> >> this development cycle, and there would not be time to implement such a
> >> big feature even if we did choose.
> >> But this feature is obviously needed by many users.
> >> So I'm considering a simpler and more extensible solution; the
> >> idea I posted is one of them.
> >> Another approach worth considering is just specifying the
> >> number of sync standbys. It can also cover the main use cases.
> >
> > Yes, it covers the main and simple use case like "I want to have multiple
> > synchronous replicas!". Even if we miss quorum commit in the first
> > version, the feature is still very useful.

+1

> It can cover not only the case you mentioned but also the main use case
> Michael mentioned, by setting the same application_name.
> And that first-version patch is almost implemented, so it just needs to
> be reviewed.
>
> I think that it would be good to implement the simple feature in the
> first version, and then refine the design based on opinions and
> feedback from more users and use cases.

Yeah, I agree with that. And I have two proposals in this direction.

- Notation

There is probably no argument about synchronous_standby_names, and about synchronous_replication_method as a variable to select other syntaxes, except perhaps its name. But I feel synchronous_standby_num looks a bit too specific. I'd like to propose the following, if this doesn't reprise the argument on notation for replication definitions :p

The following two GUCs would be enough to bear future expansion of the notation syntax and/or method:

synchronous_standby_names: as it is.
synchronous_replication_method: default is "1-priority", which means the same as the current behavior. Possible additional values so far would be:
  "n-priority": the format of s_s_names is "n, <name>, <name>, <name>...", where n is the number of required acknowledges.
  "n-quorum": the format of s_s_names is the same as above, but it is read in a quorum context.

These can be expanded in the future, for example, as follows:
  "complex": Michael's format.
  "json": JSON?
  "json-ext": specify JSON in an external file.

Even after we have complex notations, I suppose that many use cases are covered by the first three notations.

- Internal design

What should be done in SyncRepReleaseWaiters() is calculating a pair of LSNs that can be regarded as synced, deciding whether *this* walsender has advanced the LSN pair, then trying to release backends that wait for the LSNs *if* this walsender has advanced them.

From that viewpoint, the proposed patch makes redundant attempts to release backends. In addition, the patch looks to be a mixture of the current implementation and the new feature. These serve the same objective, so they cannot coexist with each other, I think. As a result, code for both quorum/priority judgement appears at multiple levels in the call tree. This would be an obstacle for future (possible) expansion.

So, I think this feature should be implemented as follows:

SyncRepInitConfig reads the configuration and stores the resulting structure somewhere such as WalSnd->syncrepset_definition, instead of WalSnd->sync_standby_priority, which should be removed. Nothing would be stored if the current walsender is not a member of the defined replication set. Storing a pointer to a matching function there would increase flexibility, but such an implementation would in contrast make the code difficult to read. (I often have to look for the entity of xlogreader->read_page();)

Then SyncRepSyncedLsnAdvancedTo(), instead of SyncRepGetSynchronousStandbys(), returns an LSN pair that can be regarded as 'synced' according to the specified definition of the replication set, and whether this walsender has advanced the LSNs.

Finally, SyncRepReleaseWaiters() uses it to release backends if needed. The differences among quorum/priority or others are confined to SyncRepSyncedLsnAdvancedTo(). As a result, SyncRepReleaseWaiters would look like the following:

| SyncRepReleaseWaiters(void)
| {
|     if (MyWalSnd->syncrepset_definition == NULL || ...)
|         return;
|     ...
|     if (!SyncRepSyncedLsnAdvancedTo(&flush_pos, &write_pos))
|     {
|         /* I haven't advanced the synced LSNs */
|         LWLockRelease(SyncRepLock);
|         return;
|     }
|     /* Set the lsn first so that when we wake backends they will release... */

I haven't thought concretely about what SyncRepSyncedLsnAdvancedTo does, but perhaps yes, we can :p, in an effective manner.

What do you think about this?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Nov 13, 2015 at 12:52 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > At Fri, 13 Nov 2015 09:07:01 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoC9Vi8wOGtXio3Z1NwoVfXBJPNFtt7+5jadVHKn17uHOg@mail.gmail.com> >> On Thu, Oct 29, 2015 at 11:16 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> > On Thu, Oct 22, 2015 at 12:47 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > ... >> >> This patch is a tool for discussion, so I'm not going to fix this bug >> >> until getting consensus. >> >> >> >> We are still under the discussion to find solution that can get consensus. >> >> I felt that it's difficult to select from the two approaches within >> >> this development cycle, and there would not be time to implement such >> >> big feature even if we selected. >> >> But this feature is obviously needed by many users. >> >> So I'm considering more simple and extensible something solution, the >> >> idea I posted is one of them. >> >> The another worth considering approach is that just specifying the >> >> number of sync standby. It also can cover the main use cases in >> >> some-cases. >> > >> > Yes, it covers main and simple use case like "I want to have multiple >> > synchronous replicas!". Even if we miss quorum commit at the first >> > version, the feature is still very useful. > > +1 > >> It can cover not only the case you mentioned but also main use case >> Michael mentioned by setting same application_name. >> And that first version patch is almost implemented, so just needs to >> be reviewed. >> >> I think that it would be good to implement the simple feature at the >> first version, and then coordinate the design based on opinion and >> feed backs from more user, use-case. > > Yeah. I agree with it. And I have two proposals in this > direction. > > - Notation > > synchronous_standby_names, and synchronous_replication_method as > a variable to provide other syntax is probably no argument > except its name. 
But I feel synchronous_standby_num looks bit
> too specific.
>
> I'd like to propose if this doesn't reprise the argument on
> notation for replication definitions:p
>
> The following two GUCs would be enough to bear future expansion
> of notation syntax and/or method.
>
> synchronous_standby_names : as it is
>
> synchronous_replication_method:
>
> default is "1-priority", which means the same with the current
> meaning. possible additional values so far would be,
>
> "n-priority": the format of s_s_names is "n, <name>, <name>, <name>...",
> where n is the number of required acknowledges.

One question: what is the difference between the leading "n" in s_s_names and the "n" of "n-priority"?

> "n-quorum": the format of s_s_names is the same as above, but
> it is read in quorum context.
>
> These can be expanded, for example, as follows, but in future.
>
> "complex" : Michael's format.
> "json" : JSON?
> "json-ext": specify JSON in external file.
>
> Even after we have complex notations, I suppose that many use
> cases are coverd by the first tree notations.

I'm not sure it's desirable to implement all kinds of methods in core. I think it's better to make replication more extensible, for example by adding a hook function, and then to implement the other approaches as contrib modules.

> - Internal design
>
> What should be done in SyncRepReleaseWaiters() is calculating a
> pair of LSNs that can be regarded as synced and decide whether
> *this* walsender have advanced the LSN pair, then trying to
> release backends that wait for the LSNs *if* this walsender has
> advanced them.
>
> From such point, the proposed patch will make redundant trials
> to release backens.
>
> Addition to that, the patch looks to be a mixture of the current
> implement and the new feature. These are for the same objective
> so they cannot coexist each other, I think. As the result, codes
> for both quorum/priority judgement appear at multiple level in
> call tree.
This would be an obstacle for future (possible) > expansion. > > So, I think this feature should be implemented as following, > > SyncRepInitConfig reads the configuration and stores the result > structure into elsewhere such like WalSnd->syncrepset_definition > instead of WalSnd->sync_standby_priority, which should be > removed. Nothing would be stored if the current wal sender is > not a member of the defined replication set. Storing a pointer > to matching function there would increase the flexibility but > such implement in contrast will make the code difficult to be > read.. (I often look for the entity of xlogreader->read_page() > ;) > > Then SyncRepSyncedLsnAdvancedTo() instead of > SyncRepGetSynchronousStandbys() returns an LSN pair that can be > regarded as 'synced' according to specified definition of > replication set and whether this walsender have advanced the > LSNs. > > Finally, SyncRepReleaseWaiters() uses it to release backends if > needed. > > The differences among quorum/priority or others are confined in > SyncRepSyncedLsnAdvancedTo(). As the result, > SyncRepReleaseWaiters would look as following. > > | SyncRepReleaseWaiters(void) > | { > | if (MyWalSnd->syncrepset_definition == NULL || ...) > | return; > | ... > | if (!SyncRepSyncedLsnAdvancedTo(&flush_pos, &write_pos)) > | { > | /* I haven't advanced the synced LSNs */ > | LWLockRelease(SyncRepLock); > | rerturn; > | } > | /* Set the lsn first so that when we wake backends they will relase... > > I'm not thought concretely about what SyncRepSyncedLsnAdvancedTo > does but perhaps yes we can:p in effective manner.. > > What do you think about this? I agree with this design. What SyncRepSyncedLsnAdvancedTo() does would be different for each method, so we can implement "n-priority" style multiple sync replication at first version. Regards, -- Masahiko Sawada
Hello,

At Tue, 17 Nov 2015 01:09:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDhqGB=EtBfqnkHxR8T53d+8qMs4DPm5HVyq4bA2oR5eQ@mail.gmail.com>
> > - Notation
> >
> > synchronous_standby_names, and synchronous_replication_method as
> > a variable to provide other syntax is probably no argument
> > except its name. But I feel synchronous_standby_num looks bit
> > too specific.
> >
> > I'd like to propose if this doesn't reprise the argument on
> > notation for replication definitions:p
> >
> > The following two GUCs would be enough to bear future expansion
> > of notation syntax and/or method.
> >
> > synchronous_standby_names : as it is
> >
> > synchronous_replication_method:
> >
> > default is "1-priority", which means the same with the current
> > meaning. possible additional values so far would be,
> >
> > "n-priority": the format of s_s_names is "n, <name>, <name>, <name>...",
> > where n is the number of required acknowledges.
>
> One question is that what is different between the leading "n" in
> s_s_names and the leading "n" of "n-priority"?

Ah. Sorry for the ambiguous description. The 'n' in s_s_names represents an arbitrary integer, while the one in "n-priority" is literally an "n", meaning "a format with any number of priority hosts" as a whole. For instance:

synchronous_replication_method = "n-priority"
synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter"

I added the "n-" of "n-priority" to distinguish it from "1-priority", so if we don't provide "1-priority" for backward compatibility, "priority" would be enough to represent the type.

By the way, s_r_method is not strictly necessary, but it would be important for avoiding the complexity of autodetecting formats, including currently undefined ones.

> > "n-quorum": the format of s_s_names is the same as above, but
> > it is read in quorum context.

The "n" in this is the same as above.

> > These can be expanded, for example, as follows, but in future.
> > > > "complex" : Michael's format. > > "json" : JSON? > > "json-ext": specify JSON in external file. > > > > Even after we have complex notations, I suppose that many use > > cases are coverd by the first tree notations. > > I'm not sure it's desirable to implement the all kind of methods into core. > I think it's better to extend replication in order to be more > extensibility like adding hook function. > And then other approach is implemented as a contrib module. I agree with you. I proposed the following internal design having that in mind. > > - Internal design > > > > What should be done in SyncRepReleaseWaiters() is calculating a > > pair of LSNs that can be regarded as synced and decide whether > > *this* walsender have advanced the LSN pair, then trying to > > release backends that wait for the LSNs *if* this walsender has > > advanced them. > > > > From such point, the proposed patch will make redundant trials > > to release backens. > > > > Addition to that, the patch looks to be a mixture of the current > > implement and the new feature. These are for the same objective > > so they cannot coexist each other, I think. As the result, codes > > for both quorum/priority judgement appear at multiple level in > > call tree. This would be an obstacle for future (possible) > > expansion. > > > > So, I think this feature should be implemented as following, > > > > SyncRepInitConfig reads the configuration and stores the result > > structure into elsewhere such like WalSnd->syncrepset_definition > > instead of WalSnd->sync_standby_priority, which should be > > removed. Nothing would be stored if the current wal sender is > > not a member of the defined replication set. Storing a pointer > > to matching function there would increase the flexibility but > > such implement in contrast will make the code difficult to be > > read.. 
(I often look for the entity of xlogreader->read_page() > > ;) > > > > Then SyncRepSyncedLsnAdvancedTo() instead of > > SyncRepGetSynchronousStandbys() returns an LSN pair that can be > > regarded as 'synced' according to specified definition of > > replication set and whether this walsender have advanced the > > LSNs. > > > > Finally, SyncRepReleaseWaiters() uses it to release backends if > > needed. > > > > The differences among quorum/priority or others are confined in > > SyncRepSyncedLsnAdvancedTo(). As the result, > > SyncRepReleaseWaiters would look as following. > > > > | SyncRepReleaseWaiters(void) > > | { > > | if (MyWalSnd->syncrepset_definition == NULL || ...) > > | return; > > | ... > > | if (!SyncRepSyncedLsnAdvancedTo(&flush_pos, &write_pos)) > > | { > > | /* I haven't advanced the synced LSNs */ > > | LWLockRelease(SyncRepLock); > > | rerturn; > > | } > > | /* Set the lsn first so that when we wake backends they will relase... > > > > I'm not thought concretely about what SyncRepSyncedLsnAdvancedTo > > does but perhaps yes we can:p in effective manner.. > > > > What do you think about this? > > I agree with this design. > What SyncRepSyncedLsnAdvancedTo() does would be different for each > method, so we can implement "n-priority" style multiple sync > replication at first version. Maybe the first *additional* one if we decide to keep backward compatibility, as the discussion above. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Nov 17, 2015 at 9:57 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > At Tue, 17 Nov 2015 01:09:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDhqGB=EtBfqnkHxR8T53d+8qMs4DPm5HVyq4bA2oR5eQ@mail.gmail.com> >> > - Notation >> > >> > synchronous_standby_names, and synchronous_replication_method as >> > a variable to provide other syntax is probably no argument >> > except its name. But I feel synchronous_standby_num looks bit >> > too specific. >> > >> > I'd like to propose if this doesn't reprise the argument on >> > notation for replication definitions:p >> > >> > The following two GUCs would be enough to bear future expansion >> > of notation syntax and/or method. >> > >> > synchronous_standby_names : as it is >> > >> > synchronous_replication_method: >> > >> > default is "1-priority", which means the same with the current >> > meaning. possible additional values so far would be, >> > >> > "n-priority": the format of s_s_names is "n, <name>, <name>, <name>...", >> > where n is the number of required acknowledges. >> >> One question is that what is different between the leading "n" in >> s_s_names and the leading "n" of "n-priority"? > > Ah. Sorry for the ambiguous description. 'n' in s_s_names > representing an arbitrary integer number and that in "n-priority" > is literally an "n", meaning "a format with any number of > priority hosts" as a whole. As an instance, > > synchronous_replication_method = "n-priority" > synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter" > > I added "n-" of "n-priority" to distinguish with "1-priority" so > if we won't provide "1-priority" for backward compatibility, > "priority" would be enough to represent the type. > > By the way, s_r_method is not essentially necessary but it would > be important to avoid complexity of autodetection of formats > including currently undefined ones. Than you for your explanation, I understood that. 
That means the format of s_s_names will be changed, which would not be good. So, how about adding just the s_r_method parameter, with the number of required ACKs represented in the leading part of s_r_method?

For example, the following setting is the same as the above:

synchronous_replication_method = "2-priority"
synchronous_standby_names = "mercury, venus, earth, mars, jupiter"

With the quorum method, we can set:

synchronous_replication_method = "2-quorum"
synchronous_standby_names = "mercury, venus, earth, mars, jupiter"

Thoughts?

>
>
>> > "n-quorum": the format of s_s_names is the same as above, but
>> > it is read in quorum context.
>
> The "n" of this is the same as above.
>
>> > These can be expanded, for example, as follows, but in future.
>> >
>> > "complex" : Michael's format.
>> > "json" : JSON?
>> > "json-ext": specify JSON in external file.
>> >
>> > Even after we have complex notations, I suppose that many use
>> > cases are coverd by the first tree notations.
>>
>> I'm not sure it's desirable to implement the all kind of methods into core.
>> I think it's better to extend replication in order to be more
>> extensibility like adding hook function.
>> And then other approach is implemented as a contrib module.
>
> I agree with you. I proposed the following internal design having
> that in mind.
>
>> > - Internal design
>> >
>> > What should be done in SyncRepReleaseWaiters() is calculating a
>> > pair of LSNs that can be regarded as synced and decide whether
>> > *this* walsender have advanced the LSN pair, then trying to
>> > release backends that wait for the LSNs *if* this walsender has
>> > advanced them.
>> >
>> > From such point, the proposed patch will make redundant trials
>> > to release backens.
>> >
>> > Addition to that, the patch looks to be a mixture of the current
>> > implement and the new feature. These are for the same objective
>> > so they cannot coexist each other, I think.
As the result, codes >> > for both quorum/priority judgement appear at multiple level in >> > call tree. This would be an obstacle for future (possible) >> > expansion. >> > >> > So, I think this feature should be implemented as following, >> > >> > SyncRepInitConfig reads the configuration and stores the result >> > structure into elsewhere such like WalSnd->syncrepset_definition >> > instead of WalSnd->sync_standby_priority, which should be >> > removed. Nothing would be stored if the current wal sender is >> > not a member of the defined replication set. Storing a pointer >> > to matching function there would increase the flexibility but >> > such implement in contrast will make the code difficult to be >> > read.. (I often look for the entity of xlogreader->read_page() >> > ;) >> > >> > Then SyncRepSyncedLsnAdvancedTo() instead of >> > SyncRepGetSynchronousStandbys() returns an LSN pair that can be >> > regarded as 'synced' according to specified definition of >> > replication set and whether this walsender have advanced the >> > LSNs. >> > >> > Finally, SyncRepReleaseWaiters() uses it to release backends if >> > needed. >> > >> > The differences among quorum/priority or others are confined in >> > SyncRepSyncedLsnAdvancedTo(). As the result, >> > SyncRepReleaseWaiters would look as following. >> > >> > | SyncRepReleaseWaiters(void) >> > | { >> > | if (MyWalSnd->syncrepset_definition == NULL || ...) >> > | return; >> > | ... >> > | if (!SyncRepSyncedLsnAdvancedTo(&flush_pos, &write_pos)) >> > | { >> > | /* I haven't advanced the synced LSNs */ >> > | LWLockRelease(SyncRepLock); >> > | rerturn; >> > | } >> > | /* Set the lsn first so that when we wake backends they will relase... >> > >> > I'm not thought concretely about what SyncRepSyncedLsnAdvancedTo >> > does but perhaps yes we can:p in effective manner.. >> > >> > What do you think about this? >> >> I agree with this design. 
>> What SyncRepSyncedLsnAdvancedTo() does would be different for each >> method, so we can implement "n-priority" style multiple sync >> replication at first version. > > Maybe the first *additional* one if we decide to keep backward > compatibility, as the discussion above. > Regards, -- Masahiko Sawada
Hello,

At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoC=AN+DKYNwsJp6COZ-6qmHXxuENxVPisxgPXcuXmPEvw@mail.gmail.com>
> >> One question is that what is different between the leading "n" in
> >> s_s_names and the leading "n" of "n-priority"?
> >
> > Ah. Sorry for the ambiguous description. 'n' in s_s_names
> > representing an arbitrary integer number and that in "n-priority"
> > is literally an "n", meaning "a format with any number of
> > priority hosts" as a whole. As an instance,
> >
> > synchronous_replication_method = "n-priority"
> > synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter"
> >
> > I added "n-" of "n-priority" to distinguish with "1-priority" so
> > if we won't provide "1-priority" for backward compatibility,
> > "priority" would be enough to represent the type.
> >
> > By the way, s_r_method is not essentially necessary but it would
> > be important to avoid complexity of autodetection of formats
> > including currently undefined ones.
>
> Than you for your explanation, I understood that.
>
> It means that the format of s_s_names will be changed, which would be not good.

I believe that the format of the "replication set"(?) definition is not fixed, and it could become a more complex format in order to support nested definitions. That would be a very different format from the current simple list of names. So this is a choice among three or possibly more designs, made so as to be tolerant of future changes, I suppose:

1. Additional formats of definitions will in the future be stored somewhere other than s_s_names.

2. Additional formats will be stored in s_s_names, with the format detected automatically.

3. (ditto), with the format designated by s_r_method.

4. Any other way?

I chose the third way. What do you think about future expansion of the format?

> So, how about the adding just s_r_method parameter and the number of
> required ACK is represented in the leading of s_r_method?
> For example, the following setting is same as above. > > synchronous_replication_method = "2-priority" > synchronous_standby_names = "mercury, venus, earth, mars, jupiter" I *feel* it is the same or worse as having the third parameter s_s_num as your previous design. > In quorum method, we can set; > synchronous_replication_method = "2-quorum" > synchronous_standby_names = "mercury, venus, earth, mars, jupiter" > > Thought? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Oops. At Tue, 17 Nov 2015 19:40:10 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20151117.194010.17198448.horiguchi.kyotaro@lab.ntt.co.jp> > Hello, > > At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoC=AN+DKYNwsJp6COZ-6qmHXxuENxVPisxgPXcuXmPEvw@mail.gmail.com> > > >> One question is that what is different between the leading "n" in > > >> s_s_names and the leading "n" of "n-priority"? > > > > > > Ah. Sorry for the ambiguous description. 'n' in s_s_names > > > representing an arbitrary integer number and that in "n-priority" > > > is literally an "n", meaning "a format with any number of > > > priority hosts" as a whole. As an instance, > > > > > > synchronous_replication_method = "n-priority" > > > synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter" > > > > > > I added "n-" of "n-priority" to distinguish with "1-priority" so > > > if we won't provide "1-priority" for backward compatibility, > > > "priority" would be enough to represent the type. > > > > > > By the way, s_r_method is not essentially necessary but it would > > > be important to avoid complexity of autodetection of formats > > > including currently undefined ones. > > > > Than you for your explanation, I understood that. > > > > It means that the format of s_s_names will be changed, which would be not good. > > I believe that the format of definition of "replication set"(?) > is not fixed and it would be more complex format to support > nested definition. This should be in very different format from > the current simple list of names. This is a selection among three > or possiblly more disigns in order to be tolerable for future > changes, I suppose. > > 1. Additional formats of definition in future will be stored in > elsewhere of s_s_names. > > 2. Additional format will be stored in s_s_names, the format will > be automatically detected. > > 3. (ditto), the format is designated by s_r_method. 
> > 4. Any other way? > > I choosed the third way. What do you think about future expansion > of the format? > > > So, how about the adding just s_r_method parameter and the number of > > required ACK is represented in the leading of s_r_method? > > For example, the following setting is same as above. > > > > synchronous_replication_method = "2-priority" > > synchronous_standby_names = "mercury, venus, earth, mars, jupiter" > > I *feel* it is the same or worse as having the third parameter > s_s_num as your previous design. I feel it is the same or worse *than* having the third parameter s_s_num as your previous design. > > In quorum method, we can set; > > synchronous_replication_method = "2-quorum" > > synchronous_standby_names = "mercury, venus, earth, mars, jupiter" > > > > Thought? > > > regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Nov 17, 2015 at 7:52 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Oops. > > At Tue, 17 Nov 2015 19:40:10 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20151117.194010.17198448.horiguchi.kyotaro@lab.ntt.co.jp> >> Hello, >> >> At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoC=AN+DKYNwsJp6COZ-6qmHXxuENxVPisxgPXcuXmPEvw@mail.gmail.com> >> > >> One question is that what is different between the leading "n" in >> > >> s_s_names and the leading "n" of "n-priority"? >> > > >> > > Ah. Sorry for the ambiguous description. 'n' in s_s_names >> > > representing an arbitrary integer number and that in "n-priority" >> > > is literally an "n", meaning "a format with any number of >> > > priority hosts" as a whole. As an instance, >> > > >> > > synchronous_replication_method = "n-priority" >> > > synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter" >> > > >> > > I added "n-" of "n-priority" to distinguish with "1-priority" so >> > > if we won't provide "1-priority" for backward compatibility, >> > > "priority" would be enough to represent the type. >> > > >> > > By the way, s_r_method is not essentially necessary but it would >> > > be important to avoid complexity of autodetection of formats >> > > including currently undefined ones. >> > >> > Than you for your explanation, I understood that. >> > >> > It means that the format of s_s_names will be changed, which would be not good. >> >> I believe that the format of definition of "replication set"(?) >> is not fixed and it would be more complex format to support >> nested definition. This should be in very different format from >> the current simple list of names. This is a selection among three >> or possiblly more disigns in order to be tolerable for future >> changes, I suppose. >> >> 1. Additional formats of definition in future will be stored in >> elsewhere of s_s_names. >> >> 2. 
Additional format will be stored in s_s_names, the format will
>> be automatically detected.
>>
>> 3. (ditto), the format is designated by s_r_method.
>>
>> 4. Any other way?
>>
>> I choosed the third way. What do you think about future expansion
>> of the format?
>>

I agree with way #3 and the s_s_names format you suggested. I think it's extensible and tolerant of future changes. I'm going to implement a patch based on this idea if other hackers agree with this design.

Regards,

--
Masahiko Sawada
On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Tue, Nov 17, 2015 at 7:52 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> Oops. >> >> At Tue, 17 Nov 2015 19:40:10 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20151117.194010.17198448.horiguchi.kyotaro@lab.ntt.co.jp> >>> Hello, >>> >>> At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoC=AN+DKYNwsJp6COZ-6qmHXxuENxVPisxgPXcuXmPEvw@mail.gmail.com> >>> > >> One question is that what is different between the leading "n" in >>> > >> s_s_names and the leading "n" of "n-priority"? >>> > > >>> > > Ah. Sorry for the ambiguous description. 'n' in s_s_names >>> > > representing an arbitrary integer number and that in "n-priority" >>> > > is literally an "n", meaning "a format with any number of >>> > > priority hosts" as a whole. As an instance, >>> > > >>> > > synchronous_replication_method = "n-priority" >>> > > synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter" >>> > > >>> > > I added "n-" of "n-priority" to distinguish with "1-priority" so >>> > > if we won't provide "1-priority" for backward compatibility, >>> > > "priority" would be enough to represent the type. >>> > > >>> > > By the way, s_r_method is not essentially necessary but it would >>> > > be important to avoid complexity of autodetection of formats >>> > > including currently undefined ones. >>> > >>> > Than you for your explanation, I understood that. >>> > >>> > It means that the format of s_s_names will be changed, which would be not good. >>> >>> I believe that the format of definition of "replication set"(?) >>> is not fixed and it would be more complex format to support >>> nested definition. This should be in very different format from >>> the current simple list of names. This is a selection among three >>> or possiblly more disigns in order to be tolerable for future >>> changes, I suppose. 
>>>
>>> 1. Additional formats of definition in future will be stored in
>>> elsewhere of s_s_names.
>>>
>>> 2. Additional format will be stored in s_s_names, the format will
>>> be automatically detected.
>>>
>>> 3. (ditto), the format is designated by s_r_method.
>>>
>>> 4. Any other way?
>>>
>>> I choosed the third way. What do you think about future expansion
>>> of the format?
>>>
>
> I agree with #3 way and the s_s_name format you suggested.
> I think that It's extensible and is tolerable for future changes.
> I'm going to implement the patch based on this idea if other hackers
> agree with this design.
>

Please find the attached draft patch, which supports multiple synchronous replication. This patch adds a GUC parameter, synchronous_replication_method, which represents the method of synchronous replication.

[Design of replication method]

synchronous_replication_method has two values for now: 'priority' and '1-priority'. We can expand the set of values (e.g. 'quorum', 'json', etc.) in the future.

* s_r_method = '1-priority'
This method exists for backward compatibility, so the syntax of s_s_names is the same as today. The behavior is the same as well.

* s_r_method = 'priority'
This method provides multiple synchronous replication using the priority method. The syntax of s_s_names is:

    <number of sync standbys>, <standby name> [, ...]

For example, s_r_method = 'priority' and s_s_names = '2, node1, node2, node3' mean that the master waits for acknowledgement from the 2 standbys with the lowest priority values. If 4 standbys (node1 - node4) are available, the master waits for acknowledgement from 'node1' and 'node2'. The status of each WAL sender is then:

=# select application_name, sync_state from pg_stat_replication order by application_name;
 application_name | sync_state
------------------+------------
 node1            | sync
 node2            | sync
 node3            | potential
 node4            | async
(4 rows)

After 'node2' crashes, the master will wait for acknowledgement from 'node1' and 'node3'. The status of each WAL sender becomes:

=# select application_name, sync_state from pg_stat_replication order by application_name;
 application_name | sync_state
------------------+------------
 node1            | sync
 node3            | sync
 node4            | async
(3 rows)

[Changing replication method]

When we want to change the replication method, we have to change s_r_method first and then run pg_reload_conf(). After changing the replication method, we can change s_s_names.

[Expanding replication method]

If we want to add a new replication method, we need to implement two functions for it:

* int SyncRepGetSynchronousStandbysXXX(int *sync_standbys)
This function obtains the list of standbys considered synchronous at that time, and returns its length.

* bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
This function obtains the LSNs (write, flush) considered synced.

Also, debug code still remains in this patch; you can trace the behavior by enabling the DEBUG_REPLICATION macro.

Please give me feedback.

Regards,

--
Masahiko Sawada
Attachment
On Wed, Dec 9, 2015 at 8:59 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Tue, Nov 17, 2015 at 7:52 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>> Oops. >>> >>> At Tue, 17 Nov 2015 19:40:10 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20151117.194010.17198448.horiguchi.kyotaro@lab.ntt.co.jp> >>>> Hello, >>>> >>>> At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoC=AN+DKYNwsJp6COZ-6qmHXxuENxVPisxgPXcuXmPEvw@mail.gmail.com> >>>> > >> One question is that what is different between the leading "n" in >>>> > >> s_s_names and the leading "n" of "n-priority"? >>>> > > >>>> > > Ah. Sorry for the ambiguous description. 'n' in s_s_names >>>> > > representing an arbitrary integer number and that in "n-priority" >>>> > > is literally an "n", meaning "a format with any number of >>>> > > priority hosts" as a whole. As an instance, >>>> > > >>>> > > synchronous_replication_method = "n-priority" >>>> > > synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter" >>>> > > >>>> > > I added "n-" of "n-priority" to distinguish with "1-priority" so >>>> > > if we won't provide "1-priority" for backward compatibility, >>>> > > "priority" would be enough to represent the type. >>>> > > >>>> > > By the way, s_r_method is not essentially necessary but it would >>>> > > be important to avoid complexity of autodetection of formats >>>> > > including currently undefined ones. >>>> > >>>> > Than you for your explanation, I understood that. >>>> > >>>> > It means that the format of s_s_names will be changed, which would be not good. >>>> >>>> I believe that the format of definition of "replication set"(?) >>>> is not fixed and it would be more complex format to support >>>> nested definition. This should be in very different format from >>>> the current simple list of names. 
This is a selection among three >>>> or possibly more designs in order to be tolerable for future >>>> changes, I suppose. >>>> >>>> 1. Additional formats of definition in future will be stored in >>>> elsewhere of s_s_names. >>>> >>>> 2. Additional format will be stored in s_s_names, the format will >>>> be automatically detected. >>>> >>>> 3. (ditto), the format is designated by s_r_method. >>>> >>>> 4. Any other way? >>>> >>>> I chose the third way. What do you think about future expansion >>>> of the format? >>>> >> >> I agree with #3 way and the s_s_name format you suggested. >> I think that it's extensible and is tolerable for future changes. >> I'm going to implement the patch based on this idea if other hackers >> agree with this design. >> > > Please find the attached draft patch which supports multi sync replication. > This patch adds a GUC parameter synchronous_replication_method, which > represents the method of synchronous replication. > > [Design of replication method] > synchronous_replication_method has two values; 'priority' and > '1-priority' for now. > We can expand the kind of its value (e.g, 'quorum', 'json' etc) in the future. > > * s_r_method = '1-priority' > This method is for backward compatibility, so the syntax of s_s_names > is same as today. > The behavior is same as well. > > * s_r_method = 'priority' > This method is for multiple synchronous replication using priority method. > The syntax of s_s_names is, > <number of sync standbys>, <standby name> [, ...] > > For example, s_r_method = 'priority' and s_s_names = '2, node1, node2, > node3' means that the master waits for acknowledge from at least 2 > lowest priority servers. > If 4 standbys(node1 - node4) are available, the master server waits > acknowledge from 'node1' and 'node2'. 
> The each status of wal senders are; > > =# select application_name, sync_state from pg_stat_replication order > by application_name; > application_name | sync_state > ------------------+------------ > node1 | sync > node2 | sync > node3 | potential > node4 | async > (4 rows) > > After 'node2' crashed, the master will wait for acknowledge from > 'node1' and 'node3'. > The each status of wal senders are; > > =# select application_name, sync_state from pg_stat_replication order > by application_name; > application_name | sync_state > ------------------+------------ > node1 | sync > node3 | sync > node4 | async > (3 rows) > > [Changing replication method] > When we want to change the replication method, we have to change the > s_r_method at first, and then do pg_reload_conf(). > After changing replication method, we can change the s_s_names. > > [Expanding replication method] > If we want to expand new replication method additionally, we need to > implement two functions for each replication method: > * int SyncRepGetSynchronousStandbysXXX(int *sync_standbys) > This function obtains the list of standbys considered as synchronous > at that time, and return its length. > * bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) > This function obtains LSNs(write, flush) considered as synced. > > Also, debug code remains in this patch; you can debug this behavior > by enabling the DEBUG_REPLICATION macro. > > Please give me feedback. > I've attached an updated patch. Please give me feedback. Regards, -- Masahiko Sawada
Attachment
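For illustration, the proposed 'priority' format of synchronous_standby_names ('<number of sync standbys>, <standby name> [, ...]') could be parsed roughly as follows. This is only a stand-alone sketch: the function name parse_sync_names and the buffer sizes are made up for this example, and the actual patch uses PostgreSQL's SplitIdentifierString() and list machinery instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of parsing "2, node1, node2, node3" into a sync
 * count and a list of standby names.  Returns the number of names parsed,
 * or -1 for a nonsensical configuration (e.g. more sync standbys requested
 * than names listed). */
static int
parse_sync_names(const char *raw, int *num_sync, char names[][64], int max_names)
{
    char buf[256];
    char *tok;
    int  n = 0;

    strncpy(buf, raw, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    tok = strtok(buf, ",");
    if (tok == NULL)
        return -1;
    *num_sync = atoi(tok);          /* leading element is the sync count */

    while ((tok = strtok(NULL, ",")) != NULL && n < max_names)
    {
        while (*tok == ' ')         /* trim leading spaces */
            tok++;
        strncpy(names[n++], tok, 63);
    }
    if (*num_sync > n)              /* cannot wait for more standbys than listed */
        return -1;
    return n;
}
```

Under this reading, '2, node1, node2, node3' yields a sync count of 2 and three candidate names, matching the example in the mail above.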
Thank you for the new patch. At Wed, 9 Dec 2015 20:59:20 +0530, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDcn1fToCcYRqpU6fMY1xnpDdAKDTcbhW1R9M1mPM0kZg@mail.gmail.com> > On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I agree with #3 way and the s_s_name format you suggested. > > I think that It's extensible and is tolerable for future changes. > > I'm going to implement the patch based on this idea if other hackers > > agree with this design. > > Please find the attached draft patch which supports multi sync replication. > This patch adds a GUC parameter synchronous_replication_method, which > represent the method of synchronous replication. > > [Design of replication method] > synchronous_replication_method has two values; 'priority' and > '1-priority' for now. > We can expand the kind of its value (e.g, 'quorum', 'json' etc) in the future. > > * s_r_method = '1-priority' > This method is for backward compatibility, so the syntax of s_s_names > is same as today. > The behavior is same as well. > > * s_r_method = 'priority' > This method is for multiple synchronous replication using priority method. > The syntax of s_s_names is, > <number of sync standbys>, <standby name> [, ...] Is there anyone opposed to this? > For example, s_r_method = 'priority' and s_s_names = '2, node1, node2, > node3' means that the master waits for acknowledge from at least 2 > lowest priority servers. > If 4 standbys(node1 - node4) are available, the master server waits > acknowledge from 'node1' and 'node2. > The each status of wal senders are; > > =# select application_name, sync_state from pg_stat_replication order > by application_name; > application_name | sync_state > ------------------+------------ > node1 | sync > node2 | sync > node3 | potential > node4 | async > (4 rows) > > After 'node2' crashed, the master will wait for acknowledge from > 'node1' and 'node3'. 
> The each status of wal senders are; > > =# select application_name, sync_state from pg_stat_replication order > by application_name; > application_name | sync_state > ------------------+------------ > node1 | sync > node3 | sync > node4 | async > (3 rows) > > [Changing replication method] > When we want to change the replication method, we have to change the > s_r_method at first, and then do pg_reload_conf(). > After changing replication method, we can change the s_s_names. Mmm. It should be possible to change them at once, because s_r_method and s_s_names contradict each other during the intermediate state. > [Expanding replication method] > If we want to expand new replication method additionally, we need to > implement two functions for each replication method: > * int SyncRepGetSynchronousStandbysXXX(int *sync_standbys) > This function obtains the list of standbys considered as synchronous > at that time, and return its length. > * bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) > This function obtains LSNs(write, flush) considered as synced. > > Also, debug code remains in this patch; you can debug this behavior > by enabling the DEBUG_REPLICATION macro. > > Please give me feedback. I haven't looked into this fully (sorry) but I'm concerned about several points. - I feel that some function names look too long. For example SyncRepGetSynchronousStandbysOnePriority occupies more than half of a line. (However, the replication code already has many long function names..) - The comment below in SyncRepGetSynchronousStandbyOnePriority, > /* Find lowest priority standby */ The code the comment is attached to is doing the correct thing. However, the comment is confusing. A lower priority *value* means a higher priority. - SyncRepGetSynchronousStandbys checks all if()s even when the first one matches. Use switch or "else if" there if they are exclusive of each other. 
- Do you intend the DEBUG_REPLICATION code in SyncRepGetSynchronousStandbys*() to be the final shape? The same code blocks which can work for both methods should be in their common caller, but SyncRepGetSyncLsns*() are a headache. Although it might need more refactoring, I'm sorry but I don't see a desirable shape for now. By the way, palloc(20)/free() over such a short term looks inefficient. - SyncRepGetSyncLsnsPriority For the comment "/* Find lowest XLogRecPtr of both write and flush from sync_nodes */", LSN is compared as early or late, so the comment would be better to be something like "Keep/Collect the earliest write and flush LSNs among prioritized standbys". And what is more important, this block handles the write and flush LSNs jumbled together, which results in missing the earliest (= most delayed) LSN for certain cases. The following is an example. Standby 1: write LSN = 10, flush LSN = 5 Standby 2: write LSN = 8 , flush LSN = 6 For this case, we finally get tmp_write = 10 and tmp_flush = 5 from the current code, where tmp_write has a wrong value since LSN = 10 has *not* been written yet on standby 2. (the names "tmp_*" don't seem appropriate here) regards, -- Kyotaro Horiguchi NTT Open Source Software Center
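The point about the jumbled LSNs can be seen in a small stand-alone sketch (min_synced_lsns is a hypothetical name, and XLogRecPtr is simplified to a plain 64-bit integer here): the earliest write position and the earliest flush position must each be tracked independently across the sync standbys.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* simplified stand-in for PostgreSQL's type */

/* For each of write and flush independently, keep the minimum (earliest,
 * i.e. most delayed) LSN across the n sync standbys.  Tracking the two
 * fields jointly, as in the reviewed code, can pick a write LSN that some
 * standby has not reached yet. */
static void
min_synced_lsns(const XLogRecPtr *write, const XLogRecPtr *flush, int n,
                XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
{
    int i;

    *write_pos = write[0];
    *flush_pos = flush[0];
    for (i = 1; i < n; i++)
    {
        if (write[i] < *write_pos)
            *write_pos = write[i];
        if (flush[i] < *flush_pos)
            *flush_pos = flush[i];
    }
}
```

With the example above (standby 1: write 10, flush 5; standby 2: write 8, flush 6), this yields write = 8 and flush = 5, not write = 10.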
On Mon, Dec 14, 2015 at 2:57 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Thank you for the new patch. > > At Wed, 9 Dec 2015 20:59:20 +0530, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDcn1fToCcYRqpU6fMY1xnpDdAKDTcbhW1R9M1mPM0kZg@mail.gmail.com> >> On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> > I agree with #3 way and the s_s_name format you suggested. >> > I think that It's extensible and is tolerable for future changes. >> > I'm going to implement the patch based on this idea if other hackers >> > agree with this design. >> >> Please find the attached draft patch which supports multi sync replication. >> This patch adds a GUC parameter synchronous_replication_method, which >> represent the method of synchronous replication. >> >> [Design of replication method] >> synchronous_replication_method has two values; 'priority' and >> '1-priority' for now. >> We can expand the kind of its value (e.g, 'quorum', 'json' etc) in the future. >> >> * s_r_method = '1-priority' >> This method is for backward compatibility, so the syntax of s_s_names >> is same as today. >> The behavior is same as well. >> >> * s_r_method = 'priority' >> This method is for multiple synchronous replication using priority method. >> The syntax of s_s_names is, >> <number of sync standbys>, <standby name> [, ...] > > Is there anyone opposed to this? > >> For example, s_r_method = 'priority' and s_s_names = '2, node1, node2, >> node3' means that the master waits for acknowledge from at least 2 >> lowest priority servers. >> If 4 standbys(node1 - node4) are available, the master server waits >> acknowledge from 'node1' and 'node2. 
>> The each status of wal senders are; >> >> =# select application_name, sync_state from pg_stat_replication order >> by application_name; >> application_name | sync_state >> ------------------+------------ >> node1 | sync >> node2 | sync >> node3 | potential >> node4 | async >> (4 rows) >> >> After 'node2' crashed, the master will wait for acknowledge from >> 'node1' and 'node3'. >> The each status of wal senders are; >> >> =# select application_name, sync_state from pg_stat_replication order >> by application_name; >> application_name | sync_state >> ------------------+------------ >> node1 | sync >> node3 | sync >> node4 | async >> (3 rows) >> >> [Changing replication method] >> When we want to change the replication method, we have to change the >> s_r_method at first, and then do pg_reload_conf(). >> After changing replication method, we can change the s_s_names. Thank you for reviewing the patch! Please find attached latest patch. > Mmm. I should be able to be changed at once, because s_r_method > and s_s_names contradict each other during the intermediate > state. Sorry to confuse you. I meant the case where we want to change the replication method using ALTER SYSTEM. >> [Expanding replication method] >> If we want to expand new replication method additionally, we need to >> implement two functions for each replication method: >> * int SyncRepGetSynchronousStandbysXXX(int *sync_standbys) >> This function obtains the list of standbys considered as synchronous >> at that time, and return its length. >> * bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) >> This function obtains LSNs(write, flush) considered as synced. >> >> Also, this patch debug code is remain yet, you can debug this behavior >> using by enable DEBUG_REPLICATION macro. >> >> Please give me feedbacks. > > I haven't looked into this fully (sorry) but I'm concerned about > several points. > > > - I feel that some function names looks too long. 
For example > SyncRepGetSynchronousStandbysOnePriority occupies more than the > half of a line. (However, the replication code alrady has many > long function names..) Yeah, it would be better to change 'Synchronous' to 'Sync' at least. > - The comment below of SyncRepGetSynchronousStandbyOnePriority, > > /* Find lowest priority standby */ > > The code where the comment is for is doing the correct > thing. Howerver, the comment is confusing. A lower priority > *value* means a higher priority. Fixed. > - SyncRepGetSynchronousStandbys checks all if()s even when the > first one matches. Use switch or "else if" there if you they > are exclusive each other. Fixed. > - Do you intende the DEBUG_REPLICATION code in > SyncRepGetSynchronousStandbys*() to be the final shape? The > same code blocks which can work for both method should be in > their common caller but SyncRepGetSyncLsns*() are > headache. Although it might need more refactoring, I'm sorry > but I don't see a desirable shape for now. I don't intend the DEBUG_REPLICATION code to be the final shape; it has been removed from this version of the patch. > By the way, palloc(20)/free() in such short term looks > ineffective. > > - SyncRepGetSyncLsnsPriority > > For the comment "/* Find lowest XLogRecPtr of both write and > flush from sync_nodes */", LSN is compared as early or late so > the comment would be better to be something like "Keep/Collect > the earliest write and flush LSNs among prioritized standbys". Fixed. > And what is more important, this block handles write and flush > LSN jumbled and it reults in missing the earliest(= most > delayed) LSN for certain cases. The following is an example. > > Standby 1: write LSN = 10, flush LSN = 5 > Standby 2: write LSN = 8 , flush LSN = 6 > > For this case, finally we get tmp_write = 10 and tmp_flush = 5 > from the current code, where tmp_write has wrong value since > LSN = 10 has *not* been written yet on standby 2. 
(the names > "tmp_*" don't seem appropriate here) > You are right. We have to handle the write and flush LSNs individually and get the lowest of each. For example, in this case we have to get write = 8, flush = 5. I've changed the logic accordingly. Regards, -- Masahiko Sawada
Attachment
On Fri, Dec 18, 2015 at 7:38 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: [000-_multi_sync_replication_v3.patch] Hi Masahiko, I haven't tested this version of the patch but I have some comments on the code. +/* Is this wal sender considerable one? */ +bool +SyncRepActiveListedWalSender(int num) Maybe "Is this wal sender managing a standby that is streaming and listed as a synchronous standby?" +/* + * Obtain three palloc'd arrays containing position of standbys currently + * considered as synchronous, and its length. + */ +int +SyncRepGetSyncStandbys(int *sync_standbys) This comment seems to be out of date. I would say "Populate a caller-supplied array which much have enough space for ... Returns ...". +/* + * Obtain standby currently considered as synchronous using + * '1-priority' method. + */ +int +SyncRepGetSyncStandbysOnePriority(int *sync_standbys) + ... code ... Why do we need a separate function and code path for this case? If you used SyncRepGetSyncStandbysPriority with a size of 1, should it not produce the same result in the same time complexity? +/* + * Obtain standby currently considered as synchronous using + * 'priority' method. + */ +int +SyncRepGetSyncStandbysPriority(int *sync_standbys) I would say something more descriptive, maybe like this: "Populates a caller-supplied buffer with the walsnds indexes of the highest priority active synchronous standbys, up to the a limit of 'synchronous_standby_num'. The order of the results is undefined. Returns the number of results actually written." If you got rid of SyncRepGetSyncStandbysOnePriority as suggested above, then this function could be renamed to SyncRepGetSyncStandbys. I think it would be a tiny bit nicer if it also took a Size n argument along with the output buffer pointer. As for the body of that function (which I won't paste here), it contains an algorithm to find the top K elements in an array of N elements. 
It does that with a linear search through the top K seen so far for each value in the input array, so its worst case is O(KN) comparisons. Some of the sorting gurus on this list might have something to say about that but my take is that it seems fine for the tiny values of K and N that we're dealing with here, and it's nice that it doesn't need any space other than the output buffer, unlike some other top-K algorithms which would win for larger inputs. + /* Found sync standby */ This comment would be clearer as "Found lowest priority standby, so replace it". + if (walsndloc->sync_standby_priority == priority && + walsnd->sync_standby_priority < priority) + sync_standbys[j] = i; In this case, couldn't you also update 'priority' directly, and break out of the loop immediately? Wouldn't "lowest_priority" be a better variable name than "priority"? It might be good to say "lowest" rather than "highest" in the nearby comments, to be consistent with other parts of the code including the function name (lower priority number means higher priority!). +/* + * Obtain currently synced LSN: write and flush, + * using '1-prioirty' method. s/prioirty/priority/ + */ +bool +SyncRepGetSyncLsnsOnePriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) Similar to the earlier case, why have a special case for 1-priority? Wouldn't SyncRepGetSyncLsnsPriority produce the same result when is synchronous_standby_num == 1? +/* + * Obtain currently synced LSN: write and flush, + * using 'prioirty' method. s/prioirty/priority/ +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) +{ + int *sync_standbys = NULL; + int num_sync; + int i; + XLogRecPtr synced_write = InvalidXLogRecPtr; + XLogRecPtr synced_flush = InvalidXLogRecPtr; + + sync_standbys = (int *) palloc(sizeof(int) * synchronous_standby_num); Would a fixed size buffer on the stack (of compile time constant size) be better than palloc/free in here and elsewhere? 
+ /* + for (i = 0; i < num_sync; i++) + { + volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]]; + if (walsndloc == MyWalSnd) + { + found = true; + break; + } + } + */ Dead code. + if (synchronous_replication_method == SYNC_REP_METHOD_1_PRIORITY) + synchronous_standby_num = 1; + else + synchronous_standby_num = pg_atoi(lfirst(list_head(elemlist)), sizeof(int), 0); Should we detect if synchronous_standby_num > the number of listed servers, which would be a nonsensical configuration? Should we also impose some other kind of constant limits, like must be >= 0 (I haven't tried but I wonder if -1 leads to very large palloc) and must be <= MAX_XXX (smallish sanity check number like 256, rather than the INT_MAX limit imposed by pg_atoi), so that we could use that constant to size stack buffers in the places where you currently palloc? Could 1-priority mode be inferred from the use of a non-number in the leading position, and if so, does the mode concept even need to exist, especially if SyncRepGetSyncLsnsOnePriority and SyncRepGetSyncStandbysOnePriority aren't really needed either way? Is there any difference in behaviour between the following configurations? (Sorry if that particular question has already been duked out in the long thread about GUCs.) synchronous_replication_method = 1-priority synchronous_standby_names = foo, bar synchronous_replication_method = priority synchronous_standby_names = 1, foo, bar (Apologies for the missing leading whitespace in patch fragments pasted above, it seems that my mail client has eaten it). -- Thomas Munro http://www.enterprisedb.com
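The top-K selection described in the review above could look roughly like the following stand-alone sketch (top_k_by_priority is a hypothetical name; a priority of 0 marks a walsender that is not a sync candidate here). It keeps the indexes of the K highest-priority (lowest priority-number) entries seen so far, replacing the current worst on each improvement, which gives the O(KN) comparison bound discussed.

```c
#include <assert.h>

/* Populate out[] with the indexes of the (at most) k entries of prio[]
 * that have the lowest nonzero priority numbers (lower number = higher
 * priority).  Returns the number of results actually written.  The order
 * of the results is undefined. */
static int
top_k_by_priority(const int *prio, int n, int *out, int k)
{
    int count = 0;
    int i, j, worst;

    for (i = 0; i < n; i++)
    {
        if (prio[i] == 0)
            continue;               /* not a sync candidate: skip */
        if (count < k)
        {
            out[count++] = i;       /* output buffer not yet full */
            continue;
        }
        worst = 0;                  /* find kept entry with the largest number */
        for (j = 1; j < k; j++)
            if (prio[out[j]] > prio[out[worst]])
                worst = j;
        if (prio[i] < prio[out[worst]])
            out[worst] = i;         /* replace the lowest-priority entry */
    }
    return count;
}
```

The inner scan only needs the output buffer for storage, which is the property noted in the review; the later suggestion of a fixed-size stack array for out[] fits this shape naturally.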
On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Fri, Dec 18, 2015 at 7:38 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > [000-_multi_sync_replication_v3.patch] > > Hi Masahiko, > > I haven't tested this version of the patch but I have some comments on the code. > > +/* Is this wal sender considerable one? */ > +bool > +SyncRepActiveListedWalSender(int num) > > Maybe "Is this wal sender managing a standby that is streaming and > listed as a synchronous standby?" > > +/* > + * Obtain three palloc'd arrays containing position of standbys currently > + * considered as synchronous, and its length. > + */ > +int > +SyncRepGetSyncStandbys(int *sync_standbys) > > This comment seems to be out of date. I would say "Populate a > caller-supplied array which much have enough space for ... Returns > ...". > > +/* > + * Obtain standby currently considered as synchronous using > + * '1-priority' method. > + */ > +int > +SyncRepGetSyncStandbysOnePriority(int *sync_standbys) > + ... code ... > > Why do we need a separate function and code path for this case? If > you used SyncRepGetSyncStandbysPriority with a size of 1, should it > not produce the same result in the same time complexity? > > +/* > + * Obtain standby currently considered as synchronous using > + * 'priority' method. > + */ > +int > +SyncRepGetSyncStandbysPriority(int *sync_standbys) > > I would say something more descriptive, maybe like this: "Populates a > caller-supplied buffer with the walsnds indexes of the highest > priority active synchronous standbys, up to the a limit of > 'synchronous_standby_num'. The order of the results is undefined. > Returns the number of results actually written." > > If you got rid of SyncRepGetSyncStandbysOnePriority as suggested > above, then this function could be renamed to SyncRepGetSyncStandbys. > I think it would be a tiny bit nicer if it also took a Size n argument > along with the output buffer pointer. 
> > As for the body of that function (which I won't paste here), it > contains an algorithm to find the top K elements in an array of N > elements. It does that with a linear search through the top K seen so > far for each value in the input array, so its worst case is O(KN) > comparisons. Some of the sorting gurus on this list might have > something to say about that but my take is that it seems fine for the > tiny values of K and N that we're dealing with here, and it's nice > that it doesn't need any space other than the output buffer, unlike > some other top-K algorithms which would win for larger inputs. > > + /* Found sync standby */ > > This comment would be clearer as "Found lowest priority standby, so replace it". > > + if (walsndloc->sync_standby_priority == priority && > + walsnd->sync_standby_priority < priority) > + sync_standbys[j] = i; > > In this case, couldn't you also update 'priority' directly, and break > out of the loop immediately? Oops, I didn't think that though: you can't break from the loop, you still need to find the new lowest priority, so I retract that bit. > Wouldn't "lowest_priority" be a better > variable name than "priority"? It might be good to say "lowest" > rather than "highest" in the nearby comments, to be consistent with > other parts of the code including the function name (lower priority > number means higher priority!). > > +/* > + * Obtain currently synced LSN: write and flush, > + * using '1-prioirty' method. > > s/prioirty/priority/ > > + */ > +bool > +SyncRepGetSyncLsnsOnePriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) > > Similar to the earlier case, why have a special case for 1-priority? > Wouldn't SyncRepGetSyncLsnsPriority produce the same result when is > synchronous_standby_num == 1? > > +/* > + * Obtain currently synced LSN: write and flush, > + * using 'prioirty' method. 
> > s/prioirty/priority/ > > +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) > +{ > + int *sync_standbys = NULL; > + int num_sync; > + int i; > + XLogRecPtr synced_write = InvalidXLogRecPtr; > + XLogRecPtr synced_flush = InvalidXLogRecPtr; > + > + sync_standbys = (int *) palloc(sizeof(int) * synchronous_standby_num); > > Would a fixed size buffer on the stack (of compile time constant size) > be better than palloc/free in here and elsewhere? > > + /* > + for (i = 0; i < num_sync; i++) > + { > + volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]]; > + if (walsndloc == MyWalSnd) > + { > + found = true; > + break; > + } > + } > + */ > > Dead code. > > + if (synchronous_replication_method == SYNC_REP_METHOD_1_PRIORITY) > + synchronous_standby_num = 1; > + else > + synchronous_standby_num = pg_atoi(lfirst(list_head(elemlist)), > sizeof(int), 0); > > Should we detect if synchronous_standby_num > the number of listed > servers, which would be a nonsensical configuration? Should we also > impose some other kind of constant limits, like must be >= 0 (I > haven't tried but I wonder if -1 leads to very large palloc) and must > be <= MAX_XXX (smallish sanity check number like 256, rather than the > INT_MAX limit imposed by pg_atoi), so that we could use that constant > to size stack buffers in the places where you currently palloc? > > Could 1-priority mode be inferred from the use of a non-number in the > leading position, and if so, does the mode concept even need to exist, > especially if SyncRepGetSyncLsnsOnePriority and > SyncRepGetSyncStandbysOnePriority aren't really needed either way? Is > there any difference in behaviour between the following > configurations? (Sorry if that particular question has already been > duked out in the long thread about GUCs.) 
> > synchronous_replication_method = 1-priority > synchronous_standby_names = foo, bar > > synchronous_replication_method = priority > synchronous_standby_names = 1, foo, bar > > (Apologies for the missing leading whitespace in patch fragments > pasted above, it seems that my mail client has eaten it). -- Thomas Munro http://www.enterprisedb.com
On Wed, Dec 23, 2015 at 12:15 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Review stuff I have moved this entry to next CF as review is quite recent. -- Michael
On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Fri, Dec 18, 2015 at 7:38 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> [000-_multi_sync_replication_v3.patch] >> >> Hi Masahiko, >> >> I haven't tested this version of the patch but I have some comments on the code. >> >> +/* Is this wal sender considerable one? */ >> +bool >> +SyncRepActiveListedWalSender(int num) >> >> Maybe "Is this wal sender managing a standby that is streaming and >> listed as a synchronous standby?" Fixed. >> +/* >> + * Obtain three palloc'd arrays containing position of standbys currently >> + * considered as synchronous, and its length. >> + */ >> +int >> +SyncRepGetSyncStandbys(int *sync_standbys) >> >> This comment seems to be out of date. I would say "Populate a >> caller-supplied array which much have enough space for ... Returns >> ...". Fixed. >> +/* >> + * Obtain standby currently considered as synchronous using >> + * '1-priority' method. >> + */ >> +int >> +SyncRepGetSyncStandbysOnePriority(int *sync_standbys) >> + ... code ... >> >> Why do we need a separate function and code path for this case? If >> you used SyncRepGetSyncStandbysPriority with a size of 1, should it >> not produce the same result in the same time complexity? I was thinking that we could add a new function like SyncRepGetSyncStandbysXXXXX (where XXXXX is the replication method name) if we want to expand the kinds of replication methods, so I included the replication method name in the function name. But one function is enough for the two current replication methods: priority and 1-priority. >> +/* >> + * Obtain standby currently considered as synchronous using >> + * 'priority' method. 
>> + */ >> +int >> +SyncRepGetSyncStandbysPriority(int *sync_standbys) >> >> I would say something more descriptive, maybe like this: "Populates a >> caller-supplied buffer with the walsnds indexes of the highest >> priority active synchronous standbys, up to the a limit of >> 'synchronous_standby_num'. The order of the results is undefined. >> Returns the number of results actually written." Fixed. >> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested >> above, then this function could be renamed to SyncRepGetSyncStandbys. >> I think it would be a tiny bit nicer if it also took a Size n argument >> along with the output buffer pointer. Sorry, I could not get your point. The SyncRepGetSyncStandbysPriority() function uses synchronous_standby_num, which is a global variable. Do you mean that the number of synchronous standbys should be given as a function argument? >> As for the body of that function (which I won't paste here), it >> contains an algorithm to find the top K elements in an array of N >> elements. It does that with a linear search through the top K seen so >> far for each value in the input array, so its worst case is O(KN) >> comparisons. Some of the sorting gurus on this list might have >> something to say about that but my take is that it seems fine for the >> tiny values of K and N that we're dealing with here, and it's nice >> that it doesn't need any space other than the output buffer, unlike >> some other top-K algorithms which would win for larger inputs. Yeah, it's an improvement point, but I assumed that the number of synchronous standbys is not large, so I used this algorithm as a first version. And I think that its worst case is O(K(N-K)). Am I missing something? >> + /* Found sync standby */ >> >> This comment would be clearer as "Found lowest priority standby, so replace it". Fixed. 
>> + if (walsndloc->sync_standby_priority == priority && >> + walsnd->sync_standby_priority < priority) >> + sync_standbys[j] = i; >> >> In this case, couldn't you also update 'priority' directly, and break >> out of the loop immediately? > > Oops, I didn't think that though: you can't break from the loop, you > still need to find the new lowest priority, so I retract that bit. > >> Wouldn't "lowest_priority" be a better >> variable name than "priority"? It might be good to say "lowest" >> rather than "highest" in the nearby comments, to be consistent with >> other parts of the code including the function name (lower priority >> number means higher priority!). >> >> +/* >> + * Obtain currently synced LSN: write and flush, >> + * using '1-prioirty' method. >> >> s/prioirty/priority/ >> >> + */ >> +bool >> +SyncRepGetSyncLsnsOnePriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) >> >> Similar to the earlier case, why have a special case for 1-priority? >> Wouldn't SyncRepGetSyncLsnsPriority produce the same result when is >> synchronous_standby_num == 1? >> >> +/* >> + * Obtain currently synced LSN: write and flush, >> + * using 'prioirty' method. >> >> s/prioirty/priority/ >> >> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) >> +{ >> + int *sync_standbys = NULL; >> + int num_sync; >> + int i; >> + XLogRecPtr synced_write = InvalidXLogRecPtr; >> + XLogRecPtr synced_flush = InvalidXLogRecPtr; >> + >> + sync_standbys = (int *) palloc(sizeof(int) * synchronous_standby_num); >> >> Would a fixed size buffer on the stack (of compile time constant size) >> be better than palloc/free in here and elsewhere? >> >> + /* >> + for (i = 0; i < num_sync; i++) >> + { >> + volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]]; >> + if (walsndloc == MyWalSnd) >> + { >> + found = true; >> + break; >> + } >> + } >> + */ >> >> Dead code. 
>> >> + if (synchronous_replication_method == SYNC_REP_METHOD_1_PRIORITY) >> + synchronous_standby_num = 1; >> + else >> + synchronous_standby_num = pg_atoi(lfirst(list_head(elemlist)), >> sizeof(int), 0); Fixed. >> Should we detect if synchronous_standby_num > the number of listed >> servers, which would be a nonsensical configuration? Should we also >> impose some other kind of constant limits, like must be >= 0 (I >> haven't tried but I wonder if -1 leads to very large palloc) and must >> be <= MAX_XXX (smallish sanity check number like 256, rather than the >> INT_MAX limit imposed by pg_atoi), so that we could use that constant >> to size stack buffers in the places where you currently palloc? Yeah, I added a validation check for s_s_num. >> Could 1-priority mode be inferred from the use of a non-number in the >> leading position, and if so, does the mode concept even need to exist, >> especially if SyncRepGetSyncLsnsOnePriority and >> SyncRepGetSyncStandbysOnePriority aren't really needed either way? Is >> there any difference in behaviour between the following >> configurations? (Sorry if that particular question has already been >> duked out in the long thread about GUCs.) >> >> synchronous_replication_method = 1-priority >> synchronous_standby_names = foo, bar >> >> synchronous_replication_method = priority >> synchronous_standby_names = 1, foo, bar The behaviour under both configurations is the same. I added the '1-priority' method for backward compatibility. The default value of s_r_method is '1-priority', so users who are using sync replication can continue to use it smoothly after upgrading. >> (Apologies for the missing leading whitespace in patch fragments >> pasted above, it seems that my mail client has eaten it). No problem. Thank you for reviewing! > I have moved this entry to next CF as review is quite recent. Thanks! Attached latest version patch. Please review it. Regards, -- Masahiko Sawada
Attachment
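The validation discussed above — synchronous_standby_num must be at least 1, bounded by a compile-time cap (so that it can size stack buffers), and no larger than the number of listed standbys — could be sketched as follows. The function name, the cap value, and the message strings here are illustrative assumptions, not the patch's actual code:

```c
#include <stddef.h>

/*
 * Sketch of the suggested range check for synchronous_standby_num.
 * An assumed compile-time cap replaces the INT_MAX limit that pg_atoi
 * alone would impose; callers could use the same constant to size
 * stack buffers instead of calling palloc.
 */
#define SYNC_REP_MAX_SYNC_STANDBY_NUM 256

/* Returns NULL when the setting is valid, otherwise an error message. */
static const char *
check_sync_standby_num(int sync_num, int listed_standbys)
{
    if (sync_num < 1 || sync_num > SYNC_REP_MAX_SYNC_STANDBY_NUM)
        return "the number of synchronous standbys must be between 1 and the maximum";

    /* More sync standbys than listed names is a nonsensical configuration. */
    if (sync_num > listed_standbys)
        return "the number of synchronous standbys exceeds the length of the standby list";

    return NULL;
}
```

In the real patch this logic lives in GUC checking and reports through ereport(ERROR, ...) rather than returning a string; the sketch only shows the conditions being debated.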
On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
>>> above, then this function could be renamed to SyncRepGetSyncStandbys.
>>> I think it would be a tiny bit nicer if it also took a Size n argument
>>> along with the output buffer pointer.
>
> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority()
> function uses synchronous_standby_num which is global variable.
> But you mean that the number of synchronous standbys is given as
> function argument?

Yeah, I was thinking of it as the output buffer size which I would be
inclined to make more explicit (I am still coming to terms with the
use of global variables in Postgres) but it doesn't matter, please
disregard that suggestion.

>>> As for the body of that function (which I won't paste here), it
>>> contains an algorithm to find the top K elements in an array of N
>>> elements. It does that with a linear search through the top K seen so
>>> far for each value in the input array, so its worst case is O(KN)
>>> comparisons. Some of the sorting gurus on this list might have
>>> something to say about that but my take is that it seems fine for the
>>> tiny values of K and N that we're dealing with here, and it's nice
>>> that it doesn't need any space other than the output buffer, unlike
>>> some other top-K algorithms which would win for larger inputs.
>
> Yeah, it's improvement point.
> But I'm assumed that the number of synchronous replication is not
> large, so I use this algorithm as first version.
> And I think that its worst case is O(K(N-K)). Am I missing something?

You're right, I was dropping that detail, in the tradition of the
hand-wavy school of big-O notation.
(I suppose you could skip the inner loop when the priority is lower
than the current lowest priority, giving a O(N) best case when the
walsenders are perfectly ordered by coincidence. Probably a bad idea
or just not worth worrying about.)

> Attached latest version patch.

+/*
+ * Obtain currently synced LSN location: write and flush, using priority
- * In 9.1 we support only a single synchronous standby, chosen from a
- * priority list of synchronous_standby_names. Before it can become the
+ * In 9.6 we support multiple synchronous standby, chosen from a priority

s/standby/standbys/

+ * list of synchronous_standby_names. Before it can become the

s/Before it can become the/Before any standby can become a/

 * synchronous standby it must have caught up with the primary; that may
 * take some time. Once caught up, the current highest priority standby

s/standby/standbys/

 * will release waiters from the queue.

+bool
+SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
+{
+ int sync_standbys[synchronous_standby_num];

I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM].
(Variable sized arrays are a feature of C99 and PostgreSQL is written
in C89.)

+/*
+ * Populate a caller-supplied array which much have enough space for
+ * synchronous_standby_num. Returns position of standbys currently
+ * considered as synchronous, and its length.
+ */
+int
+SyncRepGetSyncStandbys(int *sync_standbys)

s/much/must/ (my bad, in previous email).

+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("The number of synchronous standbys must be smaller than the number of listed : %d",
+ synchronous_standby_num)));

How about "the number of synchronous standbys exceeds the length of
the standby list: %d"? Error messages usually start with lower case,
':' is not usually preceded by a space.

+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("The number of synchronous standbys must be between 1 and %d : %d",

s/The/the/, s/ : /: /

--
Thomas Munro
http://www.enterprisedb.com
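The top-K selection being reviewed above — keep the K wal senders with the lowest priority numbers, scanning the K chosen so far for each candidate (O(K(N-K)) worst case), and skip the inner loop when the candidate cannot displace anyone — can be sketched standalone like this. The function and variable names are illustrative, not the patch's actual code:

```c
/*
 * Illustrative sketch (NOT the patch's code) of priority-based top-K
 * selection.  A lower priority number means a higher priority; 0 means
 * the wal sender is asynchronous.  Indexes of the chosen senders are
 * written to sync_standbys; the number chosen is returned.
 */
static int
pick_sync_standbys(const int *priorities, int nsenders, int k,
                   int *sync_standbys)
{
    int num_sync = 0;
    int lowest_priority = 0;   /* largest number among chosen; 0 = none */
    int i, j;

    for (i = 0; i < nsenders; i++)
    {
        if (priorities[i] <= 0)
            continue;          /* async wal sender, not a candidate */

        if (num_sync < k)
        {
            /* Still filling the output buffer: take it unconditionally. */
            sync_standbys[num_sync++] = i;
            if (priorities[i] > lowest_priority)
                lowest_priority = priorities[i];
            continue;
        }

        /* Skip the inner loop when this sender cannot displace anyone. */
        if (priorities[i] >= lowest_priority)
            continue;

        /* Replace the chosen entry holding the current lowest priority. */
        for (j = 0; j < k; j++)
        {
            if (priorities[sync_standbys[j]] == lowest_priority)
            {
                sync_standbys[j] = i;
                break;
            }
        }

        /* Recompute the lowest priority among the chosen entries. */
        lowest_priority = 0;
        for (j = 0; j < k; j++)
            if (priorities[sync_standbys[j]] > lowest_priority)
                lowest_priority = priorities[sync_standbys[j]];
    }
    return num_sync;
}
```

For example, with priorities {0, 3, 1, 2, 5, 1} and k = 2, the two senders with priority 1 (indexes 2 and 5) end up selected, and the skip test avoids the inner scan for the priority-5 sender entirely.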
On Fri, Dec 25, 2015 at 7:21 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro >>> <thomas.munro@enterprisedb.com> wrote: >>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested >>>> above, then this function could be renamed to SyncRepGetSyncStandbys. >>>> I think it would be a tiny bit nicer if it also took a Size n argument >>>> along with the output buffer pointer. >> >> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority() >> function uses synchronous_standby_num which is global variable. >> But you mean that the number of synchronous standbys is given as >> function argument? > > Yeah, I was thinking of it as the output buffer size which I would be > inclined to make more explicit (I am still coming to terms with the > use of global variables in Postgres) but it doesn't matter, please > disregard that suggestion. > >>>> As for the body of that function (which I won't paste here), it >>>> contains an algorithm to find the top K elements in an array of N >>>> elements. It does that with a linear search through the top K seen so >>>> far for each value in the input array, so its worst case is O(KN) >>>> comparisons. Some of the sorting gurus on this list might have >>>> something to say about that but my take is that it seems fine for the >>>> tiny values of K and N that we're dealing with here, and it's nice >>>> that it doesn't need any space other than the output buffer, unlike >>>> some other top-K algorithms which would win for larger inputs. >> >> Yeah, it's improvement point. >> But I'm assumed that the number of synchronous replication is not >> large, so I use this algorithm as first version. >> And I think that its worst case is O(K(N-K)). Am I missing something? 
> > You're right, I was dropping that detail, in the tradition of the > hand-wavy school of big-O notation. (I suppose you could skip the > inner loop when the priority is lower than the current lowest > priority, giving a O(N) best case when the walsenders are perfectly > ordered by coincidence. Probably a bad idea or just not worth > worrying about.) Thank you for reviewing the patch. Yeah, I added the logic that skip the inner loop. > >> Attached latest version patch. > > +/* > + * Obtain currently synced LSN location: write and flush, using priority > - * In 9.1 we support only a single synchronous standby, chosen from a > - * priority list of synchronous_standby_names. Before it can become the > + * In 9.6 we support multiple synchronous standby, chosen from a priority > > s/standby/standbys/ > > + * list of synchronous_standby_names. Before it can become the > > s/Before it can become the/Before any standby can become a/ > > * synchronous standby it must have caught up with the primary; that may > * take some time. Once caught up, the current highest priority standby > > s/standby/standbys/ > > * will release waiters from the queue. > > +bool > +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) > +{ > + int sync_standbys[synchronous_standby_num]; > > I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM]. > (Variable sized arrays are a feature of C99 and PostgreSQL is written > in C89.) > > +/* > + * Populate a caller-supplied array which much have enough space for > + * synchronous_standby_num. Returns position of standbys currently > + * considered as synchronous, and its length. > + */ > +int > +SyncRepGetSyncStandbys(int *sync_standbys) > > s/much/must/ (my bad, in previous email). 
>
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("The number of synchronous standbys must be smaller than the
> number of listed : %d",
> + synchronous_standby_num)));
>
> How about "the number of synchronous standbys exceeds the length of
> the standby list: %d"? Error messages usually start with lower case,
> ':' is not usually preceded by a space.
>
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("The number of synchronous standbys must be between 1 and %d : %d",
>
> s/The/the/, s/ : /: /

Fixed the issues you mentioned.

Attached latest v5 patch.
Please review it.

Regards,

--
Masahiko Sawada
Attachment
On Sun, Jan 3, 2016 at 10:26 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Dec 25, 2015 at 7:21 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro >>> <thomas.munro@enterprisedb.com> wrote: >>>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro >>>> <thomas.munro@enterprisedb.com> wrote: >>>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested >>>>> above, then this function could be renamed to SyncRepGetSyncStandbys. >>>>> I think it would be a tiny bit nicer if it also took a Size n argument >>>>> along with the output buffer pointer. >>> >>> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority() >>> function uses synchronous_standby_num which is global variable. >>> But you mean that the number of synchronous standbys is given as >>> function argument? >> >> Yeah, I was thinking of it as the output buffer size which I would be >> inclined to make more explicit (I am still coming to terms with the >> use of global variables in Postgres) but it doesn't matter, please >> disregard that suggestion. >> >>>>> As for the body of that function (which I won't paste here), it >>>>> contains an algorithm to find the top K elements in an array of N >>>>> elements. It does that with a linear search through the top K seen so >>>>> far for each value in the input array, so its worst case is O(KN) >>>>> comparisons. Some of the sorting gurus on this list might have >>>>> something to say about that but my take is that it seems fine for the >>>>> tiny values of K and N that we're dealing with here, and it's nice >>>>> that it doesn't need any space other than the output buffer, unlike >>>>> some other top-K algorithms which would win for larger inputs. >>> >>> Yeah, it's improvement point. 
>>> But I'm assumed that the number of synchronous replication is not >>> large, so I use this algorithm as first version. >>> And I think that its worst case is O(K(N-K)). Am I missing something? >> >> You're right, I was dropping that detail, in the tradition of the >> hand-wavy school of big-O notation. (I suppose you could skip the >> inner loop when the priority is lower than the current lowest >> priority, giving a O(N) best case when the walsenders are perfectly >> ordered by coincidence. Probably a bad idea or just not worth >> worrying about.) > > Thank you for reviewing the patch. > Yeah, I added the logic that skip the inner loop. > >> >>> Attached latest version patch. >> >> +/* >> + * Obtain currently synced LSN location: write and flush, using priority >> - * In 9.1 we support only a single synchronous standby, chosen from a >> - * priority list of synchronous_standby_names. Before it can become the >> + * In 9.6 we support multiple synchronous standby, chosen from a priority >> >> s/standby/standbys/ >> >> + * list of synchronous_standby_names. Before it can become the >> >> s/Before it can become the/Before any standby can become a/ >> >> * synchronous standby it must have caught up with the primary; that may >> * take some time. Once caught up, the current highest priority standby >> >> s/standby/standbys/ >> >> * will release waiters from the queue. >> >> +bool >> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) >> +{ >> + int sync_standbys[synchronous_standby_num]; >> >> I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM]. >> (Variable sized arrays are a feature of C99 and PostgreSQL is written >> in C89.) >> >> +/* >> + * Populate a caller-supplied array which much have enough space for >> + * synchronous_standby_num. Returns position of standbys currently >> + * considered as synchronous, and its length. 
>> + */ >> +int >> +SyncRepGetSyncStandbys(int *sync_standbys) >> >> s/much/must/ (my bad, in previous email). >> >> + ereport(ERROR, >> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), >> + errmsg("The number of synchronous standbys must be smaller than the >> number of listed : %d", >> + synchronous_standby_num))); >> >> How about "the number of synchronous standbys exceeds the length of >> the standby list: %d"? Error messages usually start with lower case, >> ':' is not usually preceded by a space. >> >> + ereport(ERROR, >> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), >> + errmsg("The number of synchronous standbys must be between 1 and %d : %d", >> >> s/The/the/, s/ : /: / > > Fixed you mentioned. > > Attached latest v5 patch. > Please review it. Something that I find rather scary with this patch: could it be possible to get actual regression tests now that there is more machinery with PostgresNode.pm? As syncrep code paths get more and more complex, so are debugging and maintenance. -- Michael
Hello,

At Mon, 4 Jan 2016 15:29:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTp5RoHxcp8YxejGMjRjjtLaXCa8=-BEr7ZnBNbPzPdWA@mail.gmail.com>
> > Attached latest v5 patch.
> > Please review it.
>
> Something that I find rather scary with this patch: could it be
> possible to get actual regression tests now that there is more
> machinery with PostgresNode.pm? As syncrep code paths get more and
> more complex, so are debugging and maintenance.

A test of the whole replication system is very likely to be too
complex and hard to stabilize, and would be disproportionately large
compared to the other tests.

This patch mainly changes the logic that chooses the next syncrep
standbys and calculates the 'synced' LSNs, so performing separate
module tests of those logics, then testing the resulting behavior via,
perhaps, PostgresNode.pm, would remarkably reduce the labor of testing.
Could we have some tapping point to test the logics individually in an
appropriate way?

In order to do so, the logics should be able to be fed an arbitrary,
complete set of parameters; in other words, we would define a kind of
API for invoking the logics from the core side, even though it is not
an extension. Then we would *somehow* kick the API with some sets of
parameters in the regression tests.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Jan 8, 2016 at 1:53 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > At Mon, 4 Jan 2016 15:29:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTp5RoHxcp8YxejGMjRjjtLaXCa8=-BEr7ZnBNbPzPdWA@mail.gmail.com> >> > Attached latest v5 patch. >> > Please review it. >> >> Something that I find rather scary with this patch: could it be >> possible to get actual regression tests now that there is more >> machinery with PostgresNode.pm? As syncrep code paths get more and >> more complex, so are debugging and maintenance. > > The test on the whole replication system will very likely to be > too complex and hard to stabilize, and would be > disproportionately large to other tests. I don't buy that much. Mind you, there is in this commit fest a patch introducing a basic regression test suite for recovery using the new infrastructure that has been committed last month. You may want to look at it. > This patch mainly changes the logic to choose the next syncrep > standbys and calculate the 'synched' LSNs, so performing separate > module tests for the logics, then perform the test for the > behavior according to the result of that by, perhaps, > PostgresNode.pm would remarkably reduce the labor for > testing. > Could we have some tapping point for individual testing of the > logics in appropriate way? Isn't pg_stat_replication enough for this purpose? What you basically need to do is set up a master, a set of slaves and then look at the WAL sender status. Am I getting that wrong? > In order to do so, the logics should be able to be fed arbitrary > complete set of parameters, in other words, defining a kind of > API to use the logics from the core side, even though it is not > an extension. Then we will *somehow* kick the API with some set > of parameters in regest. 
Well, you will need to craft, in the syncrep test suite associated
with this patch, a set of routines that appropriately set up
s_s_names and the other parameters this patch introduces. It does
not sound like an impossible barrier to cross.
--
Michael
Michael Paquier wrote: > On Fri, Jan 8, 2016 at 1:53 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello, > > > > At Mon, 4 Jan 2016 15:29:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTp5RoHxcp8YxejGMjRjjtLaXCa8=-BEr7ZnBNbPzPdWA@mail.gmail.com> > >> > >> Something that I find rather scary with this patch: could it be > >> possible to get actual regression tests now that there is more > >> machinery with PostgresNode.pm? As syncrep code paths get more and > >> more complex, so are debugging and maintenance. > > > > The test on the whole replication system will very likely to be > > too complex and hard to stabilize, and would be > > disproportionately large to other tests. > > I don't buy that much. Mind you, there is in this commit fest a patch > introducing a basic regression test suite for recovery using the new > infrastructure that has been committed last month. You may want to > look at it. Kyotaro, please have a look at this patch: https://commitfest.postgresql.org/8/438/ which is the recovery test framework Michael is talking about. Is it possible to use that framework to write tests for this feature? If so, then my preferred course of action would be to commit that patch and then introduce in this patch some additional tests for the N-sync-standby feature. Can you please have a look at the test framework patch and provide your feedback on how usable it is for this? Thanks, -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 13, 2016 at 1:54 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>> On Fri, Jan 8, 2016 at 1:53 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > Hello,
>> >
>> > At Mon, 4 Jan 2016 15:29:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTp5RoHxcp8YxejGMjRjjtLaXCa8=-BEr7ZnBNbPzPdWA@mail.gmail.com>
>> >>
>> >> Something that I find rather scary with this patch: could it be
>> >> possible to get actual regression tests now that there is more
>> >> machinery with PostgresNode.pm? As syncrep code paths get more and
>> >> more complex, so are debugging and maintenance.
>> >
>> > The test on the whole replication system will very likely to be
>> > too complex and hard to stabilize, and would be
>> > disproportionately large to other tests.
>>
>> I don't buy that much. Mind you, there is in this commit fest a patch
>> introducing a basic regression test suite for recovery using the new
>> infrastructure that has been committed last month. You may want to
>> look at it.
>
> Kyotaro, please have a look at this patch:
> https://commitfest.postgresql.org/8/438/
> which is the recovery test framework Michael is talking about. Is it
> possible to use that framework to write tests for this feature? If so,
> then my preferred course of action would be to commit that patch and
> then introduce in this patch some additional tests for the N-sync-standby
> feature. Can you please have a look at the test framework patch and
> provide your feedback on how usable it is for this?
>

I had a look at that patch.
I'm planning to have at least the following tests for multiple
synchronous replication:

* Confirm value of pg_stat_replication.sync_state (sync, async or potential)
* Confirm that the data is synchronously replicated to multiple
standbys in some cases:
* case 1: The standby which is not listed in s_s_names is down
* case 2: The standby which is listed in s_s_names but is a potential
standby is down
* case 3: The standby which is considered a sync standby is down
* Standby promotion

In order to confirm that the commit isn't done in case #3 forever
unless a new sync standby comes up, I think we need a framework that
can cancel an executing query.
That is, what I'm planning is:
1. Set up the master server (s_s_names = '2, standby1, standby2')
2. Set up two standby servers
3. Standby1 goes down
4. Create some contents on the master (but the transaction is not committed)
5. Cancel the #4 query. (Also confirm that the flush location of only
standby2 makes progress)

Regards,

--
Masahiko Sawada
On Sun, Jan 17, 2016 at 11:09 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Jan 13, 2016 at 1:54 AM, Alvaro Herrera wrote: > * Confirm value of pg_stat_replication.sync_state (sync, async or potential) > * Confirm that the data is synchronously replicated to multiple > standbys in same cases. > * case 1 : The standby which is not listed in s_s_name, is down > * case 2 : The standby which is listed in s_s_names but potential > standby, is down > * case 3 : The standby which is considered as sync standby, is down. > * Standby promotion > > In order to confirm that the commit isn't done in case #3 forever > unless new sync standby is up, I think we need the framework that > cancels executing query. > That is, what I'm planning is, > 1. Set up master server (s_s_name = '2, standby1, standby2) > 2. Set up two standby servers > 3. Standby1 is down > 4. Create some contents on master (But transaction is not committed) > 5. Cancel the #4 query. (Also confirm that the flush location of only > standby2 makes progress) This will need some thinking and is not as easy as it sounds. There is no way to hold on a connection after executing a query in the current TAP infrastructure. You are just mentioning case 3, but actually cases 1 and 2 are falling into the same need: if there is a failure we want to be able to not be stuck in the test forever and have a way to cancel a query execution at will. TAP uses psql -c to execute any sql queries, but we would need something that is far lower-level, and that would be basically using the perl driver for Postgres or an equivalent here. Honestly for those tests I just thought that we could get to something reliable by just looking at how each sync replication setup reflects in pg_stat_replication as the flow is really getting complicated, giving to the user a clear representation at SQL level of what is actually occurring in the server depending on the configuration used being important here. -- Michael
On Mon, Jan 18, 2016 at 1:20 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Sun, Jan 17, 2016 at 11:09 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Wed, Jan 13, 2016 at 1:54 AM, Alvaro Herrera wrote: >> * Confirm value of pg_stat_replication.sync_state (sync, async or potential) >> * Confirm that the data is synchronously replicated to multiple >> standbys in same cases. >> * case 1 : The standby which is not listed in s_s_name, is down >> * case 2 : The standby which is listed in s_s_names but potential >> standby, is down >> * case 3 : The standby which is considered as sync standby, is down. >> * Standby promotion >> >> In order to confirm that the commit isn't done in case #3 forever >> unless new sync standby is up, I think we need the framework that >> cancels executing query. >> That is, what I'm planning is, >> 1. Set up master server (s_s_name = '2, standby1, standby2) >> 2. Set up two standby servers >> 3. Standby1 is down >> 4. Create some contents on master (But transaction is not committed) >> 5. Cancel the #4 query. (Also confirm that the flush location of only >> standby2 makes progress) > > This will need some thinking and is not as easy as it sounds. There is > no way to hold on a connection after executing a query in the current > TAP infrastructure. You are just mentioning case 3, but actually cases > 1 and 2 are falling into the same need: if there is a failure we want > to be able to not be stuck in the test forever and have a way to > cancel a query execution at will. TAP uses psql -c to execute any sql > queries, but we would need something that is far lower-level, and that > would be basically using the perl driver for Postgres or an equivalent > here. 
>
> Honestly for those tests I just thought that we could get to something
> reliable by just looking at how each sync replication setup reflects
> in pg_stat_replication as the flow is really getting complicated,
> giving to the user a clear representation at SQL level of what is
> actually occurring in the server depending on the configuration used
> being important here.

I see.
We could check the transition of sync_state in pg_stat_replication.
I think that means we would be testing each replication method (the
state transitions) rather than the synchronization of the replication
itself.

What I'm planning to have are:
* Confirm value of pg_stat_replication.sync_state (sync, async or potential)
* Standby promotion
* Standby catching up with the master
And each replication method would have the above tests.

Are these enough?

Regards,

--
Masahiko Sawada
On 3 January 2016 at 13:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Dec 25, 2015 at 7:21 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro >>> <thomas.munro@enterprisedb.com> wrote: >>>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro >>>> <thomas.munro@enterprisedb.com> wrote: >>>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested >>>>> above, then this function could be renamed to SyncRepGetSyncStandbys. >>>>> I think it would be a tiny bit nicer if it also took a Size n argument >>>>> along with the output buffer pointer. >>> >>> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority() >>> function uses synchronous_standby_num which is global variable. >>> But you mean that the number of synchronous standbys is given as >>> function argument? >> >> Yeah, I was thinking of it as the output buffer size which I would be >> inclined to make more explicit (I am still coming to terms with the >> use of global variables in Postgres) but it doesn't matter, please >> disregard that suggestion. >> >>>>> As for the body of that function (which I won't paste here), it >>>>> contains an algorithm to find the top K elements in an array of N >>>>> elements. It does that with a linear search through the top K seen so >>>>> far for each value in the input array, so its worst case is O(KN) >>>>> comparisons. Some of the sorting gurus on this list might have >>>>> something to say about that but my take is that it seems fine for the >>>>> tiny values of K and N that we're dealing with here, and it's nice >>>>> that it doesn't need any space other than the output buffer, unlike >>>>> some other top-K algorithms which would win for larger inputs. >>> >>> Yeah, it's improvement point. 
>>> But I'm assumed that the number of synchronous replication is not >>> large, so I use this algorithm as first version. >>> And I think that its worst case is O(K(N-K)). Am I missing something? >> >> You're right, I was dropping that detail, in the tradition of the >> hand-wavy school of big-O notation. (I suppose you could skip the >> inner loop when the priority is lower than the current lowest >> priority, giving a O(N) best case when the walsenders are perfectly >> ordered by coincidence. Probably a bad idea or just not worth >> worrying about.) > > Thank you for reviewing the patch. > Yeah, I added the logic that skip the inner loop. > >> >>> Attached latest version patch. >> >> +/* >> + * Obtain currently synced LSN location: write and flush, using priority >> - * In 9.1 we support only a single synchronous standby, chosen from a >> - * priority list of synchronous_standby_names. Before it can become the >> + * In 9.6 we support multiple synchronous standby, chosen from a priority >> >> s/standby/standbys/ >> >> + * list of synchronous_standby_names. Before it can become the >> >> s/Before it can become the/Before any standby can become a/ >> >> * synchronous standby it must have caught up with the primary; that may >> * take some time. Once caught up, the current highest priority standby >> >> s/standby/standbys/ >> >> * will release waiters from the queue. >> >> +bool >> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) >> +{ >> + int sync_standbys[synchronous_standby_num]; >> >> I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM]. >> (Variable sized arrays are a feature of C99 and PostgreSQL is written >> in C89.) >> >> +/* >> + * Populate a caller-supplied array which much have enough space for >> + * synchronous_standby_num. Returns position of standbys currently >> + * considered as synchronous, and its length. 
>> + */ >> +int >> +SyncRepGetSyncStandbys(int *sync_standbys) >> >> s/much/must/ (my bad, in previous email). >> >> + ereport(ERROR, >> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), >> + errmsg("The number of synchronous standbys must be smaller than the >> number of listed : %d", >> + synchronous_standby_num))); >> >> How about "the number of synchronous standbys exceeds the length of >> the standby list: %d"? Error messages usually start with lower case, >> ':' is not usually preceded by a space. >> >> + ereport(ERROR, >> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), >> + errmsg("The number of synchronous standbys must be between 1 and %d : %d", >> >> s/The/the/, s/ : /: / > > Fixed you mentioned. > > Attached latest v5 patch. > Please review it. synchronous_standby_num doesn't appear to be a valid GUC name: LOG: unrecognized configuration parameter "synchronous_standby_num" in file "/home/thom/Development/test/primary/postgresql.conf" line 244 All I did was uncomment it and set it to a value. Thom
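As a rough illustration of the configuration format under discussion — a leading integer element in synchronous_standby_names ('2, standby1, standby2') selecting the number of sync standbys, with a plain name list falling back to single-standby behavior — the leading count could be extracted along these lines. This is a sketch with assumed semantics and an assumed function name, not the patch's actual parser (which works on the list produced by SplitIdentifierString):

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Sketch: return the sync-standby count encoded as a leading all-digit
 * element of an s_s_names-style value, or 1 (the backward-compatible
 * "1-priority" behavior) when the value starts with a standby name.
 */
static int
parse_sync_standby_count(const char *names)
{
    const char *p = names;
    const char *start;

    while (*p && isspace((unsigned char) *p))
        p++;
    start = p;
    while (*p && isdigit((unsigned char) *p))
        p++;

    /* A leading all-digit element followed by ',' (or end) is a count. */
    if (p > start)
    {
        const char *q = p;

        while (*q && isspace((unsigned char) *q))
            q++;
        if (*q == ',' || *q == '\0')
            return atoi(start);
    }

    /* Non-number in the leading position: treat as a plain name list. */
    return 1;
}
```

Note that a name such as "2standby" is not all digits, so it falls through to the name-list case — which is exactly the ambiguity behind the question of whether 1-priority mode needs to exist as a separate setting at all.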
On Tue, Jan 19, 2016 at 1:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Mon, Jan 18, 2016 at 1:20 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Sun, Jan 17, 2016 at 11:09 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Wed, Jan 13, 2016 at 1:54 AM, Alvaro Herrera wrote: >>> * Confirm value of pg_stat_replication.sync_state (sync, async or potential) >>> * Confirm that the data is synchronously replicated to multiple >>> standbys in same cases. >>> * case 1 : The standby which is not listed in s_s_name, is down >>> * case 2 : The standby which is listed in s_s_names but potential >>> standby, is down >>> * case 3 : The standby which is considered as sync standby, is down. >>> * Standby promotion >>> >>> In order to confirm that the commit isn't done in case #3 forever >>> unless new sync standby is up, I think we need the framework that >>> cancels executing query. >>> That is, what I'm planning is, >>> 1. Set up master server (s_s_name = '2, standby1, standby2) >>> 2. Set up two standby servers >>> 3. Standby1 is down >>> 4. Create some contents on master (But transaction is not committed) >>> 5. Cancel the #4 query. (Also confirm that the flush location of only >>> standby2 makes progress) >> >> This will need some thinking and is not as easy as it sounds. There is >> no way to hold on a connection after executing a query in the current >> TAP infrastructure. You are just mentioning case 3, but actually cases >> 1 and 2 are falling into the same need: if there is a failure we want >> to be able to not be stuck in the test forever and have a way to >> cancel a query execution at will. TAP uses psql -c to execute any sql >> queries, but we would need something that is far lower-level, and that >> would be basically using the perl driver for Postgres or an equivalent >> here. 
>>
>> Honestly for those tests I just thought that we could get to something
>> reliable by just looking at how each sync replication setup reflects
>> in pg_stat_replication as the flow is really getting complicated,
>> giving to the user a clear representation at SQL level of what is
>> actually occurring in the server depending on the configuration used
>> being important here.
>
> I see.
> We could check the transition of sync_state in pg_stat_replication.
> I think it means that it tests for each replication method (switching
> state) rather than synchronization of replication.
>
> What I'm planning to have are,
> * Confirm value of pg_stat_replication.sync_state (sync, async or potential)
> * Standby promotion
> * Standby catching up master
> And each replication method has above tests.
>
> Are these enough?

Does promoting the standby and checking that it caught up really have
value in the context of this patch? What we just want to know is, on a
master, which nodes need to be waited for when s_s_names or any other
method is used, no?
--
Michael
On Tue, Jan 19, 2016 at 1:52 AM, Thom Brown <thom@linux.com> wrote: > On 3 January 2016 at 13:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Fri, Dec 25, 2015 at 7:21 AM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >>> On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro >>>> <thomas.munro@enterprisedb.com> wrote: >>>>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro >>>>> <thomas.munro@enterprisedb.com> wrote: >>>>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested >>>>>> above, then this function could be renamed to SyncRepGetSyncStandbys. >>>>>> I think it would be a tiny bit nicer if it also took a Size n argument >>>>>> along with the output buffer pointer. >>>> >>>> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority() >>>> function uses synchronous_standby_num which is global variable. >>>> But you mean that the number of synchronous standbys is given as >>>> function argument? >>> >>> Yeah, I was thinking of it as the output buffer size which I would be >>> inclined to make more explicit (I am still coming to terms with the >>> use of global variables in Postgres) but it doesn't matter, please >>> disregard that suggestion. >>> >>>>>> As for the body of that function (which I won't paste here), it >>>>>> contains an algorithm to find the top K elements in an array of N >>>>>> elements. It does that with a linear search through the top K seen so >>>>>> far for each value in the input array, so its worst case is O(KN) >>>>>> comparisons. Some of the sorting gurus on this list might have >>>>>> something to say about that but my take is that it seems fine for the >>>>>> tiny values of K and N that we're dealing with here, and it's nice >>>>>> that it doesn't need any space other than the output buffer, unlike >>>>>> some other top-K algorithms which would win for larger inputs. >>>> >>>> Yeah, it's improvement point. 
>>>> But I'm assumed that the number of synchronous replication is not >>>> large, so I use this algorithm as first version. >>>> And I think that its worst case is O(K(N-K)). Am I missing something? >>> >>> You're right, I was dropping that detail, in the tradition of the >>> hand-wavy school of big-O notation. (I suppose you could skip the >>> inner loop when the priority is lower than the current lowest >>> priority, giving a O(N) best case when the walsenders are perfectly >>> ordered by coincidence. Probably a bad idea or just not worth >>> worrying about.) >> >> Thank you for reviewing the patch. >> Yeah, I added the logic that skip the inner loop. >> >>> >>>> Attached latest version patch. >>> >>> +/* >>> + * Obtain currently synced LSN location: write and flush, using priority >>> - * In 9.1 we support only a single synchronous standby, chosen from a >>> - * priority list of synchronous_standby_names. Before it can become the >>> + * In 9.6 we support multiple synchronous standby, chosen from a priority >>> >>> s/standby/standbys/ >>> >>> + * list of synchronous_standby_names. Before it can become the >>> >>> s/Before it can become the/Before any standby can become a/ >>> >>> * synchronous standby it must have caught up with the primary; that may >>> * take some time. Once caught up, the current highest priority standby >>> >>> s/standby/standbys/ >>> >>> * will release waiters from the queue. >>> >>> +bool >>> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) >>> +{ >>> + int sync_standbys[synchronous_standby_num]; >>> >>> I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM]. >>> (Variable sized arrays are a feature of C99 and PostgreSQL is written >>> in C89.) >>> >>> +/* >>> + * Populate a caller-supplied array which much have enough space for >>> + * synchronous_standby_num. Returns position of standbys currently >>> + * considered as synchronous, and its length. 
>>> + */ >>> +int >>> +SyncRepGetSyncStandbys(int *sync_standbys) >>> >>> s/much/must/ (my bad, in previous email). >>> >>> + ereport(ERROR, >>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), >>> + errmsg("The number of synchronous standbys must be smaller than the >>> number of listed : %d", >>> + synchronous_standby_num))); >>> >>> How about "the number of synchronous standbys exceeds the length of >>> the standby list: %d"? Error messages usually start with lower case, >>> ':' is not usually preceded by a space. >>> >>> + ereport(ERROR, >>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), >>> + errmsg("The number of synchronous standbys must be between 1 and %d : %d", >>> >>> s/The/the/, s/ : /: / >> >> Fixed you mentioned. >> >> Attached latest v5 patch. >> Please review it. > > synchronous_standby_num doesn't appear to be a valid GUC name: > > LOG: unrecognized configuration parameter "synchronous_standby_num" > in file "/home/thom/Development/test/primary/postgresql.conf" line 244 > > All I did was uncomment it and set it to a value. > Thank you for having a look at it. Yeah, synchronous_standby_num should not exist in postgresql.conf. Please test multiple sync replication with the latest patch. Regards, -- Masahiko Sawada
Attachment
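The top-K selection reviewed above (linear scan with worst case O(K(N-K)), plus the added optimization of skipping the inner work when a candidate cannot beat the current worst entry) can be sketched as follows. Python is used only for brevity; the real code walks the walsender shared-memory array in C:

```python
# Sketch of the top-K selection debated in the review above: scan the
# priority array once, maintaining the K best (lowest) priorities seen
# so far, and skip the replacement work when the candidate cannot beat
# the current worst kept entry. Illustration only.

def top_k_standbys(priorities, k):
    """Return indices of the k entries with the smallest priority values."""
    best = []                      # indices of the current top-k, unordered
    worst = None                   # largest priority among the kept entries
    for i, prio in enumerate(priorities):
        if len(best) < k:
            best.append(i)
            worst = max(priorities[j] for j in best)
        elif prio < worst:
            # replace the current worst entry with this better one
            w = max(best, key=lambda j: priorities[j])
            best[best.index(w)] = i
            worst = max(priorities[j] for j in best)
        # else: skip entirely -- the O(N) best case mentioned above
    return sorted(best)

print(top_k_standbys([3, 1, 4, 1, 5, 2], 2))  # -> [1, 3]
```

For the tiny K and N involved (a handful of walsenders), this simple scan needs no extra space beyond the output buffer, which is the trade-off discussed in the review.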
On Tue, Jan 19, 2016 at 2:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Tue, Jan 19, 2016 at 1:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Mon, Jan 18, 2016 at 1:20 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> On Sun, Jan 17, 2016 at 11:09 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Wed, Jan 13, 2016 at 1:54 AM, Alvaro Herrera wrote: >>>> * Confirm value of pg_stat_replication.sync_state (sync, async or potential) >>>> * Confirm that the data is synchronously replicated to multiple >>>> standbys in same cases. >>>> * case 1 : The standby which is not listed in s_s_name, is down >>>> * case 2 : The standby which is listed in s_s_names but potential >>>> standby, is down >>>> * case 3 : The standby which is considered as sync standby, is down. >>>> * Standby promotion >>>> >>>> In order to confirm that the commit isn't done in case #3 forever >>>> unless new sync standby is up, I think we need the framework that >>>> cancels executing query. >>>> That is, what I'm planning is, >>>> 1. Set up master server (s_s_name = '2, standby1, standby2) >>>> 2. Set up two standby servers >>>> 3. Standby1 is down >>>> 4. Create some contents on master (But transaction is not committed) >>>> 5. Cancel the #4 query. (Also confirm that the flush location of only >>>> standby2 makes progress) >>> >>> This will need some thinking and is not as easy as it sounds. There is >>> no way to hold on a connection after executing a query in the current >>> TAP infrastructure. You are just mentioning case 3, but actually cases >>> 1 and 2 are falling into the same need: if there is a failure we want >>> to be able to not be stuck in the test forever and have a way to >>> cancel a query execution at will. TAP uses psql -c to execute any sql >>> queries, but we would need something that is far lower-level, and that >>> would be basically using the perl driver for Postgres or an equivalent >>> here. 
>>> >>> Honestly for those tests I just thought that we could get to something >>> reliable by just looking at how each sync replication setup reflects >>> in pg_stat_replication as the flow is really getting complicated, >>> giving to the user a clear representation at SQL level of what is >>> actually occurring in the server depending on the configuration used >>> being important here. >> >> I see. >> We could check the transition of sync_state in pg_stat_replication. >> I think it means that it tests for each replication method (switching >> state) rather than synchronization of replication. >> >> What I'm planning to have are, >> * Confirm value of pg_stat_replication.sync_state (sync, async or potential) >> * Standby promotion >> * Standby catching up master >> And each replication method has above tests. >> >> Are these enough? > > Does promoting the standby and checking that it caught really have > value in this context of this patch? What we just want to know is on a > master, which nodes need to be waited for when s_s_names or any other > method is used, no? Yeah, those two tests are outside the context of this patch. If the test framework had a facility that allowed us to execute a query (psql) as a separate process, we could use the pg_cancel_backend() function to cancel the waiting process when the master server is waiting for standbys. In order to check whether the master server would wait for the standby or not, the test framework needs such a facility, I think. Regards, -- Masahiko Sawada
On Wed, Jan 20, 2016 at 2:35 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Tue, Jan 19, 2016 at 1:52 AM, Thom Brown <thom@linux.com> wrote: >> On 3 January 2016 at 13:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Fri, Dec 25, 2015 at 7:21 AM, Thomas Munro >>> <thomas.munro@enterprisedb.com> wrote: >>>> On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro >>>>> <thomas.munro@enterprisedb.com> wrote: >>>>>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro >>>>>> <thomas.munro@enterprisedb.com> wrote: >>>>>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested >>>>>>> above, then this function could be renamed to SyncRepGetSyncStandbys. >>>>>>> I think it would be a tiny bit nicer if it also took a Size n argument >>>>>>> along with the output buffer pointer. >>>>> >>>>> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority() >>>>> function uses synchronous_standby_num which is global variable. >>>>> But you mean that the number of synchronous standbys is given as >>>>> function argument? >>>> >>>> Yeah, I was thinking of it as the output buffer size which I would be >>>> inclined to make more explicit (I am still coming to terms with the >>>> use of global variables in Postgres) but it doesn't matter, please >>>> disregard that suggestion. >>>> >>>>>>> As for the body of that function (which I won't paste here), it >>>>>>> contains an algorithm to find the top K elements in an array of N >>>>>>> elements. It does that with a linear search through the top K seen so >>>>>>> far for each value in the input array, so its worst case is O(KN) >>>>>>> comparisons. 
Some of the sorting gurus on this list might have >>>>>>> something to say about that but my take is that it seems fine for the >>>>>>> tiny values of K and N that we're dealing with here, and it's nice >>>>>>> that it doesn't need any space other than the output buffer, unlike >>>>>>> some other top-K algorithms which would win for larger inputs. >>>>> >>>>> Yeah, it's improvement point. >>>>> But I'm assumed that the number of synchronous replication is not >>>>> large, so I use this algorithm as first version. >>>>> And I think that its worst case is O(K(N-K)). Am I missing something? >>>> >>>> You're right, I was dropping that detail, in the tradition of the >>>> hand-wavy school of big-O notation. (I suppose you could skip the >>>> inner loop when the priority is lower than the current lowest >>>> priority, giving a O(N) best case when the walsenders are perfectly >>>> ordered by coincidence. Probably a bad idea or just not worth >>>> worrying about.) >>> >>> Thank you for reviewing the patch. >>> Yeah, I added the logic that skip the inner loop. >>> >>>> >>>>> Attached latest version patch. >>>> >>>> +/* >>>> + * Obtain currently synced LSN location: write and flush, using priority >>>> - * In 9.1 we support only a single synchronous standby, chosen from a >>>> - * priority list of synchronous_standby_names. Before it can become the >>>> + * In 9.6 we support multiple synchronous standby, chosen from a priority >>>> >>>> s/standby/standbys/ >>>> >>>> + * list of synchronous_standby_names. Before it can become the >>>> >>>> s/Before it can become the/Before any standby can become a/ >>>> >>>> * synchronous standby it must have caught up with the primary; that may >>>> * take some time. Once caught up, the current highest priority standby >>>> >>>> s/standby/standbys/ >>>> >>>> * will release waiters from the queue. 
>>>> >>>> +bool >>>> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) >>>> +{ >>>> + int sync_standbys[synchronous_standby_num]; >>>> >>>> I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM]. >>>> (Variable sized arrays are a feature of C99 and PostgreSQL is written >>>> in C89.) >>>> >>>> +/* >>>> + * Populate a caller-supplied array which much have enough space for >>>> + * synchronous_standby_num. Returns position of standbys currently >>>> + * considered as synchronous, and its length. >>>> + */ >>>> +int >>>> +SyncRepGetSyncStandbys(int *sync_standbys) >>>> >>>> s/much/must/ (my bad, in previous email). >>>> >>>> + ereport(ERROR, >>>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), >>>> + errmsg("The number of synchronous standbys must be smaller than the >>>> number of listed : %d", >>>> + synchronous_standby_num))); >>>> >>>> How about "the number of synchronous standbys exceeds the length of >>>> the standby list: %d"? Error messages usually start with lower case, >>>> ':' is not usually preceded by a space. >>>> >>>> + ereport(ERROR, >>>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), >>>> + errmsg("The number of synchronous standbys must be between 1 and %d : %d", >>>> >>>> s/The/the/, s/ : /: / >>> >>> Fixed you mentioned. >>> >>> Attached latest v5 patch. >>> Please review it. >> >> synchronous_standby_num doesn't appear to be a valid GUC name: >> >> LOG: unrecognized configuration parameter "synchronous_standby_num" >> in file "/home/thom/Development/test/primary/postgresql.conf" line 244 >> >> All I did was uncomment it and set it to a value. >> > > Thank you for having a look it. > > Yeah, synchronous_standby_num should not exists in postgresql.conf. > Please test for multiple sync replication with latest patch. In synchronous_replication_method = 'priority' case, when I set synchronous_standby_names to invalid value like 'hoge,foo' and reloaded the configuration file, the server crashed with the following error. 
This crash should not happen.

FATAL: invalid input syntax for integer: "hoge"

+ /*
+ * After read all synchronous replication configuration parameter, we apply
+ * settings according to replication method.
+ */
+ ProcessSynchronousReplicationConfig();

Why does the above function need to be called in ProcessConfigFile(), i.e., by every postgres process? I was thinking that only the walsender should call that to check which walsender is synchronous according to the setting.

When synchronous_replication_method = '1-priority' and synchronous_standby_names = '*', I started one synchronous standby. Then, when I ran "SELECT * FROM pg_stat_replication", I got the following WARNING message:

WARNING: detected write past chunk end in ExprContext 0x2acb3c0

I don't think that it's a good design to specify the number of sync replicas to wait for in synchronous_standby_names. It's confusing for the users. It's better to add a separate parameter (synchronous_standby_num) for specifying that number, although that increases the number of GUC parameters.

Are we really planning to implement synchronous_replication_method=quorum in the first version? If not, I'd like to remove the s_r_method parameter because it's meaningless. We can add it later when we implement "quorum".

Regards,

-- Fujii Masao
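The crash on 'hoge,foo' reported above comes from converting the leading list element to an integer without validation. The kind of graceful check a GUC check hook would need can be sketched as follows (hypothetical helper name, Python used only for illustration; a real check hook would report the problem via GUC error reporting instead of crashing):

```python
# Sketch of the validation the crash report above calls for: reject a
# malformed synchronous_standby_names value like 'hoge,foo' instead of
# dying on the integer conversion. Illustration only.

def check_synchronous_standby_names(value):
    """Return (ok, num_sync, names); never raises on bad input."""
    parts = [p.strip() for p in value.split(',')]
    if len(parts) < 2:
        return (False, 0, [])      # need a count plus at least one name
    if not parts[0].isdigit():
        return (False, 0, [])      # leading element must be an integer
    num_sync = int(parts[0])
    names = parts[1:]
    if not 1 <= num_sync <= len(names):
        return (False, 0, [])      # count must fit the listed standbys
    return (True, num_sync, names)

print(check_synchronous_standby_names('hoge,foo'))          # rejected, no crash
print(check_synchronous_standby_names('2, standby1, standby2'))
```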
On Thu, Jan 28, 2016 at 8:05 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Jan 20, 2016 at 2:35 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Tue, Jan 19, 2016 at 1:52 AM, Thom Brown <thom@linux.com> wrote: >>> On 3 January 2016 at 13:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Fri, Dec 25, 2015 at 7:21 AM, Thomas Munro >>>> <thomas.munro@enterprisedb.com> wrote: >>>>> On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>>> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro >>>>>> <thomas.munro@enterprisedb.com> wrote: >>>>>>> On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro >>>>>>> <thomas.munro@enterprisedb.com> wrote: >>>>>>>> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested >>>>>>>> above, then this function could be renamed to SyncRepGetSyncStandbys. >>>>>>>> I think it would be a tiny bit nicer if it also took a Size n argument >>>>>>>> along with the output buffer pointer. >>>>>> >>>>>> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority() >>>>>> function uses synchronous_standby_num which is global variable. >>>>>> But you mean that the number of synchronous standbys is given as >>>>>> function argument? >>>>> >>>>> Yeah, I was thinking of it as the output buffer size which I would be >>>>> inclined to make more explicit (I am still coming to terms with the >>>>> use of global variables in Postgres) but it doesn't matter, please >>>>> disregard that suggestion. >>>>> >>>>>>>> As for the body of that function (which I won't paste here), it >>>>>>>> contains an algorithm to find the top K elements in an array of N >>>>>>>> elements. It does that with a linear search through the top K seen so >>>>>>>> far for each value in the input array, so its worst case is O(KN) >>>>>>>> comparisons. 
Some of the sorting gurus on this list might have >>>>>>>> something to say about that but my take is that it seems fine for the >>>>>>>> tiny values of K and N that we're dealing with here, and it's nice >>>>>>>> that it doesn't need any space other than the output buffer, unlike >>>>>>>> some other top-K algorithms which would win for larger inputs. >>>>>> >>>>>> Yeah, it's improvement point. >>>>>> But I'm assumed that the number of synchronous replication is not >>>>>> large, so I use this algorithm as first version. >>>>>> And I think that its worst case is O(K(N-K)). Am I missing something? >>>>> >>>>> You're right, I was dropping that detail, in the tradition of the >>>>> hand-wavy school of big-O notation. (I suppose you could skip the >>>>> inner loop when the priority is lower than the current lowest >>>>> priority, giving a O(N) best case when the walsenders are perfectly >>>>> ordered by coincidence. Probably a bad idea or just not worth >>>>> worrying about.) >>>> >>>> Thank you for reviewing the patch. >>>> Yeah, I added the logic that skip the inner loop. >>>> >>>>> >>>>>> Attached latest version patch. >>>>> >>>>> +/* >>>>> + * Obtain currently synced LSN location: write and flush, using priority >>>>> - * In 9.1 we support only a single synchronous standby, chosen from a >>>>> - * priority list of synchronous_standby_names. Before it can become the >>>>> + * In 9.6 we support multiple synchronous standby, chosen from a priority >>>>> >>>>> s/standby/standbys/ >>>>> >>>>> + * list of synchronous_standby_names. Before it can become the >>>>> >>>>> s/Before it can become the/Before any standby can become a/ >>>>> >>>>> * synchronous standby it must have caught up with the primary; that may >>>>> * take some time. Once caught up, the current highest priority standby >>>>> >>>>> s/standby/standbys/ >>>>> >>>>> * will release waiters from the queue. 
>>>>> >>>>> +bool >>>>> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) >>>>> +{ >>>>> + int sync_standbys[synchronous_standby_num]; >>>>> >>>>> I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM]. >>>>> (Variable sized arrays are a feature of C99 and PostgreSQL is written >>>>> in C89.) >>>>> >>>>> +/* >>>>> + * Populate a caller-supplied array which much have enough space for >>>>> + * synchronous_standby_num. Returns position of standbys currently >>>>> + * considered as synchronous, and its length. >>>>> + */ >>>>> +int >>>>> +SyncRepGetSyncStandbys(int *sync_standbys) >>>>> >>>>> s/much/must/ (my bad, in previous email). >>>>> >>>>> + ereport(ERROR, >>>>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), >>>>> + errmsg("The number of synchronous standbys must be smaller than the >>>>> number of listed : %d", >>>>> + synchronous_standby_num))); >>>>> >>>>> How about "the number of synchronous standbys exceeds the length of >>>>> the standby list: %d"? Error messages usually start with lower case, >>>>> ':' is not usually preceded by a space. >>>>> >>>>> + ereport(ERROR, >>>>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), >>>>> + errmsg("The number of synchronous standbys must be between 1 and %d : %d", >>>>> >>>>> s/The/the/, s/ : /: / >>>> >>>> Fixed you mentioned. >>>> >>>> Attached latest v5 patch. >>>> Please review it. >>> >>> synchronous_standby_num doesn't appear to be a valid GUC name: >>> >>> LOG: unrecognized configuration parameter "synchronous_standby_num" >>> in file "/home/thom/Development/test/primary/postgresql.conf" line 244 >>> >>> All I did was uncomment it and set it to a value. >>> >> >> Thank you for having a look it. >> >> Yeah, synchronous_standby_num should not exists in postgresql.conf. >> Please test for multiple sync replication with latest patch. 
> > In synchronous_replication_method = 'priority' case, when I set > synchronous_standby_names to invalid value like 'hoge,foo' and > reloaded the configuration file, the server crashed with > the following error. This crash should not happen. > > FATAL: invalid input syntax for integer: "hoge" > > + /* > + * After read all synchronous replication configuration parameter, we apply > + * settings according to replication method. > + */ > + ProcessSynchronousReplicationConfig(); > > Why does the above function need to be called in ProcessConfigFile(), i.e., > by every postgres processes? I was thinking that only walsender should > call that to check which walsender is synchronous according to the setting. > > When synchronous_replication_method = '1-priority' and > synchronous_standby_names = '*', I started one synchronous standby. > Then, when I ran "SELECT * FROM pg_stat_replication", I got the > following WARNING message. > > WARNING: detected write past chunk end in ExprContext 0x2acb3c0 > > I don't think that it's good design to specify the number of sync replicas > to wait for, in synchronous_standby_names. It's confusing for the users. > It's better to add separate parameter (synchronous_standby_num) for > specifying that number. Which increases the number of GUC parameters, > though. > > Are we really planning to implement synchronous_replication_method=quorum > at the first version? If not, I'd like to remove s_r_method parameter > because it's meaningless. We can add it later when we implement "quorum". Thank you for your comment. From the discussions so far, I'm planning to have several replication methods such as 'quorum' and 'complex' in the future, and each replication method specifies the syntax of s_s_names. It means that s_s_names could have the number of sync standbys like the current patch does. If we have an additional GUC like synchronous_standby_num then it will look odd, I think. 
Even if we don't have the 'quorum' method in the first version, the syntax of s_s_names is completely different between 'priority' and '1-priority'. So we will need to have a new GUC parameter like s_r_method in order to specify the syntax of s_s_names, I think. Regards, -- Masahiko Sawada
On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote: > By the discussions so far, I'm planning to have several replication > methods such as 'quorum', 'complex' in the feature, and the each > replication method specifies the syntax of s_s_names. > It means that s_s_names could have the number of sync standbys like > what current patch does. What if the application_name of a standby node has the format of an integer? -- Michael
On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote: >> By the discussions so far, I'm planning to have several replication >> methods such as 'quorum', 'complex' in the feature, and the each >> replication method specifies the syntax of s_s_names. >> It means that s_s_names could have the number of sync standbys like >> what current patch does. > > What if the application_name of a standby node has the format of an integer? Even if the standby has an integer as application_name, we can set s_s_names like '2,1,2,3'. The leading '2' is always handled as the number of sync standbys when s_r_method = 'priority'. Regards, -- Masahiko Sawada
On Sun, Jan 31, 2016 at 5:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote: >>> By the discussions so far, I'm planning to have several replication >>> methods such as 'quorum', 'complex' in the feature, and the each >>> replication method specifies the syntax of s_s_names. >>> It means that s_s_names could have the number of sync standbys like >>> what current patch does. >> >> What if the application_name of a standby node has the format of an integer? > > Even if the standby has an integer as application_name, we can set > s_s_names like '2,1,2,3'. > The leading '2' is always handled as the number of sync standbys when > s_r_method = 'priority'. Hm. I agree with Fujii-san here, having the number of sync standbys defined in a parameter that should have a list of names is a bit confusing. I'd rather have a separate GUC, which brings us back to one of the first patches that I came up with, and a couple of people, including Josh were not happy with that because this did not support real quorum. Perhaps the final answer would be really to get a set of hooks, and a contrib module making use of that. -- Michael
On Sun, Jan 31, 2016 at 5:18 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Sun, Jan 31, 2016 at 5:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote: >>>> By the discussions so far, I'm planning to have several replication >>>> methods such as 'quorum', 'complex' in the feature, and the each >>>> replication method specifies the syntax of s_s_names. >>>> It means that s_s_names could have the number of sync standbys like >>>> what current patch does. >>> >>> What if the application_name of a standby node has the format of an integer? >> >> Even if the standby has an integer as application_name, we can set >> s_s_names like '2,1,2,3'. >> The leading '2' is always handled as the number of sync standbys when >> s_r_method = 'priority'. > > Hm. I agree with Fujii-san here, having the number of sync standbys > defined in a parameter that should have a list of names is a bit > confusing. I'd rather have a separate GUC, which brings us back to one > of the first patches that I came up with, and a couple of people, > including Josh were not happy with that because this did not support > real quorum. Perhaps the final answer would be really to get a set of > hooks, and a contrib module making use of that. Yeah, I agree with having a set of hooks, with postgres core having a simple multi-sync replication mechanism like you suggested in the first version. Regards, -- Masahiko Sawada
On Sun, Jan 31, 2016 at 5:28 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sun, Jan 31, 2016 at 5:18 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Sun, Jan 31, 2016 at 5:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier >>> <michael.paquier@gmail.com> wrote: >>>> On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote: >>>>> By the discussions so far, I'm planning to have several replication >>>>> methods such as 'quorum', 'complex' in the feature, and the each >>>>> replication method specifies the syntax of s_s_names. >>>>> It means that s_s_names could have the number of sync standbys like >>>>> what current patch does. >>>> >>>> What if the application_name of a standby node has the format of an integer? >>> >>> Even if the standby has an integer as application_name, we can set >>> s_s_names like '2,1,2,3'. >>> The leading '2' is always handled as the number of sync standbys when >>> s_r_method = 'priority'. >> >> Hm. I agree with Fujii-san here, having the number of sync standbys >> defined in a parameter that should have a list of names is a bit >> confusing. I'd rather have a separate GUC, which brings us back to one >> of the first patches that I came up with, and a couple of people, >> including Josh were not happy with that because this did not support >> real quorum. Perhaps the final answer would be really to get a set of >> hooks, and a contrib module making use of that. > > Yeah, I agree with having set of hooks, and postgres core has simple > multi sync replication mechanism like you suggested at first version. If there are hooks, I don't think that we should really bother about having in core anything more complicated than what we have now. The trick will be to come up with a hook design modular enough to support the kind of configurations mentioned on this thread. 
Roughly, perhaps a refactoring of the syncrep code so that it is possible to wait for multiple targets, some of them being optional; one modular way in pg_stat_get_wal_senders to represent the status of a node to the user; and another hook to decide which nodes to wait for. Some of the nodes being waited for may be based on conditions for quorum support. That's a hard problem to solve in a flexible enough way. -- Michael
On Sun, Jan 31, 2016 at 8:58 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Sun, Jan 31, 2016 at 5:28 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Sun, Jan 31, 2016 at 5:18 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> On Sun, Jan 31, 2016 at 5:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier >>>> <michael.paquier@gmail.com> wrote: >>>>> On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote: >>>>>> By the discussions so far, I'm planning to have several replication >>>>>> methods such as 'quorum', 'complex' in the feature, and the each >>>>>> replication method specifies the syntax of s_s_names. >>>>>> It means that s_s_names could have the number of sync standbys like >>>>>> what current patch does. >>>>> >>>>> What if the application_name of a standby node has the format of an integer? >>>> >>>> Even if the standby has an integer as application_name, we can set >>>> s_s_names like '2,1,2,3'. >>>> The leading '2' is always handled as the number of sync standbys when >>>> s_r_method = 'priority'. >>> >>> Hm. I agree with Fujii-san here, having the number of sync standbys >>> defined in a parameter that should have a list of names is a bit >>> confusing. I'd rather have a separate GUC, which brings us back to one >>> of the first patches that I came up with, and a couple of people, >>> including Josh were not happy with that because this did not support >>> real quorum. Perhaps the final answer would be really to get a set of >>> hooks, and a contrib module making use of that. >> >> Yeah, I agree with having set of hooks, and postgres core has simple >> multi sync replication mechanism like you suggested at first version. > > If there are hooks, I don't think that we should really bother about > having in core anything more complicated than what we have now. 
The > trick will be to come up with a hook design modular enough to support > the kind of configurations mentioned on this thread. Roughly perhaps a > refactoring of the syncrep code so as it is possible to wait for > multiple targets some of them being optional,, one modular way in > pg_stat_get_wal_senders to represent the status of a node to user, and > another hook to return to decide which are the nodes to wait for. Some > of the nodes being waited for may be based on conditions for quorum > support. That's a hard problem to do that in a flexible enough way. Hm, I think non-nested quorum and priority are not complicated, and we should support at least one or both of those simple methods in core PostgreSQL. More complicated methods, like a JSON-style syntax or a dedicated language, could be supported by an external module. Regards, -- Masahiko Sawada
On Mon, Feb 1, 2016 at 5:36 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sun, Jan 31, 2016 at 8:58 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Sun, Jan 31, 2016 at 5:28 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Sun, Jan 31, 2016 at 5:18 PM, Michael Paquier >>> <michael.paquier@gmail.com> wrote: >>>> On Sun, Jan 31, 2016 at 5:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier >>>>> <michael.paquier@gmail.com> wrote: >>>>>> On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote: >>>>>>> By the discussions so far, I'm planning to have several replication >>>>>>> methods such as 'quorum', 'complex' in the feature, and the each >>>>>>> replication method specifies the syntax of s_s_names. >>>>>>> It means that s_s_names could have the number of sync standbys like >>>>>>> what current patch does. >>>>>> >>>>>> What if the application_name of a standby node has the format of an integer? >>>>> >>>>> Even if the standby has an integer as application_name, we can set >>>>> s_s_names like '2,1,2,3'. >>>>> The leading '2' is always handled as the number of sync standbys when >>>>> s_r_method = 'priority'. >>>> >>>> Hm. I agree with Fujii-san here, having the number of sync standbys >>>> defined in a parameter that should have a list of names is a bit >>>> confusing. I'd rather have a separate GUC, which brings us back to one >>>> of the first patches that I came up with, and a couple of people, >>>> including Josh were not happy with that because this did not support >>>> real quorum. Perhaps the final answer would be really to get a set of >>>> hooks, and a contrib module making use of that. >>> >>> Yeah, I agree with having set of hooks, and postgres core has simple >>> multi sync replication mechanism like you suggested at first version. 
>> >> If there are hooks, I don't think that we should really bother about >> having in core anything more complicated than what we have now. The >> trick will be to come up with a hook design modular enough to support >> the kind of configurations mentioned on this thread. Roughly perhaps a >> refactoring of the syncrep code so as it is possible to wait for >> multiple targets some of them being optional,, one modular way in >> pg_stat_get_wal_senders to represent the status of a node to user, and >> another hook to return to decide which are the nodes to wait for. Some >> of the nodes being waited for may be based on conditions for quorum >> support. That's a hard problem to do that in a flexible enough way. > > Hm, I think not-nested quorum and priority are not complicated, and we > should support at least both or either simple method in core of > postgres. > More complicated method like using json-style, or dedicated language > would be supported by external module. So what about the following plan? [first version] Add only synchronous_standby_num which specifies the number of standbys that the master must wait for before marking sync replication as completed. This version supports simple use cases like "I want to have two synchronous standbys". [second version] Add synchronous_replication_method: 'prioriry' and 'quorum'. This version additionally supports simple quorum commit case like "I want to ensure that WAL is replicated synchronously to at least two standbys from five ones listed in s_s_names". Or Add something like quorum_replication_num and quorum_standby_names, i.e., the master must wait for at least q_r_num standbys from ones listed in q_s_names before marking sync replication as completed. Also the master must wait for sync replication according to s_s_num and s_s_num. That is, this approach separates 'priority' and 'quorum' to each parameters. 
This increases the number of GUC parameters, but ISTM less confusing, and it supports a bit complicated case like "there is one local standby and three remote standbys, then I want to ensure that WAL is replicated synchronously to the local standby and at least two remote one", e.g., s_s_num = 1, s_s_names = 'local' q_s_num = 2, q_s_names = 'remote1, remote2, remote3' [third version] Add the hooks for more complicated sync replication cases. I'm thinking that the realistic target for 9.6 might be the first one. Regards, -- Fujii Masao
Now, I would not mind jumping straight into the third case if we are fine with doing nothing for this release, but that requires a lot of design and background work, so it's not plausible for 9.6. Of course, if there are objections to the scenario proposed by Fujii-san, feel free to speak up.
--
On Mon, Feb 1, 2016 at 11:28 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Feb 1, 2016 at 5:36 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Sun, Jan 31, 2016 at 8:58 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> On Sun, Jan 31, 2016 at 5:28 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Sun, Jan 31, 2016 at 5:18 PM, Michael Paquier >>>> <michael.paquier@gmail.com> wrote: >>>>> On Sun, Jan 31, 2016 at 5:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>>> On Sun, Jan 31, 2016 at 1:17 PM, Michael Paquier >>>>>> <michael.paquier@gmail.com> wrote: >>>>>>> On Thu, Jan 28, 2016 at 10:10 PM, Masahiko Sawada wrote: >>>>>>>> By the discussions so far, I'm planning to have several replication >>>>>>>> methods such as 'quorum', 'complex' in the feature, and the each >>>>>>>> replication method specifies the syntax of s_s_names. >>>>>>>> It means that s_s_names could have the number of sync standbys like >>>>>>>> what current patch does. >>>>>>> >>>>>>> What if the application_name of a standby node has the format of an integer? >>>>>> >>>>>> Even if the standby has an integer as application_name, we can set >>>>>> s_s_names like '2,1,2,3'. >>>>>> The leading '2' is always handled as the number of sync standbys when >>>>>> s_r_method = 'priority'. >>>>> >>>>> Hm. I agree with Fujii-san here, having the number of sync standbys >>>>> defined in a parameter that should have a list of names is a bit >>>>> confusing. I'd rather have a separate GUC, which brings us back to one >>>>> of the first patches that I came up with, and a couple of people, >>>>> including Josh were not happy with that because this did not support >>>>> real quorum. Perhaps the final answer would be really to get a set of >>>>> hooks, and a contrib module making use of that. >>>> >>>> Yeah, I agree with having set of hooks, and postgres core has simple >>>> multi sync replication mechanism like you suggested at first version. 
>>> >>> If there are hooks, I don't think that we should really bother about >>> having in core anything more complicated than what we have now. The >>> trick will be to come up with a hook design modular enough to support >>> the kind of configurations mentioned on this thread. Roughly perhaps a >>> refactoring of the syncrep code so as it is possible to wait for >>> multiple targets some of them being optional,, one modular way in >>> pg_stat_get_wal_senders to represent the status of a node to user, and >>> another hook to return to decide which are the nodes to wait for. Some >>> of the nodes being waited for may be based on conditions for quorum >>> support. That's a hard problem to do that in a flexible enough way. >> >> Hm, I think not-nested quorum and priority are not complicated, and we >> should support at least both or either simple method in core of >> postgres. >> More complicated method like using json-style, or dedicated language >> would be supported by external module. > > So what about the following plan? > > [first version] > Add only synchronous_standby_num which specifies the number of standbys > that the master must wait for before marking sync replication as completed. > This version supports simple use cases like "I want to have two synchronous > standbys". > > [second version] > Add synchronous_replication_method: 'prioriry' and 'quorum'. This version > additionally supports simple quorum commit case like "I want to ensure > that WAL is replicated synchronously to at least two standbys from five > ones listed in s_s_names". > > Or > > Add something like quorum_replication_num and quorum_standby_names, i.e., > the master must wait for at least q_r_num standbys from ones listed in > q_s_names before marking sync replication as completed. Also the master > must wait for sync replication according to s_s_num and s_s_num. > That is, this approach separates 'priority' and 'quorum' to each parameters. 
> This increases the number of GUC parameters, but ISTM less confusing, and > it supports a bit complicated case like "there is one local standby and three > remote standbys, then I want to ensure that WAL is replicated synchronously > to the local standby and at least two remote one", e.g., > > s_s_num = 1, s_s_names = 'local' > q_s_num = 2, q_s_names = 'remote1, remote2, remote3' > > [third version] > Add the hooks for more complicated sync replication cases. > > I'm thinking that the realistic target for 9.6 might be the first one. > Thank you for suggestion. I agree with first version, and attached the updated patch which are modified so that it supports simple multiple sync replication you suggested. (but test cases are not included yet.) Regards, -- Masahiko Sawada
Attachment
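[Editor's note: as a rough illustration of the "first version" semantics discussed above — not the actual C patch, and all names here are hypothetical — the commit wait boils down to counting how many of the listed sync standbys have flushed past the commit LSN:]

```python
def sync_rep_satisfied(commit_lsn, standby_flush_lsns, sync_standby_num):
    """Return True once at least sync_standby_num of the connected
    sync standbys report a flush LSN at or beyond commit_lsn.

    LSNs are plain integers here for simplicity; the real code would
    compare XLogRecPtr values inside the syncrep wait loop.
    """
    acked = sum(1 for lsn in standby_flush_lsns if lsn >= commit_lsn)
    return acked >= sync_standby_num
```

A committing backend would re-evaluate this each time a walsender reports new progress, and be released from its wait as soon as it returns True.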
On Mon, Feb 1, 2016 at 9:28 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > So what about the following plan? > > [first version] > Add only synchronous_standby_num which specifies the number of standbys > that the master must wait for before marking sync replication as completed. > This version supports simple use cases like "I want to have two synchronous > standbys". > > [second version] > Add synchronous_replication_method: 'prioriry' and 'quorum'. This version > additionally supports simple quorum commit case like "I want to ensure > that WAL is replicated synchronously to at least two standbys from five > ones listed in s_s_names". > > Or > > Add something like quorum_replication_num and quorum_standby_names, i.e., > the master must wait for at least q_r_num standbys from ones listed in > q_s_names before marking sync replication as completed. Also the master > must wait for sync replication according to s_s_num and s_s_num. > That is, this approach separates 'priority' and 'quorum' to each parameters. > This increases the number of GUC parameters, but ISTM less confusing, and > it supports a bit complicated case like "there is one local standby and three > remote standbys, then I want to ensure that WAL is replicated synchronously > to the local standby and at least two remote one", e.g., > > s_s_num = 1, s_s_names = 'local' > q_s_num = 2, q_s_names = 'remote1, remote2, remote3' > > [third version] > Add the hooks for more complicated sync replication cases. -1. We're wrapping ourselves around the axle here and ending up with a design that will not let someone say "the local standby and at least one remote standby" without writing C code. I understand nobody likes the mini-language I proposed and nobody likes a JSON configuration file either. I also understand that either of those things would allow ridiculously complicated configurations that nobody will ever need in the real world. 
But I think "one local and one remote" is a fairly common case and that you shouldn't need a PhD in PostgreSQLology to configure it. Also, to be frank, I think we ought to be putting more effort into another patch in this same area, specifically Thomas Munro's causal reads patch. I think a lot of people today are trying to use synchronous replication to build load-balancing clusters and avoid the problem where you write some data and then read back stale data from a standby server. Of course, our current synchronous replication facilities make no such guarantees - his patch does, and I think that's pretty important. I'm not saying that we shouldn't do this too, of course. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 3, 2016 at 7:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Feb 1, 2016 at 9:28 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> So what about the following plan? >> >> [first version] >> Add only synchronous_standby_num which specifies the number of standbys >> that the master must wait for before marking sync replication as completed. >> This version supports simple use cases like "I want to have two synchronous >> standbys". >> >> [second version] >> Add synchronous_replication_method: 'prioriry' and 'quorum'. This version >> additionally supports simple quorum commit case like "I want to ensure >> that WAL is replicated synchronously to at least two standbys from five >> ones listed in s_s_names". >> >> Or >> >> Add something like quorum_replication_num and quorum_standby_names, i.e., >> the master must wait for at least q_r_num standbys from ones listed in >> q_s_names before marking sync replication as completed. Also the master >> must wait for sync replication according to s_s_num and s_s_num. >> That is, this approach separates 'priority' and 'quorum' to each parameters. >> This increases the number of GUC parameters, but ISTM less confusing, and >> it supports a bit complicated case like "there is one local standby and three >> remote standbys, then I want to ensure that WAL is replicated synchronously >> to the local standby and at least two remote one", e.g., >> >> s_s_num = 1, s_s_names = 'local' >> q_s_num = 2, q_s_names = 'remote1, remote2, remote3' >> >> [third version] >> Add the hooks for more complicated sync replication cases. > > -1. We're wrapping ourselves around the axle here and ending up with > a design that will not let someone say "the local standby and at least > one remote standby" without writing C code. I understand nobody likes > the mini-language I proposed and nobody likes a JSON configuration > file either. 
I also understand that either of those things would > allow ridiculously complicated configurations that nobody will ever > need in the real world. But I think "one local and one remote" is a > fairly common case and that you shouldn't need a PhD in > PostgreSQLology to configure it. So you disagree with only third version that I proposed, i.e., adding some hooks for sync replication? If yes and you're OK with the first and second versions, ISTM that we almost reached consensus on the direction of multiple sync replication feature. The first version can cover "one local and one remote sync standbys" case, and the second can cover "one local and at least one from several remote standbys" case. I'm thinking to focus on the first version now, and then we can work on the second to support the quorum commit Regards, -- Fujii Masao
On Tue, Feb 2, 2016 at 8:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > So you disagree with only third version that I proposed, i.e., > adding some hooks for sync replication? If yes and you're OK > with the first and second versions, ISTM that we almost reached > consensus on the direction of multiple sync replication feature. > The first version can cover "one local and one remote sync standbys" case, > and the second can cover "one local and at least one from several remote > standbys" case. I'm thinking to focus on the first version now, > and then we can work on the second to support the quorum commit Well, I think the only hard part of the third problem is deciding on what syntax to use. It seems like a waste of time to me to go to a bunch of trouble to implement #1 and #2 using one syntax and then have to invent a whole new syntax for #3. Seriously, this isn't that hard: it's not a technical problem. It's just that we've got a bunch of people who can't agree on what syntax to use. IMO, you should just pick something. You're presumably the committer for this patch, and I think you should just decide which of the 47,123 things proposed so far is best and insist on that. I trust that you will make a good decision even if it's different than the decision that I would have made. Now, if it's easier to implement a subset of that syntax first and then extend it later, fine. But it makes no sense to me to implement the easy cases without having some idea of how you're go to extend that to the hard cases. Then you'll just end up with a mishmash. Pick something that can be extended to handle all of the plausible cases, whether it's a mini-language or a JSON blob or a pg_hba.conf-type file or some other crazy thing that you invent, and just do it and be done with it. We've wasted far too much time trying to reach consensus on this: it's time for you to exercise your vast dictatorial power. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 3, 2016 at 11:00 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Feb 2, 2016 at 8:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> So you disagree with only third version that I proposed, i.e., >> adding some hooks for sync replication? If yes and you're OK >> with the first and second versions, ISTM that we almost reached >> consensus on the direction of multiple sync replication feature. >> The first version can cover "one local and one remote sync standbys" case, >> and the second can cover "one local and at least one from several remote >> standbys" case. I'm thinking to focus on the first version now, >> and then we can work on the second to support the quorum commit > > Well, I think the only hard part of the third problem is deciding on > what syntax to use. It seems like a waste of time to me to go to a > bunch of trouble to implement #1 and #2 using one syntax and then have > to invent a whole new syntax for #3. Seriously, this isn't that hard: > it's not a technical problem. It's just that we've got a bunch of > people who can't agree on what syntax to use. IMO, you should just > pick something. You're presumably the committer for this patch, and I > think you should just decide which of the 47,123 things proposed so > far is best and insist on that. I trust that you will make a good > decision even if it's different than the decision that I would have > made. If we use one syntax for every cases, possible approaches that we can choose are mini-language, json, etc. Since my previous proposal covers only very simple cases, extra syntax needs to be supported for more complicated cases. My plan was to add the hooks so that the developers can choose their own syntax. But which might confuse users. Now I'm thinking that mini-language is better choice. A json has some good points, but its big problem is that the setting value is likely to be very long. 
For example, when the master needs to wait for one local standby and at least one from three remote standbys in London data center, the setting value (synchronous_standby_names) would be s_s_names = '{"priority":2, "nodes":["local1", {"quorum":1, "nodes":["london1", "london2", "london3"]}]}' OTOH, the value with mini-language is simple and not so long as follows. s_s_names = '2[local1, 1(london1, london2, london3)]' This is why I'm now thinking that mini-language is better. But it's not easy to completely implement mini-language. There seems to be many problems that we need to resolve. For example, please imagine the case where the master needs to wait for at least one from two standbys "tokyo1", "tokyo2" in Tokyo data center. If Tokyo data center fails, the master needs to wait for at least one from two standbys "london1", "london2" in London data center, instead. This case can be configured as follows in mini-language. s_s_names = '1[1(tokyo1, tokyo2), 1(london1, london2)]' One problem here is; what pg_stat_replication.sync_state value should be shown for each standbys? Which standby should be marked as sync? potential? any other value like quorum? The current design of pg_stat_replication doesn't fit complicated sync replication cases, so maybe we need to separate it into several views. It's almost impossible to complete those problems. My current plan for 9.6 is to support the minimal subset of mini-language; simple syntax of "<number>[name, ...]". "<number>" specifies the number of sync standbys that the master needs to wait for. "[name, ...]" specifies the priorities of the listed standbys. This first version supports neither quorum commit nor nested sync replication configuration like "<number>[name, <number>[name, ...]]". It just supports very simple "1-level" configuration. Regards, -- Fujii Masao
On 4 February 2016 at 14:34, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Feb 3, 2016 at 11:00 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Feb 2, 2016 at 8:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> So you disagree with only third version that I proposed, i.e., >>> adding some hooks for sync replication? If yes and you're OK >>> with the first and second versions, ISTM that we almost reached >>> consensus on the direction of multiple sync replication feature. >>> The first version can cover "one local and one remote sync standbys" case, >>> and the second can cover "one local and at least one from several remote >>> standbys" case. I'm thinking to focus on the first version now, >>> and then we can work on the second to support the quorum commit >> >> Well, I think the only hard part of the third problem is deciding on >> what syntax to use. It seems like a waste of time to me to go to a >> bunch of trouble to implement #1 and #2 using one syntax and then have >> to invent a whole new syntax for #3. Seriously, this isn't that hard: >> it's not a technical problem. It's just that we've got a bunch of >> people who can't agree on what syntax to use. IMO, you should just >> pick something. You're presumably the committer for this patch, and I >> think you should just decide which of the 47,123 things proposed so >> far is best and insist on that. I trust that you will make a good >> decision even if it's different than the decision that I would have >> made. > > If we use one syntax for every cases, possible approaches that we can choose > are mini-language, json, etc. Since my previous proposal covers only very > simple cases, extra syntax needs to be supported for more complicated cases. > My plan was to add the hooks so that the developers can choose their own > syntax. But which might confuse users. > > Now I'm thinking that mini-language is better choice. 
A json has some good > points, but its big problem is that the setting value is likely to be very long. > For example, when the master needs to wait for one local standby and > at least one from three remote standbys in London data center, the setting > value (synchronous_standby_names) would be > > s_s_names = '{"priority":2, "nodes":["local1", {"quorum":1, > "nodes":["london1", "london2", "london3"]}]}' > > OTOH, the value with mini-language is simple and not so long as follows. > > s_s_names = '2[local1, 1(london1, london2, london3)]' > > This is why I'm now thinking that mini-language is better. But it's not easy > to completely implement mini-language. There seems to be many problems > that we need to resolve. For example, please imagine the case where > the master needs to wait for at least one from two standbys "tokyo1", "tokyo2" > in Tokyo data center. If Tokyo data center fails, the master needs to > wait for at least one from two standbys "london1", "london2" in London > data center, instead. This case can be configured as follows in mini-language. > > s_s_names = '1[1(tokyo1, tokyo2), 1(london1, london2)]' > > One problem here is; what pg_stat_replication.sync_state value should be > shown for each standbys? Which standby should be marked as sync? potential? > any other value like quorum? The current design of pg_stat_replication > doesn't fit complicated sync replication cases, so maybe we need to separate > it into several views. It's almost impossible to complete those problems. > > My current plan for 9.6 is to support the minimal subset of mini-language; > simple syntax of "<number>[name, ...]". "<number>" specifies the number of > sync standbys that the master needs to wait for. "[name, ...]" specifies > the priorities of the listed standbys. This first version supports neither > quorum commit nor nested sync replication configuration like > "<number>[name, <number>[name, ...]]". It just supports very simple > "1-level" configuration. 
Whatever the solution, I really don't like the idea of changing the definition of s_s_names based on the value of another GUC, mainly because it seems hacky, but also because the name of the GUC stops making sense. Thom
On Thu, Feb 4, 2016 at 9:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Now I'm thinking that mini-language is better choice. A json has some good > points, but its big problem is that the setting value is likely to be very long. > For example, when the master needs to wait for one local standby and > at least one from three remote standbys in London data center, the setting > value (synchronous_standby_names) would be > > s_s_names = '{"priority":2, "nodes":["local1", {"quorum":1, > "nodes":["london1", "london2", "london3"]}]}' > > OTOH, the value with mini-language is simple and not so long as follows. > > s_s_names = '2[local1, 1(london1, london2, london3)]' Yeah, that was my thought also. Another idea which was suggested is to create a completely new configuration file for this. Most people would only have simple stuff in there, of course, but then you could have the information spread across multiple lines. I don't in the end care very much about how we solve this problem. But I'm glad you agree that whatever we do to solve the simple problem should be a logical subset of what the full solution will eventually look like, not a completely different design. I think that's important. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 4, 2016 at 7:27 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I don't in the end care very much about how we solve this problem. > But I'm glad you agree that whatever we do to solve the simple problem > should be a logical subset of what the full solution will eventually > look like, not a completely different design. I think that's > important. Yes, please let's use the custom language, and let's not care of not more than 1 level of nesting so as it is possible to represent pg_stat_replication in a simple way for the user. -- Michael
On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > Yes, please let's use the custom language, and let's not care of not > more than 1 level of nesting so as it is possible to represent > pg_stat_replication in a simple way for the user. "not" is used twice in this sentence in a way that renders me not able to be sure that I'm not understanding it not properly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> Yes, please let's use the custom language, and let's not care of not >> more than 1 level of nesting so as it is possible to represent >> pg_stat_replication in a simple way for the user. > > "not" is used twice in this sentence in a way that renders me not able > to be sure that I'm not understanding it not properly. 4 times here. Score beaten. Sorry. Perhaps I am tired... I was just wondering if it would be fine to only support configurations up to one level of nested objects, like that: 2[node1, node2, node3] node1, 2[node2, node3], node3 In short, we could restrict things so as we cannot define a group of nodes within an existing group. -- Michael
On Thu, Feb 4, 2016 at 2:49 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> Yes, please let's use the custom language, and let's not care of not >>> more than 1 level of nesting so as it is possible to represent >>> pg_stat_replication in a simple way for the user. >> >> "not" is used twice in this sentence in a way that renders me not able >> to be sure that I'm not understanding it not properly. > > 4 times here. Score beaten. > > Sorry. Perhaps I am tired... I was just wondering if it would be fine > to only support configurations up to one level of nested objects, like > that: > 2[node1, node2, node3] > node1, 2[node2, node3], node3 > In short, we could restrict things so as we cannot define a group of > nodes within an existing group. I see. Such a restriction doesn't seem likely to me to prevent people from doing anything actually useful. But I don't know that it buys very much either. It's often not very much simpler to handle 2 levels than n levels. However, I ain't writing the code so... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> Yes, please let's use the custom language, and let's not care of not >>> more than 1 level of nesting so as it is possible to represent >>> pg_stat_replication in a simple way for the user. >> >> "not" is used twice in this sentence in a way that renders me not able >> to be sure that I'm not understanding it not properly. > > 4 times here. Score beaten. > > Sorry. Perhaps I am tired... I was just wondering if it would be fine > to only support configurations up to one level of nested objects, like > that: > 2[node1, node2, node3] > node1, 2[node2, node3], node3 > In short, we could restrict things so as we cannot define a group of > nodes within an existing group. No, actually, that's stupid. Having up to two nested levels makes more sense, a quite common case for this feature being something like that: 2{node1,[node2,node3]} In short, sync confirmation is waited from node1 and (node2 or node3). Flattening groups of nodes with a new catalog will be necessary to ease the view of this data to users: - group name? - array of members with nodes/groups - group type: quorum or priority - number of items to wait for in this group -- Michael
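[Editor's note: one way to picture the flattening Michael describes — a hypothetical sketch, the row layout here is invented for illustration and is not proposed catalog DDL — is to walk the nested definition and emit one catalog-like row per group, with nested groups appearing as generated member names:]

```python
def flatten_groups(node, name='sync_top', rows=None):
    """Flatten a nested sync-rep definition into catalog-like rows.

    node is a dict {'type': 'quorum'|'priority', 'num': N,
    'members': [...]} whose members are standby names or nested
    group dicts, matching the mini-language structure.
    """
    if rows is None:
        rows = []
    row = {'group': name, 'type': node['type'],
           'num_sync': node['num'], 'members': []}
    rows.append(row)
    for i, child in enumerate(node['members']):
        if isinstance(child, dict):
            sub_name = '%s_%d' % (name, i)
            row['members'].append(sub_name)  # reference the nested group
            flatten_groups(child, sub_name, rows)
        else:
            row['members'].append(child)  # a plain standby name
    return rows
```

For the '2[local1, 1(london1, london2, london3)]' example upthread this yields two rows: a priority group waiting for two of ('local1', nested group) and a quorum group waiting for one of the three London nodes — which is roughly the shape a per-group status view would need.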
Hello,

At Thu, 4 Feb 2016 23:06:45 +0300, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTMV5sZkemGf=SWMyA8QpzV2VW9bRrysXtKzuSVk99ocw@mail.gmail.com>
> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > Sorry. Perhaps I am tired... I was just wondering if it would be fine
> > to only support configurations up to one level of nested objects, like
> > that:
> > 2[node1, node2, node3]
> > node1, 2[node2, node3], node3
> > In short, we could restrict things so as we cannot define a group of
> > nodes within an existing group.
>
> No, actually, that's stupid. Having up to two nested levels makes more
> sense, a quite common case for this feature being something like that:
> 2{node1,[node2,node3]}
> In short, sync confirmation is waited from node1 and (node2 or node3).
>
> Flattening groups of nodes with a new catalog will be necessary to
> ease the view of this data to users:
> - group name?
> - array of members with nodes/groups
> - group type: quorum or priority
> - number of items to wait for in this group

Though I personally love the format, I don't fully understand what the upcoming consensus is, and the discussion looks to be looping back to the past, so please forgive me for confirming the current discussion status.

We are coming to agree on a configuration manner, including syntax, which is compatible with possible future use. I think this is correct.

(Though I haven't seen it explicitly written upthread,) we regard it as important to keep the previous s_s_names setting valid as a simple priority method. Is this correct?

The most promising syntax is now considered to be n-level quorum/priority nesting, as in Michael's proposal above. Correct? But aiming at 9.6, we are to support a (1 or 2)-level quorum *or* priority setup with a subset of the syntax. I don't think this is fully agreed yet.

We don't consider using an extension or some plugin mechanism as an additional configuration method for this feature, at least as of 9.6. Correct?

I proposed s_s_method for backward compatibility, but there is a voice saying that such a way of changing the semantics of s_s_names is confusing. I can sympathize with that. If so, should we have another variable (named standbys_definition or the like?) which is set alternatively with s_s_names? Or take another way?

Sorry for the maybe-noise in advance.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier >>> <michael.paquier@gmail.com> wrote: >>>> Yes, please let's use the custom language, and let's not care of not >>>> more than 1 level of nesting so as it is possible to represent >>>> pg_stat_replication in a simple way for the user. >>> >>> "not" is used twice in this sentence in a way that renders me not able >>> to be sure that I'm not understanding it not properly. >> >> 4 times here. Score beaten. >> >> Sorry. Perhaps I am tired... I was just wondering if it would be fine >> to only support configurations up to one level of nested objects, like >> that: >> 2[node1, node2, node3] >> node1, 2[node2, node3], node3 >> In short, we could restrict things so as we cannot define a group of >> nodes within an existing group. > > No, actually, that's stupid. Having up to two nested levels makes more > sense, a quite common case for this feature being something like that: > 2{node1,[node2,node3]} > In short, sync confirmation is waited from node1 and (node2 or node3). > > Flattening groups of nodes with a new catalog will be necessary to > ease the view of this data to users: > - group name? > - array of members with nodes/groups > - group type: quorum or priority > - number of items to wait for in this group So, here are some thoughts to make that more user-friendly. I think that the critical issue here is to properly flatten the meta data in the custom language and represent it properly in a new catalog, without messing up too much with the existing pg_stat_replication that people are now used to for 5 releases since 9.0. 
So, I would think that we will need to have a new catalog, say pg_stat_replication_groups, with the following things:
- One line of this catalog represents the status of a group or of a single node.
- The status of a node/group is either sync or potential. If a node/group is specified more than once, it may be possible that it would be sync and potential depending on where it is defined, in which case setting its status to 'sync' makes the most sense, if it is in sync state I guess.
- Move sync_priority and sync_state, actually an equivalent, from pg_stat_replication into this new catalog, because those represent the status of a node or group of nodes.
- Group name, and by that I think that we had perhaps better make it mandatory to attach a name to a quorum or priority group. The group at the highest level is forcibly named 'top', 'main', or whatever if not directly specified by the user. If the entry is directly a node, use the application_name.
- Type of group, quorum or priority.
- Elements in this group. An element can be a group name or a node name, aka application_name. If the group is of type priority, the elements are listed in increasing order, so the elements with lower priority come first, etc. We could have one column explicitly listing integers that map to the elements of a group, but it does not seem worth it; what users would like to know is which nodes are prioritized. This covers the former 'priority' field of pg_stat_replication.

We may have a good idea of how to define a custom language, but we are still going to need to design a clean interface at the catalog level, more or less close to what is written here. If we can get a clean interface, the custom language implemented, and TAP tests that take advantage of this user interface to check the node/group statuses, I guess that we would be in good shape for this patch.

Anyway, that's not a small project, and perhaps I am over-complicating the whole thing.

Thoughts?
--
Michael
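The flattening described above can be sketched roughly as follows. This is a Python illustration only: the tree encoding, the row tuple layout, and generated group names such as "main_g1" are my assumptions, not anything from a posted patch.

```python
# Illustrative sketch: flatten a nested sync-group definition into
# catalog-like rows of (name, group_type, wait_num, members).
# The input is a pre-parsed tree: a string is a standby's
# application_name; a tuple is (group_type, wait_num, members).

def flatten(node, name="main", rows=None):
    """Walk a group tree depth-first, emitting one row per group or node."""
    if rows is None:
        rows = []
    if isinstance(node, str):
        # A plain standby, identified by its application_name.
        rows.append((node, None, 0, None))
        return rows
    gtype, wait_num, members = node
    # Unnamed subgroups get a synthetic name (an assumption here);
    # the thread discusses making group names mandatory instead.
    member_names = [m if isinstance(m, str) else "%s_g%d" % (name, i)
                    for i, m in enumerate(members)]
    rows.append((name, gtype, wait_num, member_names))
    for i, m in enumerate(members):
        if isinstance(m, str):
            rows.append((m, None, 0, None))
        else:
            flatten(m, "%s_g%d" % (name, i), rows)
    return rows

# Michael's example 2{node1,[node2,node3]}: wait for node1 and (node2 or node3).
tree = ("priority", 2, ["node1", ("quorum", 1, ["node2", "node3"])])
for row in flatten(tree):
    print(row)
```

One row per group or node, as in the proposed catalog; the nested quorum group becomes its own row whose members are plain node names.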
> We may have a good idea of how to define a custom language, still we
> are going to need to design a clean interface at catalog level more or
> less close to what is written here. If we can get a clean interface,
> the custom language implemented, and TAP tests that take advantage of
> this user interface to check the node/group statuses, I guess that we
> would be in good shape for this patch.
>
> Anyway that's not a small project, and perhaps I am over-complicating
> the whole thing.

Yes. The more I look at this, the worse the idea of custom syntax looks. Yes, I realize there are drawbacks to using JSON, but this is worse.

Further, there's a lot of horse-cart inversion here. This proposal involves letting the syntax for sync_list configuration determine the feature set for N-sync. That's backwards; we should decide the total list of features we want to support, and then adopt a syntax which will make it possible to have them.

--
Josh Berkus
Red Hat OSAS (opinions are my own)
On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier >>>> <michael.paquier@gmail.com> wrote: >>>>> Yes, please let's use the custom language, and let's not care of not >>>>> more than 1 level of nesting so as it is possible to represent >>>>> pg_stat_replication in a simple way for the user. >>>> >>>> "not" is used twice in this sentence in a way that renders me not able >>>> to be sure that I'm not understanding it not properly. >>> >>> 4 times here. Score beaten. >>> >>> Sorry. Perhaps I am tired... I was just wondering if it would be fine >>> to only support configurations up to one level of nested objects, like >>> that: >>> 2[node1, node2, node3] >>> node1, 2[node2, node3], node3 >>> In short, we could restrict things so as we cannot define a group of >>> nodes within an existing group. >> >> No, actually, that's stupid. Having up to two nested levels makes more >> sense, a quite common case for this feature being something like that: >> 2{node1,[node2,node3]} >> In short, sync confirmation is waited from node1 and (node2 or node3). >> >> Flattening groups of nodes with a new catalog will be necessary to >> ease the view of this data to users: >> - group name? >> - array of members with nodes/groups >> - group type: quorum or priority >> - number of items to wait for in this group > > So, here are some thoughts to make that more user-friendly. I think > that the critical issue here is to properly flatten the meta data in > the custom language and represent it properly in a new catalog, > without messing up too much with the existing pg_stat_replication that > people are now used to for 5 releases since 9.0. 
So, I would think > that we will need to have a new catalog, say > pg_stat_replication_groups with the following things: > - One line of this catalog represents the status of a group or of a single node. > - The status of a node/group is either sync or potential, if a > node/group is specified more than once, it may be possible that it > would be sync and potential depending on where it is defined, in which > case setting its status to 'sync' has the most sense. If it is in sync > state I guess. > - Move sync_priority and sync_state, actually an equivalent from > pg_stat_replication into this new catalog, because those represent the > status of a node or group of nodes. > - group name, and by that I think that we had perhaps better make > mandatory the need to append a name with a quorum or priority group. > The group at the highest level is forcibly named as 'top', 'main', or > whatever if not directly specified by the user. If the entry is > directly a node, use the application_name. > - Type of group, quorum or priority > - Elements in this group, an element can be a group name or a node > name, aka application_name. If group is of type priority, the elements > are listed in increasing order. So the elements with lower priority > get first, etc. We could have one column listing explicitly a list of > integers that map with the elements of a group but it does not seem > worth it, what users would like to know is what are the nodes that are > prioritized. This covers the former 'priority' field of > pg_stat_replication. > > We may have a good idea of how to define a custom language, still we > are going to need to design a clean interface at catalog level more or > less close to what is written here. If we can get a clean interface, > the custom language implemented, and TAP tests that take advantage of > this user interface to check the node/group statuses, I guess that we > would be in good shape for this patch. 
>
> Anyway that's not a small project, and perhaps I am over-complicating
> the whole thing.
>

I agree with adding a new system catalog so that users can easily check replication status, and a group name will be needed for this. What about adding the group name with ":" immediately after a set of standbys, like this?

2[local, 2[london1, london2, london3]:london, (tokyo1, tokyo2):tokyo]

Also, regarding sync replication according to configuration, the view I'm thinking of has the following definition.

=# \d pg_synchronous_replication
     Column     |  Type   | Modifiers
----------------+---------+-----------
 name           | text    |
 sync_type      | text    |
 wait_num       | integer |
 sync_priority  | integer |
 sync_state     | text    |
 member         | text[]  |
 level          | integer |
 write_location | pg_lsn  |
 flush_location | pg_lsn  |
 apply_location | pg_lsn  |

- "name" : node name or group name, or "main" meaning the top-level node.
- "sync_type" : 'priority' or 'quorum' for a group node, otherwise NULL.
- "wait_num" : number of nodes/groups to wait for in this group.
- "sync_priority" : priority of the node/group within its group. The "main" node has 0.
    - A standby in a quorum group always has priority 1.
    - A standby in a priority group has priority according to definition order.
- "sync_state" : 'sync', 'potential', or 'quorum'.
    - A standby in a quorum group is always 'quorum'.
    - A standby in a priority group is 'sync' / 'potential'.
- "member" : array of members for a group node, otherwise NULL.
- "level" : nesting level. The "main" node is level 0.
- "write/flush/apply_location" : group/node LSN calculated according to the configuration.
When sync replication is set up as above, the new system view shows:

=# select * from pg_stat_replication_group;
  name   | sync_type | wait_num | sync_priority | sync_state |          member           | level | write_location | flush_location | apply_location
---------+-----------+----------+---------------+------------+---------------------------+-------+----------------+----------------+----------------
 main    | priority  |        2 |             0 | sync       | {local,london,tokyo}      |     0 |                |                |
 local   |           |        0 |             1 | sync       |                           |     1 |                |                |
 london  | quorum    |        2 |             2 | potential  | {london1,london2,london3} |     1 |                |                |
 london1 |           |        0 |             1 | potential  |                           |     2 |                |                |
 london2 |           |        0 |             2 | potential  |                           |     2 |                |                |
 london3 |           |        0 |             3 | potential  |                           |     2 |                |                |
 tokyo   | quorum    |        1 |             3 | potential  | {tokyo1,tokyo2}           |     1 |                |                |
 tokyo1  |           |        0 |             1 | quorum     |                           |     2 |                |                |
 tokyo2  |           |        0 |             1 | quorum     |                           |     2 |                |                |
(9 rows)

Thoughts?

Regards,

--
Masahiko Sawada
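The "group/node calculated LSN according to configuration" column implies per-group LSN arithmetic. A hedged Python sketch of one plausible rule follows — the rule and the function are my assumptions for illustration, not taken from any posted patch: a quorum group waiting for n members can acknowledge the n-th highest member LSN, while a priority group waits on its n highest-priority members.

```python
# Illustrative sketch of group-level LSN computation (rule assumed,
# not taken from the patch).  LSNs are plain integers here; real
# code would use XLogRecPtr.

def group_lsn(gtype, wait_num, member_lsns):
    """member_lsns: member flush LSNs in definition order (which is
    priority order for a priority group).  Returns the LSN the group
    as a whole can acknowledge as flushed."""
    if gtype == "quorum":
        # The n-th highest flush position is safely on n members.
        return sorted(member_lsns, reverse=True)[wait_num - 1]
    elif gtype == "priority":
        # Wait for the first wait_num members by priority, so the
        # group is only as far along as the slowest of them.
        return min(member_lsns[:wait_num])
    raise ValueError("unknown group type: %r" % gtype)

# 2[london1, london2, london3] as a quorum group:
print(group_lsn("quorum", 2, [120, 100, 130]))    # -> 120
# The same members as a priority group waiting for 2:
print(group_lsn("priority", 2, [120, 100, 130]))  # -> 100
```

Nested groups would feed each computed group LSN back in as a member LSN of the enclosing group.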
On Fri, Feb 5, 2016 at 12:19 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> I agree with adding new system catalog to easily checking replication
> status for user. And group name will needed for this.
> What about adding group name with ":" to immediately after set of
> standbys like follows?

This way is fine for me.

> 2[local, 2[london1, london2, london3]:london, (tokyo1, tokyo2):tokyo]
>
> Also, regarding sync replication according to configuration, the view
> I'm thinking is following definition.
>
> =# \d pg_synchronous_replication
>      Column     |  Type   | Modifiers
> ----------------+---------+-----------
>  name           | text    |
>  sync_type      | text    |
>  wait_num       | integer |
>  sync_priority  | integer |
>  sync_state     | text    |
>  member         | text[]  |
>  level          | integer |
>  write_location | pg_lsn  |
>  flush_location | pg_lsn  |
>  apply_location | pg_lsn  |
>
> - "name" : node name or group name, or "main" meaning top level node.

Check.

> - "sync_type" : 'priority' or 'quorum' for group node, otherwise NULL.

That would be one or the other.

> - "wait_num" : number of nodes/groups to wait for in this group.

Check. This is taken directly from the meta data.

> - "sync_priority" : priority of node/group in this group. "main" node has "0".
>     - the standby in a quorum group always has priority 1.
>     - the standby in a priority group has priority according to definition order.

This is a bit confusing if the same node or group is in multiple groups. My previous suggestion was to list the elements of the group in increasing order of priority. That's an important point.

> - "sync_state" : 'sync' or 'potential' or 'quorum'.
>     - the standby in a quorum group is always 'quorum'.
>     - the standby in a priority group is 'sync' / 'potential'.

potential and quorum are the same thing, no? The only difference is based on the group type here.

> - "member" : array of members for group node, otherwise NULL.

This can be NULL only when the entry is a node.

> - "level" : nested level. "main" node is level 0.

Not sure this one is necessary.

> - "write/flush/apply_location" : group/node calculated LSN according
> to configuration.

This does not need to be part of this catalog, that's a representation of the data that is part of the WAL sender.

--
Michael
Hello,

I have tested the v7 patch, but I think you forgot to remove some debug code from src/backend/replication/syncrep.c:

  for (i = 0; i < num_sync; i++)
+ {
+     elog(WARNING, "sync_standbys[%d] = %d", i, sync_standbys[i]);
+ }
+ elog(WARNING, "num_sync = %d, s_s_num = %d", num_sync, synchronous_standby_num);

Please correct my understanding if I am wrong.

Regards
Suraj Kharage

--
View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5886259.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier >>>> <michael.paquier@gmail.com> wrote: >>>>> Yes, please let's use the custom language, and let's not care of not >>>>> more than 1 level of nesting so as it is possible to represent >>>>> pg_stat_replication in a simple way for the user. >>>> >>>> "not" is used twice in this sentence in a way that renders me not able >>>> to be sure that I'm not understanding it not properly. >>> >>> 4 times here. Score beaten. >>> >>> Sorry. Perhaps I am tired... I was just wondering if it would be fine >>> to only support configurations up to one level of nested objects, like >>> that: >>> 2[node1, node2, node3] >>> node1, 2[node2, node3], node3 >>> In short, we could restrict things so as we cannot define a group of >>> nodes within an existing group. >> >> No, actually, that's stupid. Having up to two nested levels makes more >> sense, a quite common case for this feature being something like that: >> 2{node1,[node2,node3]} >> In short, sync confirmation is waited from node1 and (node2 or node3). >> >> Flattening groups of nodes with a new catalog will be necessary to >> ease the view of this data to users: >> - group name? >> - array of members with nodes/groups >> - group type: quorum or priority >> - number of items to wait for in this group > > So, here are some thoughts to make that more user-friendly. I think > that the critical issue here is to properly flatten the meta data in > the custom language and represent it properly in a new catalog, > without messing up too much with the existing pg_stat_replication that > people are now used to for 5 releases since 9.0. 
So, I would think > that we will need to have a new catalog, say > pg_stat_replication_groups with the following things: > - One line of this catalog represents the status of a group or of a single node. > - The status of a node/group is either sync or potential, if a > node/group is specified more than once, it may be possible that it > would be sync and potential depending on where it is defined, in which > case setting its status to 'sync' has the most sense. If it is in sync > state I guess. > - Move sync_priority and sync_state, actually an equivalent from > pg_stat_replication into this new catalog, because those represent the > status of a node or group of nodes. > - group name, and by that I think that we had perhaps better make > mandatory the need to append a name with a quorum or priority group. > The group at the highest level is forcibly named as 'top', 'main', or > whatever if not directly specified by the user. If the entry is > directly a node, use the application_name. > - Type of group, quorum or priority > - Elements in this group, an element can be a group name or a node > name, aka application_name. If group is of type priority, the elements > are listed in increasing order. So the elements with lower priority > get first, etc. We could have one column listing explicitly a list of > integers that map with the elements of a group but it does not seem > worth it, what users would like to know is what are the nodes that are > prioritized. This covers the former 'priority' field of > pg_stat_replication. > > We may have a good idea of how to define a custom language, still we > are going to need to design a clean interface at catalog level more or > less close to what is written here. If we can get a clean interface, > the custom language implemented, and TAP tests that take advantage of > this user interface to check the node/group statuses, I guess that we > would be in good shape for this patch. 
>
> Anyway that's not a small project, and perhaps I am over-complicating
> the whole thing.
>
> Thoughts?

I agree that we would need something like such a new view in the future; however, it seems too late to work on that for 9.6, unfortunately. There is only one CommitFest left. Let's focus on the very simple case, i.e., a 1-level priority list, now; then we can extend it to cover other cases.

If we can commit the simple version early enough and there is enough time before the date of feature freeze, of course I'm happy to review the extended version like you proposed, for 9.6.

Regards,

--
Fujii Masao
Hello,
>> I agree with first version, and attached the updated patch which are
>> modified so that it supports simple multiple sync replication you
>>suggested.
>> (but test cases are not included yet.)
I have tried writing some basic built-in test cases for multi-sync rep.
I have created one patch over Michael's patch (http://www.postgresql.org/message-id/CAB7nPqTEqou=[hidden email]).
It is still in progress.
Please have a look, correct me if I am wrong, and suggest remaining test cases.
+my $result = $node_master->psql('postgres', "select application_name, sync_state from pg_stat_replication;");
+print "$result \n";
+is($result, "standby_1|sync\nstandby_2|sync\nstandby_3|potential", 'checked for sync standbys state initially');
Attachment
Hello, At Tue, 9 Feb 2016 00:48:57 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHnTKmd90Vu19Swu0C+2mnWxvAH=1FE=-xUbo3s94pRRg@mail.gmail.com> > On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: > > On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier > > <michael.paquier@gmail.com> wrote: > >> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier > >> <michael.paquier@gmail.com> wrote: > >>> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote: > >>>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier > >>>> <michael.paquier@gmail.com> wrote: > >>>>> Yes, please let's use the custom language, and let's not care of not > >>>>> more than 1 level of nesting so as it is possible to represent > >>>>> pg_stat_replication in a simple way for the user. > >>>> > >>>> "not" is used twice in this sentence in a way that renders me not able > >>>> to be sure that I'm not understanding it not properly. > >>> > >>> 4 times here. Score beaten. > >>> > >>> Sorry. Perhaps I am tired... I was just wondering if it would be fine > >>> to only support configurations up to one level of nested objects, like > >>> that: > >>> 2[node1, node2, node3] > >>> node1, 2[node2, node3], node3 > >>> In short, we could restrict things so as we cannot define a group of > >>> nodes within an existing group. > >> > >> No, actually, that's stupid. Having up to two nested levels makes more > >> sense, a quite common case for this feature being something like that: > >> 2{node1,[node2,node3]} > >> In short, sync confirmation is waited from node1 and (node2 or node3). > >> > >> Flattening groups of nodes with a new catalog will be necessary to > >> ease the view of this data to users: > >> - group name? > >> - array of members with nodes/groups > >> - group type: quorum or priority > >> - number of items to wait for in this group > > > > So, here are some thoughts to make that more user-friendly. 
I think > > that the critical issue here is to properly flatten the meta data in > > the custom language and represent it properly in a new catalog, > > without messing up too much with the existing pg_stat_replication that > > people are now used to for 5 releases since 9.0. So, I would think > > that we will need to have a new catalog, say > > pg_stat_replication_groups with the following things: > > - One line of this catalog represents the status of a group or of a single node. > > - The status of a node/group is either sync or potential, if a > > node/group is specified more than once, it may be possible that it > > would be sync and potential depending on where it is defined, in which > > case setting its status to 'sync' has the most sense. If it is in sync > > state I guess. > > - Move sync_priority and sync_state, actually an equivalent from > > pg_stat_replication into this new catalog, because those represent the > > status of a node or group of nodes. > > - group name, and by that I think that we had perhaps better make > > mandatory the need to append a name with a quorum or priority group. > > The group at the highest level is forcibly named as 'top', 'main', or > > whatever if not directly specified by the user. If the entry is > > directly a node, use the application_name. > > - Type of group, quorum or priority > > - Elements in this group, an element can be a group name or a node > > name, aka application_name. If group is of type priority, the elements > > are listed in increasing order. So the elements with lower priority > > get first, etc. We could have one column listing explicitly a list of > > integers that map with the elements of a group but it does not seem > > worth it, what users would like to know is what are the nodes that are > > prioritized. This covers the former 'priority' field of > > pg_stat_replication. 
> >
> > We may have a good idea of how to define a custom language, still we
> > are going to need to design a clean interface at catalog level more or
> > less close to what is written here. If we can get a clean interface,
> > the custom language implemented, and TAP tests that take advantage of
> > this user interface to check the node/group statuses, I guess that we
> > would be in good shape for this patch.
> >
> > Anyway that's not a small project, and perhaps I am over-complicating
> > the whole thing.
> >
> > Thoughts?
>
> I agree that we would need something like such new view in the future,
> however it seems too late to work on that for 9.6 unfortunately.
> There is only one CommitFest left. Let's focus on very simple case, i.e.,
> 1-level priority list, now, then we can extend it to cover other cases.
>
> If we can commit the simple version too early and there is enough
> time before the date of feature freeze, of course I'm happy to review
> the extended version like you proposed, for 9.6.

I agree with Fujii-san. There would be many convenient gadgets around this, and they are completely welcome, but having the fundamental functionality in 9.6 seems far more beneficial for most of us. As long as the extensible syntax is fixed, the internal structures can be gradually extended along with syntactical enhancements. Deeper nesting levels and group names would be syntactically reserved but not implemented for now. JSON could be added, but it is too complicated for simple cases.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi Suraj, On 2016/02/09 12:16, kharagesuraj wrote: > Hello, > > >>> I agree with first version, and attached the updated patch which are >>> modified so that it supports simple multiple sync replication you >>> suggested. >>> (but test cases are not included yet.) > > I have tried for some basic in-built test cases for multisync rep. > I have created one patch over Michael's <a href="http://www.postgresql.org/message-id/CAB7nPqTEqou=xrYrGSgA13QW1xxsSD6tFHz-Sm_J3EgDvSOCHw@mail.gmail.com">patch</a> patch. > Still it is in progress. > Please have look and correct me if i am wrong and suggest remaining test cases. > > recovery_test_suite_with_multisync.patch (36K) <http://postgresql.nabble.com/attachment/5886503/0/recovery_test_suite_with_multisync.patch> Thanks for creating the patch. Sorry to nitpick but as has been brought up before, it's better to send patches as email attachments (that is, not as a links to external sites). Also, it would be helpful if your patch is submitted as a diff over applying Michael's patch. That is, only the stuff specific to testing the multiple sync feature and let the rest be taken care of by Michael's base patch. Thanks, Amit
On Tue, Feb 9, 2016 at 1:16 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > At Tue, 9 Feb 2016 00:48:57 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHnTKmd90Vu19Swu0C+2mnWxvAH=1FE=-xUbo3s94pRRg@mail.gmail.com> >> On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >> > On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier >> > <michael.paquier@gmail.com> wrote: >> >> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier >> >> <michael.paquier@gmail.com> wrote: >> >>> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >>>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier >> >>>> <michael.paquier@gmail.com> wrote: >> >>>>> Yes, please let's use the custom language, and let's not care of not >> >>>>> more than 1 level of nesting so as it is possible to represent >> >>>>> pg_stat_replication in a simple way for the user. >> >>>> >> >>>> "not" is used twice in this sentence in a way that renders me not able >> >>>> to be sure that I'm not understanding it not properly. >> >>> >> >>> 4 times here. Score beaten. >> >>> >> >>> Sorry. Perhaps I am tired... I was just wondering if it would be fine >> >>> to only support configurations up to one level of nested objects, like >> >>> that: >> >>> 2[node1, node2, node3] >> >>> node1, 2[node2, node3], node3 >> >>> In short, we could restrict things so as we cannot define a group of >> >>> nodes within an existing group. >> >> >> >> No, actually, that's stupid. Having up to two nested levels makes more >> >> sense, a quite common case for this feature being something like that: >> >> 2{node1,[node2,node3]} >> >> In short, sync confirmation is waited from node1 and (node2 or node3). >> >> >> >> Flattening groups of nodes with a new catalog will be necessary to >> >> ease the view of this data to users: >> >> - group name? 
>> >> - array of members with nodes/groups >> >> - group type: quorum or priority >> >> - number of items to wait for in this group >> > >> > So, here are some thoughts to make that more user-friendly. I think >> > that the critical issue here is to properly flatten the meta data in >> > the custom language and represent it properly in a new catalog, >> > without messing up too much with the existing pg_stat_replication that >> > people are now used to for 5 releases since 9.0. So, I would think >> > that we will need to have a new catalog, say >> > pg_stat_replication_groups with the following things: >> > - One line of this catalog represents the status of a group or of a single node. >> > - The status of a node/group is either sync or potential, if a >> > node/group is specified more than once, it may be possible that it >> > would be sync and potential depending on where it is defined, in which >> > case setting its status to 'sync' has the most sense. If it is in sync >> > state I guess. >> > - Move sync_priority and sync_state, actually an equivalent from >> > pg_stat_replication into this new catalog, because those represent the >> > status of a node or group of nodes. >> > - group name, and by that I think that we had perhaps better make >> > mandatory the need to append a name with a quorum or priority group. >> > The group at the highest level is forcibly named as 'top', 'main', or >> > whatever if not directly specified by the user. If the entry is >> > directly a node, use the application_name. >> > - Type of group, quorum or priority >> > - Elements in this group, an element can be a group name or a node >> > name, aka application_name. If group is of type priority, the elements >> > are listed in increasing order. So the elements with lower priority >> > get first, etc. 
We could have one column listing explicitly a list of >> > integers that map with the elements of a group but it does not seem >> > worth it, what users would like to know is what are the nodes that are >> > prioritized. This covers the former 'priority' field of >> > pg_stat_replication. >> > >> > We may have a good idea of how to define a custom language, still we >> > are going to need to design a clean interface at catalog level more or >> > less close to what is written here. If we can get a clean interface, >> > the custom language implemented, and TAP tests that take advantage of >> > this user interface to check the node/group statuses, I guess that we >> > would be in good shape for this patch. >> > >> > Anyway that's not a small project, and perhaps I am over-complicating >> > the whole thing. >> > >> > Thoughts? >> >> I agree that we would need something like such new view in the future, >> however it seems too late to work on that for 9.6 unfortunately. >> There is only one CommitFest left. Let's focus on very simple case, i.e., >> 1-level priority list, now, then we can extend it to cover other cases. >> >> If we can commit the simple version too early and there is enough >> time before the date of feature freeze, of course I'm happy to review >> the extended version like you proposed, for 9.6. > > I agree to Fujii-san. There would be many of convenient gadgets > around this and they are completely welcome, but having > fundamental functionality in 9.6 seems to be far benetifical for > most of us. Hm. Rushing features in because we need them now is not really community-like. I'd rather not have us taking decisions like that knowing that we may pay a certain price in the long-term, while it pays in the short term, aka the 9.6 release. However, having a base in place for the mini-language would give enough room for future improvements, so I am fine with having only 1-level of nesting, with {} and [] supported. 
This can as well be simply represented within pg_stat_replication because we'd have basically only one group of nodes for now (if I got the idea correctly), and the status of each entry in pg_stat_replication would just need to reflect either potential or sync, which is something that users are now used to.

So, if I got the vibe correctly, we would basically just allow the following in a first shot:
N{node_list}, to define a priority group
N[node_list], to define a quorum group
There can be only one group, and elements in a node list cannot be a group. No need of group names either.
--
Michael
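To make the shape of this single-level proposal concrete, such a grammar could be recognized with something as simple as the following sketch. This is purely illustrative hand-rolled C, not the patch's actual flex/bison parser; the names GroupType and parse_group are invented here:

```c
#include <ctype.h>
#include <string.h>
#include <assert.h>

typedef enum { GROUP_PRIORITY, GROUP_QUORUM, GROUP_INVALID } GroupType;

/*
 * Hypothetical sketch: recognize "N{a,b,c}" (priority group) or
 * "N[a,b,c]" (quorum group), per the single-level proposal above.
 * Stores N in *wait_num and returns the group type.  Node names are
 * not validated here; this only shows the outer structure.
 */
GroupType
parse_group(const char *s, int *wait_num)
{
    int         n = 0;
    char        open, close;
    const char *p = s;

    /* leading integer: the number of nodes to wait for */
    while (isdigit((unsigned char) *p))
        n = n * 10 + (*p++ - '0');
    if (n <= 0)
        return GROUP_INVALID;

    /* group delimiter decides the group type */
    open = *p;
    if (open == '{')
        close = '}';
    else if (open == '[')
        close = ']';
    else
        return GROUP_INVALID;

    /* the node list must end with the matching bracket */
    if (s[strlen(s) - 1] != close)
        return GROUP_INVALID;

    *wait_num = n;
    return (open == '{') ? GROUP_PRIORITY : GROUP_QUORUM;
}
```

Note that under this sketch the old comma-separated form would not match either bracket and would need a separate backward-compatibility path, as discussed later in the thread.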
On Wed, Feb 3, 2016 at 7:33 AM, Robert Haas wrote: > Also, to be frank, I think we ought to be putting more effort into > another patch in this same area, specifically Thomas Munro's causal > reads patch. I think a lot of people today are trying to use > synchronous replication to build load-balancing clusters and avoid the > problem where you write some data and then read back stale data from a > standby server. Of course, our current synchronous replication > facilities make no such guarantees - his patch does, and I think > that's pretty important. I'm not saying that we shouldn't do this > too, of course. Yeah, sure. Each one of those patches is trying to solve a different problem where Postgres is deficient, here we'd like to be sure a commit WAL record is correctly flushed on multiple standbys, while the patch of Thomas is trying to ensure that there is no need to scan for the replay position of a standby using some GUC parameters and a validation/sanity layer in syncrep.c to do that. Surely the patch of this thread has got more attention than Thomas', and both of them have merits and try to address real problems. FWIW, the patch of Thomas is a topic that I find rather interesting, and I am planning to look at it as well, perhaps for next CF or even before that. We'll see how other things move on. -- Michael
On Tue, Feb 9, 2016 at 10:32 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Feb 3, 2016 at 7:33 AM, Robert Haas wrote:
>> Also, to be frank, I think we ought to be putting more effort into
>> another patch in this same area, specifically Thomas Munro's causal
>> reads patch. I think a lot of people today are trying to use
>> synchronous replication to build load-balancing clusters and avoid the
>> problem where you write some data and then read back stale data from a
>> standby server. Of course, our current synchronous replication
>> facilities make no such guarantees - his patch does, and I think
>> that's pretty important. I'm not saying that we shouldn't do this
>> too, of course.
>
> Yeah, sure. Each one of those patches is trying to solve a different
> problem where Postgres is deficient, here we'd like to be sure a
> commit WAL record is correctly flushed on multiple standbys, while the
> patch of Thomas is trying to ensure that there is no need to scan for
> the replay position of a standby using some GUC parameters and a
> validation/sanity layer in syncrep.c to do that. Surely the patch of
> this thread has got more attention than Thomas', and both of them have
> merits and try to address real problems. FWIW, the patch of Thomas is
> a topic that I find rather interesting, and I am planning to look at
> it as well, perhaps for next CF or even before that. We'll see how
> other things move on.

Attached is the first version of the dedicated language patch (the documentation patch is not ready yet). This patch supports only a 1-level priority method, but the feature will later be extended with a quorum method and more than one level of nesting, so the patch is implemented with that extensibility in mind. I've also implemented the new system view we discussed on this thread, but that feature is not included in this patch (because it's not necessary yet).

== Syntax ==
s_s_names can have two types of syntax, as follows:
1. s_s_names = 'node1, node2, node3'
2. s_s_names = '2[node1, node2, node3]'

The #1 syntax is for backward compatibility, and implies that the master server waits for only 1 standby. The #2 syntax is the new syntax using the dedicated language. In the #2 setting above, the node1 standby has the lowest priority value (i.e., the most preferred) and node3 the highest, and the master server will wait at COMMIT until at least the 2 standbys with the lowest priority values have sent an ACK to the master.

== Memory Structure ==
Previously, the master server kept the value of s_s_names as a string and used it when determining standby priorities. This patch changes that so that the master has a new memory structure (called SyncGroupNode) in order to handle multiple (and, in the future, nested) standby nodes flexibly. All information in SyncGroupNode is set while parsing s_s_names. The memory structure is:

struct SyncGroupNode
{
    /* Common information */
    int     type;
    char    *name;
    SyncGroupNode   *next;      /* next name node in the same group */

    /* For group node */
    int     sync_method;        /* priority */
    int     wait_num;
    SyncGroupNode   *member;    /* members of this group */
    bool    (*SyncRepGetSyncedLsnsFn) (SyncGroupNode *group,
                                       XLogRecPtr *write_pos,
                                       XLogRecPtr *flush_pos);
    int     (*SyncRepGetSyncStandbysFn) (SyncGroupNode *group,
                                         int *list);
};

A SyncGroupNode can be one of two types: a name node or a group node. It has a pointer to the next name/group node in the same group, and a list of group members. A name node represents a synchronous standby. A group node represents a group of name nodes; it has the list of group members, the synchronous method, and the number of nodes to wait for. The member list is a singly linked list, ordered as in the s_s_names definition. E.g., in the case of the #2 setting above, the member list could be:

"main".member -> "node1".next -> "node2".next -> "node3".next -> NULL

The top-level node is always the "main" group node. I.e., in this version of the patch, only one group (the "main" group) is created, which has some name nodes (not group nodes).

A group node has two function pointers:

* SyncRepGetSyncedLsnsFn
This function decides the group's write/flush LSNs at that moment. For example, in the case of the priority method, the lowest LSNs of the standbys that are considered synchronous should be selected. If there are not enough synchronous standbys to decide the LSNs, this function returns false.

* SyncRepGetSyncStandbysFn
This function obtains an array of walsnd positions of the standby members that are considered synchronous.

This implementation might not be good for some reasons, so please give me feedback. And I will create a new commitfest entry for this patch in CF5.

Regards,

--
Masahiko Sawada
Attachment
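To make the role of SyncRepGetSyncedLsnsFn in the priority method concrete, here is a hedged stand-alone sketch. The types and the helper name get_synced_flush_lsn are simplifications invented for illustration; the real patch walks the walsender array under SyncRepLock rather than a plain list:

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

typedef uint64_t XLogRecPtr;            /* stand-in for PostgreSQL's type */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Simplified name node: priority order is list order, as in the patch. */
typedef struct NameNode
{
    const char *name;
    int         active;     /* is a walsender with this name connected? */
    XLogRecPtr  flush;      /* standby's reported flush position */
    struct NameNode *next;
} NameNode;

/*
 * Hypothetical sketch of the priority-method SyncRepGetSyncedLsnsFn:
 * walk the member list in priority order, take the first wait_num
 * active standbys, and return the lowest flush LSN among them.
 * Returns 0 (false) when fewer than wait_num standbys are active,
 * matching the "not enough synchronous standbys" case described above.
 */
int
get_synced_flush_lsn(NameNode *members, int wait_num, XLogRecPtr *flush_pos)
{
    int         num = 0;
    XLogRecPtr  min_flush = InvalidXLogRecPtr;
    NameNode   *n;

    for (n = members; n != NULL && num < wait_num; n = n->next)
    {
        if (!n->active)
            continue;
        if (num == 0 || n->flush < min_flush)
            min_flush = n->flush;
        num++;
    }
    if (num < wait_num)         /* not enough synchronous standbys */
        return 0;
    *flush_pos = min_flush;
    return 1;
}
```

The write-LSN side would follow the same pattern; a real implementation also has to skip walsenders whose reported positions are still invalid.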
On Wed, Feb 10, 2016 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Tue, Feb 9, 2016 at 10:32 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Wed, Feb 3, 2016 at 7:33 AM, Robert Haas wrote: >>> Also, to be frank, I think we ought to be putting more effort into >>> another patch in this same area, specifically Thomas Munro's causal >>> reads patch. I think a lot of people today are trying to use >>> synchronous replication to build load-balancing clusters and avoid the >>> problem where you write some data and then read back stale data from a >>> standby server. Of course, our current synchronous replication >>> facilities make no such guarantees - his patch does, and I think >>> that's pretty important. I'm not saying that we shouldn't do this >>> too, of course. >> >> Yeah, sure. Each one of those patches is trying to solve a different >> problem where Postgres is deficient, here we'd like to be sure a >> commit WAL record is correctly flushed on multiple standbys, while the >> patch of Thomas is trying to ensure that there is no need to scan for >> the replay position of a standby using some GUC parameters and a >> validation/sanity layer in syncrep.c to do that. Surely the patch of >> this thread has got more attention than Thomas', and both of them have >> merits and try to address real problems. FWIW, the patch of Thomas is >> a topic that I find rather interesting, and I am planning to look at >> it as well, perhaps for next CF or even before that. We'll see how >> other things move on. > > Attached first version dedicated language patch (document patch is not yet.) Thanks for the patch! Will review it. I think that it's time to write the documentation patch. Though I've not read the patch yet, I found that your patch changed s_s_names so that it rejects non-alphabet character like *, according to my simple test. It should accept any application_name which we can use. Regards, -- Fujii Masao
On Wed, Feb 10, 2016 at 2:57 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Feb 10, 2016 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Attached first version dedicated language patch (document patch is not yet.) > > Thanks for the patch! Will review it. > > I think that it's time to write the documentation patch. > > Though I've not read the patch yet, I found that your patch > changed s_s_names so that it rejects non-alphabet character > like *, according to my simple test. It should accept any > application_name which we can use. Cool. Planning to look at it as well. Could you as well submit a regression test based on the recovery infrastructure and submit it as a separate patch? There is a version upthread of such a test but it would be good to extract it properly. -- Michael
Hello,

At Tue, 9 Feb 2016 13:31:46 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSJgDLLsVk_Et-O=NBfJNqx3GbHszCYGvuTLRxHaZV3xQ@mail.gmail.com>
> On Tue, Feb 9, 2016 at 1:16 PM, Kyotaro HORIGUCHI
> >> > Anyway that's not a small project, and perhaps I am over-complicating
> >> > the whole thing.
> >> >
> >> > Thoughts?
> >>
> >> I agree that we would need something like such new view in the future,
> >> however it seems too late to work on that for 9.6 unfortunately.
> >> There is only one CommitFest left. Let's focus on very simple case, i.e.,
> >> 1-level priority list, now, then we can extend it to cover other cases.
> >>
> >> If we can commit the simple version too early and there is enough
> >> time before the date of feature freeze, of course I'm happy to review
> >> the extended version like you proposed, for 9.6.
> >
> > I agree to Fujii-san. There would be many of convenient gadgets
> > around this and they are completely welcome, but having
> > fundamental functionality in 9.6 seems to be far benetifical for
> > most of us.
>
> Hm. Rushing features in because we need them now is not really
> community-like. I'd rather not have us taking decisions like that
> knowing that we may pay a certain price in the long-term, while it
> pays in the short term, aka the 9.6 release. However, having a base in
> place for the mini-language would give enough room for future
> improvements, so I am fine with having only 1-level of nesting, with
> {} and [] supported. This can as well be simply represented within
> pg_stat_replication because we'd have basically only one group of
> nodes for now (if I got the idea correctly), and the status of each
> entry in pg_stat_replication would just need to reflect either
> potential or sync, which is something that users are now used to.

I agree that we should be more prudent about 'stiff', hard-to-modify-later things. But once we decide to use the []{} format at the beginning (I believe) for this feature, it is surely extensible enough, and 1 level of replication sets is sufficient to cover many new cases while keeping the implementation simple. The internal structure can be evolutionary, in contrast to its user interface. I don't think such a way of development is against the community style in cases like this.

Anyway, thank you very much for understanding.

> So, if I got the vibe correctly, we would basically just allow that in
> a first shot:
> N{node_list}, to define a priority group
> N[node_list], to define a quorum group
> There can be only one group, and elements in a node list cannot be a
> group. No need of group names either.
> --

That's quite reasonable for the first release of this feature. We can/should consider the extensibility of the implementation of this feature through reviewing.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Feb 10, 2016 at 9:18 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Feb 10, 2016 at 2:57 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Wed, Feb 10, 2016 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Attached first version dedicated language patch (document patch is not yet.)
>>
>> Thanks for the patch! Will review it.
>>
>> I think that it's time to write the documentation patch.
>>
>> Though I've not read the patch yet, I found that your patch
>> changed s_s_names so that it rejects non-alphabet character
>> like *, according to my simple test. It should accept any
>> application_name which we can use.
>
> Cool. Planning to look at it as well. Could you as well submit a
> regression test based on the recovery infrastructure and submit it as
> a separate patch? There is a version upthread of such a test but it
> would be good to extract it properly.

Yes, I will implement the regression test patch and documentation patch as well.

Attached is the latest version of the patch, supporting s_s_names = '*'. Unlike the current behaviour, s_s_names can have only one '*' character. E.g., the following settings will cause a syntax error:

s_s_names = '*, node1, node2'
s_s_names = '2[node1, *, node2]'

When we use the '*' character as an s_s_names element, we must set s_s_names as follows:

s_s_names = '*'
s_s_names = '2[*]'

BTW, we've discussed the mini-language syntax. IIRC, the syntax uses [] and () like:
'N[node1, node2, ...]', to define priority standbys.
'N(node1, node2, ...)', to define quorum standbys.
And the current patch behaves so.

Which type of parentheses should be used for this syntax to be clearer? Or should other characters be used, such as <> or //?

Regards,

--
Masahiko Sawada
Attachment
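The '*' placement rule described above (valid only as the sole element of a list) can be sketched as a trivial validation pass. star_list_valid is a hypothetical helper invented here, not part of the patch:

```c
#include <string.h>
#include <assert.h>

/*
 * Hypothetical sketch of the rule above: '*' is only accepted when it
 * is the sole element, so '*' and '2[*]' would be valid while
 * '*, node1, node2' and '2[node1, *, node2]' would be rejected.
 * Returns 1 when the element list is acceptable, 0 otherwise.
 */
int
star_list_valid(const char *elems[], int nelems)
{
    int i;

    for (i = 0; i < nelems; i++)
        if (strcmp(elems[i], "*") == 0 && nelems > 1)
            return 0;       /* '*' mixed with named standbys */
    return 1;
}
```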
Hello, At Wed, 10 Feb 2016 02:57:54 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHR1MNpAgRMh9T0oy0OnydkGaymcNgVOE-1VLZ8Z9twjA@mail.gmail.com> > On Wed, Feb 10, 2016 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Feb 9, 2016 at 10:32 PM, Michael Paquier > > <michael.paquier@gmail.com> wrote: > >> On Wed, Feb 3, 2016 at 7:33 AM, Robert Haas wrote: > >>> Also, to be frank, I think we ought to be putting more effort into > >>> another patch in this same area, specifically Thomas Munro's causal > >>> reads patch. I think a lot of people today are trying to use > >>> synchronous replication to build load-balancing clusters and avoid the > >>> problem where you write some data and then read back stale data from a > >>> standby server. Of course, our current synchronous replication > >>> facilities make no such guarantees - his patch does, and I think > >>> that's pretty important. I'm not saying that we shouldn't do this > >>> too, of course. > >> > >> Yeah, sure. Each one of those patches is trying to solve a different > >> problem where Postgres is deficient, here we'd like to be sure a > >> commit WAL record is correctly flushed on multiple standbys, while the > >> patch of Thomas is trying to ensure that there is no need to scan for > >> the replay position of a standby using some GUC parameters and a > >> validation/sanity layer in syncrep.c to do that. Surely the patch of > >> this thread has got more attention than Thomas', and both of them have > >> merits and try to address real problems. FWIW, the patch of Thomas is > >> a topic that I find rather interesting, and I am planning to look at > >> it as well, perhaps for next CF or even before that. We'll see how > >> other things move on. > > > > Attached first version dedicated language patch (document patch is not yet.) > > Thanks for the patch! Will review it. > > I think that it's time to write the documentation patch. 
> Though I've not read the patch yet, I found that your patch
> changed s_s_names so that it rejects non-alphabet character
> like *, according to my simple test. It should accept any
> application_name which we can use.

Thanks for the quick response. At a glance, I'd like to show you some random suggestions, mainly on writing conventions.

===
Running postgresql with s_s_names = '*' raises an error, as Fujii-san said, and it yields the following message:

| $ postgres
| FATAL: syntax error: unexpected character "*"

Mmm.. it would be tough to find out what has happened from this.

===
check_synchronous_standby_names frees the parsed SyncRepStandbyNames immediately, but no reason is explained there. The following comment looks to be saying something related to this, but it doesn't explain the reason to free:

+ /*
+  * Any additional validation of standby names should go here.
+  *
+  * Don't attempt to set WALSender priority because this is executed by
+  * postmaster at startup, not WALSender, so the application_name is not
+  * yet correctly set.
+  */

In addition to that, I'd like to see a description like 'syncgroup_yyparse sets the global SyncRepStandbyNames as a side effect' around it.

===
malloc/free are used in create_name_node and other functions used in the scanner, but syncgroup_gram.y is said to use palloc/pfree. Maybe they should use the same memory allocation/freeing functions.

===
The variable name SyncRepStandbyNames holds the list of SyncGroupNode*. This is somewhat confusing. How about SyncRepStandbys?

===
+static void
+SyncRepClearStandbyGroupList(SyncGroupNode *group)
+{
+    SyncGroupNode *n = group->member;

The name 'n' is a bit confusing. I believe one-letter variables should be used, following the implicit (and ancient?) convention, only in pretty short-lived and obvious cases. name, or group_name, might be better instead. There is similar usage of 'n' in other places.

===
+ * Find active walsender position of WalSnd by name. Returns index of walsnds
+ * array if found, otherwise return -1.

I didn't get what 'walsender position' means within this comment. And, per the discussion upthread, there can be multiple walsenders with the same name. So this might be like this:

> * Finds the first active synchronous walsender with given name
> * in WalSndCtl->walsnds and returns the index of that. Returns
> * -1 if not found.

===
+ * Get both synced LSNS: write and flush, using its group function and check
+ * whether each LSN has advanced to, or not.

This is a question for all: which should we use, synced, synched or synchronized? Maybe we should use non-abbreviated spellings unless the description becomes too long to read comfortably.

> * Return true if we have enough synchronized standbys and the 'safe'
> * written and flushed LSNs, which are LSNs assured in all standbys
> * considered should be synchronized.
# Please rewrite me.

===
+SyncRepSyncedLsnAdvancedTo(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
+{
+    XLogRecPtr cur_write_pos;
+    XLogRecPtr cur_flush_pos;
+    bool ret;

The names cur_*_pos are a bit confusing. They hold the LSNs reached by all of the standbys chosen as synchronized ones, so how about safe_*_pos? And 'ret' is not the return value of this function, so it could have a more specific name, such as... satisfied? or else..

===
+SyncRepSyncedLsnAdvancedTo(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
...
+    /* Check whether each LSN has advanced to */
+    if (ret)
+    {
...
+        return true;
+    }
+
+    return false;

This might be a matter of taste, but it would be simpler written with the reverse condition.

===
+ SyncRepSyncedLsnAdvancedTo(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
...
+    ret = SyncRepStandbyNames->SyncRepGetSyncedLsnsFn(SyncRepStandbyNames,
+                                                      &cur_write_pos,
+                                                      &cur_flush_pos);
...
+    if (MyWalSnd->write >= cur_write_pos)

I suppose SyncRepGetSyncedLsnsFn, or SyncRepGetSyncedLsnsPriority, can return InvalidXLogRecPtr as cur_*_pos even when it returns true. And I suppose comparison of LSN values with InvalidXLogRecPtr is not well-defined. Anyway, the condition goes wrong when cur_write_pos = InvalidXLogRecPtr (but ret = true).

===
+ * Obtain a array containing positions of standbys of specified group
+ * currently considered as synchronous up to wait_num of its group.
+ * Caller is respnsible for allocating the data obtained.

# Anyone, please re-edit my rewrite below.. Perhaps my writing is
# quite unreadable..

> * Return the positions of the first group->wait_num
> * synchronized standbys in group->member list into
> * sync_list. sync_list is assumed to have enough space for
> * at least group->wait_num elements.

===
+bool
+SyncRepGetSyncedLsnsPriority(SyncGroupNode *group, XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
+{
...
+    for(n = group->member; n != NULL; n = n->next)

group->member holds two or more items, so the name would be better as group->members, or member_list.

===
+    /* We already got enough synchronous standbys, return */
+    if (num == group->wait_num)

As a convention for safety, this kind of comparison should use an inequality operator:

> if (num >= group->wait_num)

===
At a glance, SyncRepGetSyncedLsnsPriority and SyncRepGetSyncStandbysPriority do almost the same thing, and both run loops over the group members. Couldn't they run at once?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello,

At Wed, 10 Feb 2016 11:25:49 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCHytB88ZdC0899J7PLNTKWTg0gczC2M7dqLmK71vdY0w@mail.gmail.com>
> On Wed, Feb 10, 2016 at 9:18 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Wed, Feb 10, 2016 at 2:57 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> >> On Wed, Feb 10, 2016 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >>> Attached first version dedicated language patch (document patch is not yet.)
> >>
> >> Thanks for the patch! Will review it.
> >>
> >> I think that it's time to write the documentation patch.
> >>
> >> Though I've not read the patch yet, I found that your patch
> >> changed s_s_names so that it rejects non-alphabet character
> >> like *, according to my simple test. It should accept any
> >> application_name which we can use.
> >
> > Cool. Planning to look at it as well. Could you as well submit a
> > regression test based on the recovery infrastructure and submit it as
> > a separate patch? There is a version upthread of such a test but it
> > would be good to extract it properly.
>
> Yes, I will implement regression test patch and documentation patch as well.
>
> Attached latest version patch supporting s_s_names = '*'.
> Unlike currently behaviour a bit, s_s_names can have only one '*' character.
> e.g, The following setting will get syntax error.
>
> s_s_names = '*, node1,node2'
> s_s_names = `2[node1, *, node2]`

We could use the setting s_s_names = 'node1, node2, *' as an extended representation of the old s_s_names. It tries node1 and node2 first, and accepts any other name if they are not available. Similarly, '2[node1, node2, *]' is also meaningful.

> when we use '*' character as s_s_names element, we must set s_s_names
> like follows.
>
> s_s_names = '*'
> s_s_names = '2[*]'
>
> BTW, we've discussed about mini language syntax.
> IIRC, the syntax uses [] and () like,
> 'N[node1, node2, ...]', to define priority standbys.
> 'N(node1, node2, ...)', to define quorum standbys.
> And current patch behaves so.
>
> Which type of parentheses should be used for this syntax to be more clarity?
> Or other character should be used such as <>, // ?

I believe that [] and {} have been used with no particular distinction so far. I think a symmetrical pair of characters is preferable for readability. Candidate pairs in ASCII characters are (), {}, [] and <>. {} might be a bit difficult to distinguish from [] on unclear consoles :p

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
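The extended representation suggested here (a trailing '*' as catch-all) could behave like the following sketch. standby_priority is an invented name that only illustrates the proposed matching order; it is not code from the patch:

```c
#include <string.h>
#include <assert.h>

/*
 * Hypothetical sketch of the suggestion above: in a priority list,
 * '*' acts as a catch-all entry at its position, so 'node1, node2, *'
 * assigns priority 1 and 2 to node1/node2 and priority 3 to any other
 * standby.  Returns the 1-based priority, or 0 if no entry matches
 * (only possible when the list has no '*').
 */
int
standby_priority(const char *list[], int n, const char *app_name)
{
    int i;

    for (i = 0; i < n; i++)
        if (strcmp(list[i], "*") == 0 || strcmp(list[i], app_name) == 0)
            return i + 1;
    return 0;
}
```

Under this reading, the plain '*' of today's s_s_names is just the one-element case of the same rule.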
On Wed, Feb 10, 2016 at 11:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Yes, I will implement regression test patch and documentation patch as well.
Cool, now that we have a clear picture of where we want to move, that would be an excellent thing to have. Having the docs in place is clearly mandatory.
> Attached latest version patch supporting s_s_names = '*'.
> Unlike currently behaviour a bit, s_s_names can have only one '*' character.
> e.g, The following setting will get syntax error.
>
> s_s_names = '*, node1,node2'
> s_s_names = `2[node1, *, node2]`
>
> when we use '*' character as s_s_names element, we must set s_s_names
> like follows.
>
> s_s_names = '*'
> s_s_names = '2[*]'
>
> BTW, we've discussed about mini language syntax.
> IIRC, the syntax uses [] and () like,
> 'N[node1, node2, ...]', to define priority standbys.
> 'N(node1, node2, ...)', to define quorum standbys.
> And current patch behaves so.
>
> Which type of parentheses should be used for this syntax to be more clarity?
> Or other character should be used such as <>, // ?
I am personally fine with () and [] as you mention; we could even consider {}. Each one of them has a different meaning mathematically...
I am not entered into a detailed review yet (waiting for the docs), but the patch looks brittle. I have been able to crash the server just by querying pg_stat_replication:
* thread #1: tid = 0x0000, 0x0000000105eb36c2 postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at walsender.c:2783, stop reason = signal SIGSTOP
* frame #0: 0x0000000105eb36c2 postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at walsender.c:2783
frame #1: 0x0000000105d4277d postgres`ExecMakeTableFunctionResult(funcexpr=0x00007fea128f3838, econtext=0x00007fea128f1b58, argContext=0x00007fea128c8ea8, expectedDesc=0x00007fea128f4710, randomAccess='\0') + 1005 at execQual.c:2211
frame #2: 0x0000000105d70c24 postgres`FunctionNext(node=0x00007fea128f2f78) + 180 at nodeFunctionscan.c:95
* thread #1: tid = 0x0000, 0x0000000105eb36c2 postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at walsender.c:2783, stop reason = signal SIGSTOP
frame #0: 0x0000000105eb36c2 postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at walsender.c:2783
2780 /*
2781 * Get the currently active synchronous standby.
2782 */
-> 2783 sync_standbys = (int *) palloc(sizeof(int) * SyncRepStandbyNames->wait_num);
2784 LWLockAcquire(SyncRepLock, LW_SHARED);
2785 num_sync = SyncRepGetSyncStandbysPriority(SyncRepStandbyNames, sync_standbys);
2786 LWLockRelease(SyncRepLock);
(lldb) p SyncRepStandbyNames
(SyncGroupNode *) $0 = 0x0000000000000000
+sync_node_group:
+ sync_list { $$ = create_group_node(1, $1); }
+ | sync_element_ast { $$ = create_group_node(1, $1);}
+ | INT '[' sync_list ']' { $$ = create_group_node($1, $3);}
+ | INT '[' sync_element_ast ']' { $$ = create_group_node($1, $3); }
We may want to be careful with the use of '[' in application_name. I am not much thrilled with forbidding the use of []() in application_name, so we may want to recommend user to use a backslash when using s_s_names when a group is defined.
+void
+yyerror(const char *message)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg_internal("%s", message)));
+}
whitespace errors here.
Michael
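The crash above comes from dereferencing SyncRepStandbyNames when it is NULL, i.e. when synchronous_standby_names is empty. A possible shape of the missing guard, reduced to a stand-alone sketch with invented names (count_sync_standbys, a trimmed-down SyncGroupNode), could be:

```c
#include <stdlib.h>
#include <assert.h>

/* Trimmed stand-in for the patch's group node: only wait_num is kept. */
typedef struct SyncGroupNode
{
    int wait_num;
} SyncGroupNode;

/*
 * Hypothetical sketch of the guard pg_stat_get_wal_senders needs:
 * when synchronous_standby_names is empty the parsed tree is NULL, so
 * the view function must not dereference it.  Returns the number of
 * sync slots allocated (0 when replication is asynchronous), with the
 * allocated array in *sync_list.
 */
int
count_sync_standbys(SyncGroupNode *standbys, int **sync_list)
{
    *sync_list = NULL;
    if (standbys == NULL)       /* s_s_names = '': nothing to wait for */
        return 0;

    *sync_list = malloc(sizeof(int) * standbys->wait_num);
    /* ... the real code fills this under SyncRepLock ... */
    return standbys->wait_num;
}
```

The same NULL check would be needed anywhere else the parsed tree is consulted outside of the syncrep wait path.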
On Wed, Feb 10, 2016 at 3:13 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Feb 10, 2016 at 11:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: > I am personally fine with () and [] as you mention, we could even consider > {}, each one of them has a different meaning mathematically.. > > I am not entered into a detailed review yet (waiting for the docs), but the > patch looks brittle. I have been able to crash the server just by querying > pg_stat_replication: > * thread #1: tid = 0x0000, 0x0000000105eb36c2 > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at > walsender.c:2783, stop reason = signal SIGSTOP > * frame #0: 0x0000000105eb36c2 > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at > walsender.c:2783 > frame #1: 0x0000000105d4277d > postgres`ExecMakeTableFunctionResult(funcexpr=0x00007fea128f3838, > econtext=0x00007fea128f1b58, argContext=0x00007fea128c8ea8, > expectedDesc=0x00007fea128f4710, randomAccess='\0') + 1005 at > execQual.c:2211 > frame #2: 0x0000000105d70c24 > postgres`FunctionNext(node=0x00007fea128f2f78) + 180 at > nodeFunctionscan.c:95 > * thread #1: tid = 0x0000, 0x0000000105eb36c2 > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at > walsender.c:2783, stop reason = signal SIGSTOP > frame #0: 0x0000000105eb36c2 > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at > walsender.c:2783 > 2780 /* > 2781 * Get the currently active synchronous standby. 
> 2782 */ > -> 2783 sync_standbys = (int *) palloc(sizeof(int) * > SyncRepStandbyNames->wait_num); > 2784 LWLockAcquire(SyncRepLock, LW_SHARED); > 2785 num_sync = > SyncRepGetSyncStandbysPriority(SyncRepStandbyNames, sync_standbys); > 2786 LWLockRelease(SyncRepLock); > (lldb) p SyncRepStandbyNames > (SyncGroupNode *) $0 = 0x0000000000000000 > > +sync_node_group: > + sync_list { $$ = create_group_node(1, $1); > } > + | sync_element_ast { $$ = create_group_node(1, > $1);} > + | INT '[' sync_list ']' { $$ = create_group_node($1, > $3);} > + | INT '[' sync_element_ast ']' { $$ = create_group_node($1, > $3); } > We may want to be careful with the use of '[' in application_name. I am not > much thrilled with forbidding the use of []() in application_name, so we may > want to recommend user to use a backslash when using s_s_names when a group > is defined. > > +void > +yyerror(const char *message) > +{ > + ereport(ERROR, > + (errcode(ERRCODE_SYNTAX_ERROR), > + errmsg_internal("%s", message))); > +} > whitespace errors here. +#define MAX_WALSENDER_NAME 8192 +typedef enum WalSndState{ WALSNDSTATE_STARTUP = 0, @@ -62,6 +64,11 @@ typedef struct WalSnd * SyncRepLock. */ int sync_standby_priority; + + /* + * Corresponding standby's application_name. + */ + const char name[MAX_WALSENDER_NAME];} WalSnd; NAMEDATALEN instead? -- Michael
Hello, At Wed, 10 Feb 2016 15:22:44 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqRk4ZjoQfs4rmF6Di1zp=b4eA=hk0L4GFzUj47GwhgM7g@mail.gmail.com> > On Wed, Feb 10, 2016 at 3:13 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: > > On Wed, Feb 10, 2016 at 11:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> > > wrote: > > I am personally fine with () and [] as you mention, we could even consider > > {}, each one of them has a different meaning mathematically.. > > > > I am not entered into a detailed review yet (waiting for the docs), but the > > patch looks brittle. I have been able to crash the server just by querying > > pg_stat_replication: > > * thread #1: tid = 0x0000, 0x0000000105eb36c2 > > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at > > walsender.c:2783, stop reason = signal SIGSTOP > > * frame #0: 0x0000000105eb36c2 > > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at > > walsender.c:2783 > > frame #1: 0x0000000105d4277d > > postgres`ExecMakeTableFunctionResult(funcexpr=0x00007fea128f3838, > > econtext=0x00007fea128f1b58, argContext=0x00007fea128c8ea8, > > expectedDesc=0x00007fea128f4710, randomAccess='\0') + 1005 at > > execQual.c:2211 > > frame #2: 0x0000000105d70c24 > > postgres`FunctionNext(node=0x00007fea128f2f78) + 180 at > > nodeFunctionscan.c:95 > > * thread #1: tid = 0x0000, 0x0000000105eb36c2 > > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at > > walsender.c:2783, stop reason = signal SIGSTOP > > frame #0: 0x0000000105eb36c2 > > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at > > walsender.c:2783 > > 2780 /* > > 2781 * Get the currently active synchronous standby. 
> > 2782 */ > > -> 2783 sync_standbys = (int *) palloc(sizeof(int) * > > SyncRepStandbyNames->wait_num); > > 2784 LWLockAcquire(SyncRepLock, LW_SHARED); > > 2785 num_sync = > > SyncRepGetSyncStandbysPriority(SyncRepStandbyNames, sync_standbys); > > 2786 LWLockRelease(SyncRepLock); > > (lldb) p SyncRepStandbyNames > > (SyncGroupNode *) $0 = 0x0000000000000000 > > > > +sync_node_group: > > + sync_list { $$ = create_group_node(1, $1); > > } > > + | sync_element_ast { $$ = create_group_node(1, > > $1);} > > + | INT '[' sync_list ']' { $$ = create_group_node($1, > > $3);} > > + | INT '[' sync_element_ast ']' { $$ = create_group_node($1, > > $3); } > > We may want to be careful with the use of '[' in application_name. I am not > > much thrilled with forbidding the use of []() in application_name, so we may > > want to recommend user to use a backslash when using s_s_names when a group > > is defined. Mmmm. I found that application_name can contain commas. Furthermore, there seems to be no limitation for character in the name. postgres=# set application_name='ho,ge'; postgres=# select application_name from pg_stat_activity;application_name ------------------ho,ge check_application_name() allows all characters in the range between 32 to 126 in ascii. All other characters are replaced with '?'. > > +void > > +yyerror(const char *message) > > +{ > > + ereport(ERROR, > > + (errcode(ERRCODE_SYNTAX_ERROR), > > + errmsg_internal("%s", message))); > > +} > > whitespace errors here. > > +#define MAX_WALSENDER_NAME 8192 > + > typedef enum WalSndState > { > WALSNDSTATE_STARTUP = 0, > @@ -62,6 +64,11 @@ typedef struct WalSnd > * SyncRepLock. > */ > int sync_standby_priority; > + > + /* > + * Corresponding standby's application_name. > + */ > + const char name[MAX_WALSENDER_NAME]; > } WalSnd; > NAMEDATALEN instead? -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Feb 10, 2016 at 5:34 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello, > > At Wed, 10 Feb 2016 15:22:44 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqRk4ZjoQfs4rmF6Di1zp=b4eA=hk0L4GFzUj47GwhgM7g@mail.gmail.com> > > On Wed, Feb 10, 2016 at 3:13 PM, Michael Paquier > > <michael.paquier@gmail.com> wrote: > > > On Wed, Feb 10, 2016 at 11:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> > > > wrote: > > > I am personally fine with () and [] as you mention, we could even consider > > > {}, each one of them has a different meaning mathematically.. > > > > > > I am not entered into a detailed review yet (waiting for the docs), but the > > > patch looks brittle. I have been able to crash the server just by querying > > > pg_stat_replication: > > > * thread #1: tid = 0x0000, 0x0000000105eb36c2 > > > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at > > > walsender.c:2783, stop reason = signal SIGSTOP > > > * frame #0: 0x0000000105eb36c2 > > > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at > > > walsender.c:2783 > > > frame #1: 0x0000000105d4277d > > > postgres`ExecMakeTableFunctionResult(funcexpr=0x00007fea128f3838, > > > econtext=0x00007fea128f1b58, argContext=0x00007fea128c8ea8, > > > expectedDesc=0x00007fea128f4710, randomAccess='\0') + 1005 at > > > execQual.c:2211 > > > frame #2: 0x0000000105d70c24 > > > postgres`FunctionNext(node=0x00007fea128f2f78) + 180 at > > > nodeFunctionscan.c:95 > > > * thread #1: tid = 0x0000, 0x0000000105eb36c2 > > > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at > > > walsender.c:2783, stop reason = signal SIGSTOP > > > frame #0: 0x0000000105eb36c2 > > > postgres`pg_stat_get_wal_senders(fcinfo=0x00007fff5a156290) + 498 at > > > walsender.c:2783 > > > 2780 /* > > > 2781 * Get the currently active synchronous standby. 
> > > 2782 */ > > > -> 2783 sync_standbys = (int *) palloc(sizeof(int) * > > > SyncRepStandbyNames->wait_num); > > > 2784 LWLockAcquire(SyncRepLock, LW_SHARED); > > > 2785 num_sync = > > > SyncRepGetSyncStandbysPriority(SyncRepStandbyNames, sync_standbys); > > > 2786 LWLockRelease(SyncRepLock); > > > (lldb) p SyncRepStandbyNames > > > (SyncGroupNode *) $0 = 0x0000000000000000 > > > > > > +sync_node_group: > > > + sync_list { $$ = create_group_node(1, $1); > > > } > > > + | sync_element_ast { $$ = create_group_node(1, > > > $1);} > > > + | INT '[' sync_list ']' { $$ = create_group_node($1, > > > $3);} > > > + | INT '[' sync_element_ast ']' { $$ = create_group_node($1, > > > $3); } > > > We may want to be careful with the use of '[' in application_name. I am not > > > much thrilled with forbidding the use of []() in application_name, so we may > > > want to recommend user to use a backslash when using s_s_names when a group > > > is defined. > > Mmmm. I found that application_name can contain > commas. Furthermore, there seems to be no limitation for > character in the name. > > postgres=# set application_name='ho,ge'; > postgres=# select application_name from pg_stat_activity; > application_name > ------------------ > ho,ge > > check_application_name() allows all characters in the range > between 32 to 126 in ascii. All other characters are replaced > with '?'. Actually I was thinking about that a couple of hours ago. If the application_name of a node has a comma, it cannot become a sync replica, no? Wouldn't we need a special handling in s_s_names like '\,' make a comma part of an application name? Or just ban commas from the list of supported characters in the application name? -- Michael
On Fri, Feb 5, 2016 at 3:36 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > So, here are some thoughts to make that more user-friendly. I think > that the critical issue here is to properly flatten the meta data in > the custom language and represent it properly in a new catalog, > without messing up too much with the existing pg_stat_replication that > people are now used to for 5 releases since 9.0. Putting the metadata in a catalog doesn't seem great because that only can ever work on the master. Maybe there's no need to configure this on the slaves and therefore it's OK, but I feel nervous about putting cluster configuration in catalogs. Another reason for that is that if synchronous replication is broken, then you need a way to change the catalog, which involves committing a write transaction; there's a danger that your efforts to do this will be tripped up by the broken synchronous replication configuration. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 12, 2016 at 2:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Feb 5, 2016 at 3:36 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> So, here are some thoughts to make that more user-friendly. I think >> that the critical issue here is to properly flatten the meta data in >> the custom language and represent it properly in a new catalog, >> without messing up too much with the existing pg_stat_replication that >> people are now used to for 5 releases since 9.0. > > Putting the metadata in a catalog doesn't seem great because that only > can ever work on the master. Maybe there's no need to configure this > on the slaves and therefore it's OK, but I feel nervous about putting > cluster configuration in catalogs. Another reason for that is that if > synchronous replication is broken, then you need a way to change the > catalog, which involves committing a write transaction; there's a > danger that your efforts to do this will be tripped up by the broken > synchronous replication configuration. I was referring to a catalog view that parses the information related to groups of s_s_names in a flattened way to show each group sync status. Perhaps my words should have been clearer. -- Michael
On Thu, Feb 11, 2016 at 5:40 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Feb 12, 2016 at 2:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Feb 5, 2016 at 3:36 AM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> So, here are some thoughts to make that more user-friendly. I think >>> that the critical issue here is to properly flatten the meta data in >>> the custom language and represent it properly in a new catalog, >>> without messing up too much with the existing pg_stat_replication that >>> people are now used to for 5 releases since 9.0. >> >> Putting the metadata in a catalog doesn't seem great because that only >> can ever work on the master. Maybe there's no need to configure this >> on the slaves and therefore it's OK, but I feel nervous about putting >> cluster configuration in catalogs. Another reason for that is that if >> synchronous replication is broken, then you need a way to change the >> catalog, which involves committing a write transaction; there's a >> danger that your efforts to do this will be tripped up by the broken >> synchronous replication configuration. > > I was referring to a catalog view that parses the information related > to groups of s_s_names in a flattened way to show each group sync > status. Perhaps my words should have been clearer. Ah. Well, that's different, then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello,

At Wed, 10 Feb 2016 18:36:43 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqTHmuuDdKWmoaY1ZAi-gRnT_HRdHGyiqpNfFFr15qc5uA@mail.gmail.com>
> On Wed, Feb 10, 2016 at 5:34 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > > +sync_node_group:
> > > > + sync_list { $$ = create_group_node(1, $1); }
> > > > + | sync_element_ast { $$ = create_group_node(1, $1); }
> > > > + | INT '[' sync_list ']' { $$ = create_group_node($1, $3); }
> > > > + | INT '[' sync_element_ast ']' { $$ = create_group_node($1, $3); }
> > > > We may want to be careful with the use of '[' in application_name. I am not
> > > > much thrilled with forbidding the use of []() in application_name, so we may
> > > > want to recommend user to use a backslash when using s_s_names when a group
> > > > is defined.
> >
> > Mmmm. I found that application_name can contain
> > commas. Furthermore, there seems to be no limitation for
> > character in the name.
> >
> > postgres=# set application_name='ho,ge';
> > postgres=# select application_name from pg_stat_activity;
> >  application_name
> > ------------------
> >  ho,ge
> >
> > check_application_name() allows all characters in the range
> > between 32 to 126 in ascii. All other characters are replaced
> > with '?'.
>
> Actually I was thinking about that a couple of hours ago. If the
> application_name of a node has a comma, it cannot become a sync
> replica, no? Wouldn't we need a special handling in s_s_names like
> '\,' make a comma part of an application name? Or just ban commas from
> the list of supported characters in the application name?

Surprisingly yes. The list is handled as an identifier list and parsed by SplitIdentifierString, thus it can accept double-quoted names.

s_s_names='abc, def, " abc,""def"'

Result list is ["abc", "def", " abc,\"def"]

Simply supporting the same notation addresses the problem and accepts strings like the following.
s_s_names='2["comma,name", "foo[bar,baz]"]'

It is currently an undocumented behavior, but I doubt an explicit mention is necessary.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Feb 15, 2016 at 2:11 PM, Kyotaro HORIGUCHI wrote:
> Surprisingly yes. The list is handled as an identifier list and
> parsed by SplitIdentifierString, thus it can accept double-quoted
> names.

Good point. I was not aware of this trick.
-- 
Michael
On Mon, Feb 15, 2016 at 2:54 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Mon, Feb 15, 2016 at 2:11 PM, Kyotaro HORIGUCHI wrote:
>> Surprisingly yes. The list is handled as an identifier list and
>> parsed by SplitIdentifierString thus it can accept double-quoted
>> names.

Attached is the latest version of the patch, which has only the feature logic so far. I'm writing the documentation patch for this feature now, so this version doesn't include the documentation and regression test patches.

> | $ postgres
> | FATAL: syntax error: unexpected character "*"
> Mmm.. It should be tough to find what has happened..

I'm trying to implement a better error message, but that change is not included in this version of the patch yet.

> malloc/free are used in create_name_node and other functions to
> be used in scanner, but syncgroup_gram.y is said to use
> palloc/pfree. Maybe they should use the same memory
> allocation/freeing functions.

With this setting, I think that we use the malloc/free functions when we allocate/free memory for the SyncRepStandbys variables. OTOH, we use the palloc/pfree functions during parsing of SyncRepStandbyString. Am I missing something?

> I suppose SyncRepGetSyncedLsnsFn, or SyncRepGetSyncedLsnsPriority
> can return InvalidXLogRecPtr as cur_*_pos even when it returns
> true. And, I suppose comparison of LSN values with
> InvalidXLogRecPtr is not well-defined. Anyway the condition goes
> wrong when cur_write_pos = InvalidXLogRecPtr (but ret = true).

In this version of the patch, it's not possible to return InvalidXLogRecPtr with got_lsns = false (was ret = false). So we can ensure that we got valid LSNs when got_lsns = true.

> At a glance, SyncRepGetSyncedLsnsPriority and
> SyncRepGetSyncStandbysPriority does almost the same thing and both
> runs loops over group members. Couldn't they run at once?

Yeah, I've optimized that logic.

> We may want to be careful with the use of '[' in application_name.
> I am not much thrilled with forbidding the use of []() in application_name, so we may > want to recommend user to use a backslash when using s_s_names when a > group is defined. > s_s_names='abc, def, " abc,""def"' > > Result list is ["abc", "def", " abc,\"def"] > > Simplly supporting the same notation addresses the problem and > accepts strings like the following. > > s_s_names='2["comma,name", "foo[bar,baz]"]' I've changed s_s_names parser so that it can handle special 4 characters (\,\ \[\]) and can handle double-quoted string accurately same as what SplitIdentifierString does. We can not use special 4 characters (\,\ \[ \]) without using double-quoted string. Also if we use "(double-quote) character in double-quoted string, we should use ""(double double-quotes). For example, if application_name = 'hoge " bar', s_s_name = '"hoge "" bar"' would be matched. Other given comments are fixed. Remaining tasks are; - Document patch. - Regression test patch. - Syntax error message for s_s_names improvement. Regards, -- Masahiko Sawada
Attachment
Hello,

>Remaining tasks are;
>- Document patch.
>- Regression test patch.
>- Syntax error message for s_s_names improvement.

Please find patch attached for regression test for multisync replication. I have created this patch over Michael's recovery-test-suite patch. Please review it.

Regards
Suraj Kharage
Attachment
On Tue, Feb 16, 2016 at 4:19 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Mon, Feb 15, 2016 at 2:54 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Mon, Feb 15, 2016 at 2:11 PM, Kyotaro HORIGUCHI wrote: >>> Surprizingly yes. The list is handled as an identifier list and >>> parsed by SplitIdentifierString thus it can accept double-quoted >>> names. >> > > Attached latest version patch which has only feature logic so far. > I'm writing document patch about this feature now, so this version > patch doesn't have document and regression test patch. Thanks for updating the patch! When I changed s_s_names to 'hoge*' and reloaded the configuration file, the server crashed unexpectedly with the following error message. This is obviously a bug. FATAL: syntax error Regards, -- Fujii Masao
Hello,

At Mon, 22 Feb 2016 22:52:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwENujogaQvcc=u0tffNfFGtwXNb1yFcphdTYCJdG1_j1A@mail.gmail.com>
> On Tue, Feb 16, 2016 at 4:19 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Mon, Feb 15, 2016 at 2:54 PM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> >> On Mon, Feb 15, 2016 at 2:11 PM, Kyotaro HORIGUCHI wrote:
> >>> Surprisingly yes. The list is handled as an identifier list and
> >>> parsed by SplitIdentifierString thus it can accept double-quoted
> >>> names.
> >>
> >
> > Attached latest version patch which has only feature logic so far.
> > I'm writing document patch about this feature now, so this version
> > patch doesn't have document and regression test patch.
>
> Thanks for updating the patch!
>
> When I changed s_s_names to 'hoge*' and reloaded the configuration file,
> the server crashed unexpectedly with the following error message.
> This is obviously a bug.
>
> FATAL: syntax error

I had a glance at the lexer part in the new patch. It'd be better to design the lexer from the beginning according to the required behavior.

The documentation says the following about the syntax:

http://www.postgresql.org/docs/current/static/runtime-config-logging.html

> application_name (string)
>
> The application_name can be any string of less than NAMEDATALEN
> characters (64 characters in a standard build). <snip> Only
> printable ASCII characters may be used in the application_name
> value. Other characters will be replaced with question marks (?).

And according to what the functions mentioned so far do, on the whole an application_name is treated as follows, I suppose.

- check_application_name() currently allows [\x20-\x7e], which differs from the definition of SQL identifiers.

- SplitIdentifierString() and syncrep code

 - allows any byte except a double quote in double-quoted representation. A double-quote just after a delimiter can open a quoted representation.
 - Non-quoted name can contain any character including double quotes, except ',' and white spaces.

- The syncrep code does case-insensitive matching with the application_name.

So, to preserve or follow the current behavior except for the last one, the following pattern definitions would do. The lexer/grammar for the new format of s_s_names could be simpler than what it is.

space [ \n\r\f\t\v] /* See the definition of isspace(3) */
whitespace {space}+
dquote \"
app_name_chars [\x21-\x2b\x2d-\x7e] /* excluding ' ', ',' */
app_name_indq_chars [\x20\x21\x23-\x7e] /* excluding '"' */
app_name_dq_chars ({app_name_indq_chars}|{dquote}{dquote})
delimiter {whitespace}*,{whitespace}*
app_name ({app_name_chars}+|{dquote}{app_name_dq_chars}+{dquote})
s_s_names {app_name}({delimiter}{app_name})*

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hello, Ok, I think we should concentrate the parser part for now. At Tue, 23 Feb 2016 17:44:44 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160223.174444.178687579.horiguchi.kyotaro@lab.ntt.co.jp> > Hello, > > At Mon, 22 Feb 2016 22:52:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwENujogaQvcc=u0tffNfFGtwXNb1yFcphdTYCJdG1_j1A@mail.gmail.com> > > Thanks for updating the patch! > > > > When I changed s_s_names to 'hoge*' and reloaded the configuration file, > > the server crashed unexpectedly with the following error message. > > This is obviously a bug. > > > > FATAL: syntax error > > I had a glance on the lexer part in the new patch. It'd be > better to design the lexer from the beginning according to the > required behavior. > > The documentation for the syntax is saying as the following, > > http://www.postgresql.org/docs/current/static/runtime-config-logging.html > > > application_name (string) > > > > The application_name can be any string of less than NAMEDATALEN > > characters (64 characters in a standard build). <snip> Only > > printable ASCII characters may be used in the application_name > > value. Other characters will be replaced with question marks (?). > > And according to what some functions mentioned so far do, totally > an application_name is treated as follwoing, I suppose. > > - check_application_name() currently allows [\x20-\x7e], which > differs from the definition of the SQL identifiers. > > - SplitIdentifierString() and syncrep code > > - allows any byte except a double quote in double-quoted > representation. A double-quote just after a delimiter can open > quoted representation. > > - Non-quoted name can contain any character including double > quotes except ',' and white spaces. > > - The syncrep code does case-insensitive matching with the > application_name. 
> > So, to preserve or following the current behavior expct the last > one, the following pattern definitions would do. The > lexer/grammer for the new format of s_s_names could be simpler > than what it is. > > space [ \n\r\f\t\v] /* See the definition of isspace(3) */ > whitespace {space}+ > dquote \" > app_name_chars [\x21-\x2b\x2d-\x7e] /* excluding ' ', ',' */ > app_name_indq_chars [\x20\x21\x23-\x7e] /* excluding '"' */ > app_name_dq_chars ({app_name_indq_chars}|{dquote}{dquote}) > delimiter {whitespace}*,{whitespace}* > app_name ({app_name_chars}+|{dquote}{app_name_dq_chars}+{dquote}) > s_s_names {app_name}({delimiter}{app_name})* So I made a hasty independent parser for the syntax including the group names for the convenience for separate testing. The parser takes input from stdin and prints the result structure. It can take old s_s_name format and new list format. We haven't discussed how to add gruop names but I added it as "<grpname>" just before the # of syncronous standbys of [] and {} lists. Is this usable for further discussions? The sources can be compiles by the following commandline. $ bison -v test.y; flex -l test.l; gcc -g -DYYDEBUG=1 -DYYERROR_VERBOSE -o ltest test.tab.c and it makes the output like following. 
[horiguti@drain tmp]$ echo '123[1,3,<x>3{a,b,e},4,*]' | ./ltest TYPE: PRIO_LIST GROUPNAME: <none> NSYNC: 123 NEST: 2 CHILDREN { { TYPE: HOSTNAME HOSTNAME: 1 QUOTED: No NEST: 1 } { TYPE: HOSTNAME HOSTNAME: 3 QUOTED: No NEST:0 } TYPE: QUORUM_LIST GROUPNAME: x NSYNC: 3 NEST: 1 CHILDREN { { TYPE: HOSTNAME HOSTNAME: a QUOTED: No NEST: 0 } { TYPE: HOSTNAME HOSTNAME: b QUOTED: No NEST: 0 } { TYPE: HOSTNAME HOSTNAME:e QUOTED: No NEST: 0 } } { TYPE: HOSTNAME HOSTNAME: 4 QUOTED: No NEST: 0 } { TYPE: HOSTNAME HOSTNAME: * QUOTED: No NEST: 0 } } regards, -- Kyotaro Horiguchi NTT Open Source Software Center %{ #include <stdio.h> #include <stdlib.h> %} %option noyywrap %x DQNAME %x APPNAME space [ \t\n\r\f] whitespace {space}+ dquote \" app_name_chars [\x21-\x2b\x2d-\x3b\x3d\x3f-\x5a\x5c\x5e-\x7a\x7c\x7e] app_name_indq_chars [\x20\x21\x23-\x7e] app_name {app_name_chars}+ app_name_dq ({app_name_indq_chars}|{dquote}{dquote})+ delimiter {whitespace}*,{whitespace}* app_name_start {app_name_chars} any_app \*|({dquote}\*{dquote}) xdstart {dquote} xdstop {dquote} self [\[\]\{\}<>] %% {xdstart} { BEGIN(DQNAME); } <DQNAME>{xdstop} { BEGIN(INITIAL); } <DQNAME>{app_name_dq} { static char name[64]; int i, j; for (i = j = 0 ; j < 63 && yytext[i] ; i++, j++) { if (yytext[i] == '"') { if (yytext[i+1] == '"') name[j]= '"'; else fprintf(stderr, "illegal quote escape"); i++;} else name[j] = yytext[i]; } name[j] = 0; yylval.str = strdup(name); return QUOTED_NAME; } {app_name_start} { BEGIN(APPNAME); yyless(0);} <APPNAME>{app_name} {char *p; yylval.str = strdup(yytext);for (p = yylval.str ; *p ; p++){ if (*p >= 'A' && *p <= 'Z') *p = *p + ('a' - 'A');}BEGIN(INITIAL);returnNAME_OR_NUMBER; } {delimiter} { return DELIMITER;} {self} { return yytext[0];} %% //int main(void) //{ // int r; // // while(r = yylex()) { // fprintf(stderr, "#%d:(%s)#", r, yylval.str); // yylval.str = ""; // } //} %{ #include <stdio.h> #include <stdlib.h> #include <string.h> //#define YYDEBUG 1 typedef enum treeelemtype { 
TE_HOSTNAME, TE_PRIORITY_LIST, TE_QUORUM_LIST } treeelemtype; struct syncdef; typedef struct syncdef { treeelemtype type; char *name; int quoted; int nsync; int nest; struct syncdef *elems; struct syncdef *next; } syncdef; void yyerror(const char *s); int yylex(void); int depth = 0; syncdef *defroot = NULL; syncdef *curr = NULL; %} %union { char *str; int ival; syncdef *syncdef; } %token <str> NAME_OR_NUMBER %token <str> QUOTED_NAME %token DELIMITER %type <syncdef> qlist plist name_list name_elem name_elem_nonlist %type <syncdef> old_list s_s_names list_maybe_with_name %type <str> group_name %% s_s_names:old_list{ syncdef *t = (syncdef*)malloc(sizeof(syncdef)); t->type = TE_PRIORITY_LIST; t->name = NULL; t->quoted = 0; t->nsync = 1; t->elems = $1; t->next = NULL; defroot =$$ = t;}| list_maybe_with_name{ defroot = $$ = $1;}; old_list:name_elem_nonlist{ $$ = $1;}| old_list DELIMITER name_elem_nonlist{ syncdef *p = $1; while (p->next) p = p->next; p->next = $3;}; list_maybe_with_name:plist{$$ = $1;}| qlist{$$ = $1;}| '<' group_name '>' plist{ $4->name = $2; $$ = $4;}| '<' group_name'>' qlist{ $4->name = $2; $$ = $4;}; group_name:NAME_OR_NUMBER{ $$ = strdup($1); }| QUOTED_NAME{ $$ = strdup($1); }; plist: NAME_OR_NUMBER '[' name_list ']'{ syncdef *t; int n = atoi($1); if (n == 0) { yyerror("prefix number is 0 ornon-integer"); return 1; } if ($3->nest > 1) { yyerror("Up to 2 levels of nesting is supported"); return 1; } for (t = $3 ; t ; t = t->next) { if (t->type == TE_HOSTNAME && t->next && strcmp(t->name, "*") == 0) { yyerror("\"*\" is allowed only at the end of priority list"); return 1; } } t = (syncdef*)malloc(sizeof(syncdef)); t->type = TE_PRIORITY_LIST; t->nsync = n; t->name = NULL; t->quoted = 0; t->nest =$3->nest + 1; t->elems = $3; t->next = NULL; $$ = t;} ; qlist: NAME_OR_NUMBER '{' name_list '}'{ syncdef *t; int n = atoi($1); if (n == 0) { yyerror("prefix number is 0 ornon-integer"); return 1; } if ($3->nest > 1) { yyerror("Up to 2 levels of nesting is 
supported"); return 1; } for (t = $3 ; t ; t = t->next) { if (t->type == TE_HOSTNAME && strcmp(t->name, "*") == 0) { yyerror("\"*\"is not allowed in quorum list"); return 1; } } t = (syncdef*)malloc(sizeof(syncdef)); t->type = TE_QUORUM_LIST; t->nsync = n; t->name = NULL; t->quoted = 0; t->nest = $3->nest+ 1; t->elems = $3; t->next = NULL; $$ = t;} ; name_list:name_elem{ $$ = $1;}| name_list DELIMITER name_elem{ syncdef *p = $1; if (p->nest < $3->nest) p->nest = $3->nest; while (p->next) p = p->next; p->next = $3; $$ = $1;}; name_elem:name_elem_nonlist{ $$ = $1; }| list_maybe_with_name{ $$ = $1;}; name_elem_nonlist:NAME_OR_NUMBER{ syncdef *t = (syncdef*)malloc(sizeof(syncdef)); t->type = TE_HOSTNAME; t->nsync = 0; t->name = strdup($1); t->quoted = 0; t->nest = 0; t->elems = NULL; t->next = NULL; $$ = t; }| QUOTED_NAME{ syncdef*t = (syncdef*)malloc(sizeof(syncdef)); t->type = TE_HOSTNAME; t->nsync = 0; t->name = strdup($1); t->quoted =1; t->nest = 0; t->elems = NULL; t->next = NULL; $$ = t; }; %% void indent(int level) { int i; for (i = 0 ; i < level * 2 ; i++) putc(' ', stdout); } void dump_def(syncdef *def, int level) { char *typelabel[] = {"HOSTNAME", "PRIO_LIST", "QUORUM_LIST"}; syncdef *p; if (def == NULL) return; switch (def->type) { case TE_HOSTNAME: indent(level); puts("{"); indent(level+1); printf("TYPE: %s\n", typelabel[def->type]); indent(level+1); printf("HOSTNAME: %s\n", def->name); indent(level+1); printf("QUOTED: %s\n", def->quoted ? "Yes" : "No"); indent(level+1); printf("NEST:%d\n", def->nest); indent(level); puts("}"); if (def->next) dump_def(def->next, level); break; case TE_PRIORITY_LIST: case TE_QUORUM_LIST: indent(level); printf("TYPE: %s\n", typelabel[def->type]); indent(level); printf("GROUPNAME: %s\n", def->name ? 
def->name : "<none>"); indent(level); printf("NSYNC: %d\n", def->nsync); indent(level); printf("NEST: %d\n", def->nest); indent(level); puts("CHILDREN {"); level++; dump_def(def->elems, level); level--; indent(level); puts("}"); if (def->next) dump_def(def->next, level); break; default: fprintf(stderr,"Unknown type?\n"); exit(1); } level--; } int main(void) { // yydebug = 1;if (!yyparse()) dump_def(defroot, 0); } void yyerror(const char* s) { fprintf(stderr, "Error: %s\n", s); } #include "lex.yy.c"
On Wed, Feb 24, 2016 at 5:37 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > Ok, I think we should concentrate the parser part for now. > > At Tue, 23 Feb 2016 17:44:44 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20160223.174444.178687579.horiguchi.kyotaro@lab.ntt.co.jp> >> Hello, >> >> At Mon, 22 Feb 2016 22:52:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwENujogaQvcc=u0tffNfFGtwXNb1yFcphdTYCJdG1_j1A@mail.gmail.com> >> > Thanks for updating the patch! >> > >> > When I changed s_s_names to 'hoge*' and reloaded the configuration file, >> > the server crashed unexpectedly with the following error message. >> > This is obviously a bug. >> > >> > FATAL: syntax error >> >> I had a glance on the lexer part in the new patch. It'd be >> better to design the lexer from the beginning according to the >> required behavior. >> >> The documentation for the syntax is saying as the following, >> >> http://www.postgresql.org/docs/current/static/runtime-config-logging.html >> >> > application_name (string) >> > >> > The application_name can be any string of less than NAMEDATALEN >> > characters (64 characters in a standard build). <snip> Only >> > printable ASCII characters may be used in the application_name >> > value. Other characters will be replaced with question marks (?). >> >> And according to what some functions mentioned so far do, totally >> an application_name is treated as follwoing, I suppose. >> >> - check_application_name() currently allows [\x20-\x7e], which >> differs from the definition of the SQL identifiers. >> >> - SplitIdentifierString() and syncrep code >> >> - allows any byte except a double quote in double-quoted >> representation. A double-quote just after a delimiter can open >> quoted representation. >> >> - Non-quoted name can contain any character including double >> quotes except ',' and white spaces. 
>> >> - The syncrep code does case-insensitive matching with the >> application_name. >> >> So, to preserve or following the current behavior expct the last >> one, the following pattern definitions would do. The >> lexer/grammer for the new format of s_s_names could be simpler >> than what it is. >> >> space [ \n\r\f\t\v] /* See the definition of isspace(3) */ >> whitespace {space}+ >> dquote \" >> app_name_chars [\x21-\x2b\x2d-\x7e] /* excluding ' ', ',' */ >> app_name_indq_chars [\x20\x21\x23-\x7e] /* excluding '"' */ >> app_name_dq_chars ({app_name_indq_chars}|{dquote}{dquote}) >> delimiter {whitespace}*,{whitespace}* >> app_name ({app_name_chars}+|{dquote}{app_name_dq_chars}+{dquote}) >> s_s_names {app_name}({delimiter}{app_name})* > > > So I made a hasty independent parser for the syntax including the > group names for the convenience for separate testing. The parser > takes input from stdin and prints the result structure. > > It can take old s_s_name format and new list format. We haven't > discussed how to add gruop names but I added it as "<grpname>" > just before the # of syncronous standbys of [] and {} lists. > > Is this usable for further discussions? Thank you for your suggestion. Another option is to add group name with ":" to immediately after set of standbys as I said earlier. <http://www.postgresql.org/message-id/CAD21AoA9UqcbTnDKi0osd0yhN4FPgTrg6wuZeTtvpSYy2LqL5Q@mail.gmail.com> s_s_names with group name would be as follows. s_s_names = '2[local, 2[london1, london2, london3]:london, (tokyo1, tokyo2):tokyo]' Though? Regards, -- Masahiko Sawada
Hello,

At Wed, 24 Feb 2016 18:01:59 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCetS5BMcTpXXtMwG0hyszZgNn=zK1U73GcWTgJ-Wn3pQ@mail.gmail.com>
> On Wed, Feb 24, 2016 at 5:37 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Hello,
> >
> > Ok, I think we should concentrate the parser part for now.
> >
> > At Tue, 23 Feb 2016 17:44:44 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160223.174444.178687579.horiguchi.kyotaro@lab.ntt.co.jp>
> >> Hello,
...
> >> So, to preserve or following the current behavior expct the last
> >> one, the following pattern definitions would do. The
> >> lexer/grammer for the new format of s_s_names could be simpler
> >> than what it is.
> >>
> >> space [ \n\r\f\t\v] /* See the definition of isspace(3) */
> >> whitespace {space}+
> >> dquote \"
> >> app_name_chars [\x21-\x2b\x2d-\x7e] /* excluding ' ', ',' */
> >> app_name_indq_chars [\x20\x21\x23-\x7e] /* excluding '"' */
> >> app_name_dq_chars ({app_name_indq_chars}|{dquote}{dquote})
> >> delimiter {whitespace}*,{whitespace}*
> >> app_name ({app_name_chars}+|{dquote}{app_name_dq_chars}+{dquote})
> >> s_s_names {app_name}({delimiter}{app_name})*
> >
> > So I made a hasty independent parser for the syntax including the
> > group names for the convenience for separate testing. The parser
> > takes input from stdin and prints the result structure.
> >
> > It can take old s_s_name format and new list format. We haven't
> > discussed how to add gruop names but I added it as "<grpname>"
> > just before the # of syncronous standbys of [] and {} lists.
> >
> > Is this usable for further discussions?
>
> Thank you for your suggestion.
>
> Another option is to add group name with ":" to immediately after set
> of standbys as I said earlier.
> <http://www.postgresql.org/message-id/CAD21AoA9UqcbTnDKi0osd0yhN4FPgTrg6wuZeTtvpSYy2LqL5Q@mail.gmail.com>
>
> s_s_names with group name would be as follows.
> s_s_names = '2[local, 2[london1, london2, london3]:london, (tokyo1,
> tokyo2):tokyo]'
>
> Though?

I have no problem with it. The attached new sample parser does so.

By the way, your parser also complains about an example I've seen somewhere upthread, "1[2,3,4]". This is because '2', '3' and '4' are regarded as INT, not NAME. Whether a sequence of digits is a prefix number of a list or a host name cannot be identified until some of the following characters have been read. So my previous test.l defined NAME_OR_INTEGER, and it is distinguished on the grammar side to resolve this problem.

If you want them identified on the lexer side, it should look forward as <NAME_OR_PREFIX>{prefix} in the attached test.l does. This makes the lexer a bit more complex but, in contrast, test.y simpler. The attached test.l and test.y got refactored, but the .l gets a bit tricky.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

%{
#include <stdio.h>
#include <stdlib.h>
%}
%option noyywrap
%x DQNAME
%x NAME_OR_PREFIX
%x APPNAME
%x GRPCLOSED

space			[ \t\n\r\f]
whitespace		{space}+
dquote			\"
app_name_chars		[\x21-\x27\x2a\x2b\x2d-\x5a\x5c\x5e-\x7a\x7c\x7e]
app_name_indq_chars	[\x20\x21\x23-\x7e]
app_name		{app_name_chars}+
app_name_dq		({app_name_indq_chars}|{dquote}{dquote})+
delimiter		{whitespace}*,{whitespace}*
app_name_start		{app_name_chars}
any_app			\*|({dquote}\*{dquote})
xdstart			{dquote}
xdstop			{dquote}
openlist		[\[\(]
prefix			[0-9]+{whitespace}*{openlist}
closelist		[\]\)]

%%
{xdstart}		{ BEGIN(DQNAME); }
<DQNAME>{xdstop}	{ BEGIN(INITIAL); }
<DQNAME>{app_name_dq}	{
		appname *name = (appname *)malloc(sizeof(appname));
		int i, j;
		for (i = j = 0 ; j < 63 && yytext[i] ; i++, j++)
		{
			if (yytext[i] == '"')
			{
				if (yytext[i+1] == '"')
					name->str[j]= '"';
				else
					fprintf(stderr, "illegal quote escape\n");
				i++;
			}
			else
				name->str[j] = yytext[i];
		}
		name->str[j] = 0;
		name->quoted = 1;
		yylval.name = name;
		return NAME;
	}
{app_name_start}	{ BEGIN(NAME_OR_PREFIX); yyless(0); }
<NAME_OR_PREFIX>{app_name}	{
		appname *name = (appname *)malloc(sizeof(appname));
		char *p;
		name->quoted = 0;
		strncpy(name->str, yytext, 63);
		name->str[63] = 0;
		for (p = name->str ; *p ; p++)
		{
			if (*p >= 'A' && *p <= 'Z')
				*p = *p + ('a' - 'A');
		}
		yylval.name = name;
		BEGIN(INITIAL);
		return NAME;
	}
<NAME_OR_PREFIX>{prefix}	{
		static char prefix[16];
		int i, l;
		/* find the last digit */
		for (l = 0 ; l < 16 && isdigit(yytext[l]) ; l++);
		if (l > 15)
			fprintf(stderr, "too long prefix number for lists\n");
		for (i = 0 ; i < l ; i++)
			prefix[i] = yytext[i];
		prefix[i] = 0;
		yylval.str = strdup(prefix);
		/* prefix ends with a left brace or paren, so go backward by 1 char for further reading */
		yyless(yyleng - 1);
		BEGIN(INITIAL);
		return PREFIX;
	}
<GRPCLOSED>{whitespace}*.	{
		BEGIN(INITIAL);
		if (yytext[yyleng - 1] == ':')
			return yytext[yyleng - 1];
		yyless(0);
	}
{delimiter}	{ return DELIMITER; }
{openlist}	{ yylval.character = yytext[0]; return OPENLIST; }
{closelist}	{ BEGIN(GRPCLOSED); yylval.character = yytext[0]; return CLOSELIST; }
%%
//int main(void)
//{
//	int r;
//
//	while(r = yylex()) {
//		fprintf(stderr, "#%d:(%s)#", r, yylval.str);
//		yylval.str = "";
//	}
//}

%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

//#define YYDEBUG 1

typedef enum treeelemtype
{
	TE_HOSTNAME,
	TE_PRIORITY_LIST,
	TE_QUORUM_LIST
} treeelemtype;

struct syncdef;
typedef struct syncdef
{
	treeelemtype type;
	char *name;
	int quoted;
	int nsync;
	int nest;
	struct syncdef *elems;
	struct syncdef *next;
} syncdef;

typedef struct
{
	int quoted;
	char str[64];
} appname;

void yyerror(const char *s);
int yylex(void);

int depth = 0;
syncdef *defroot = NULL;
syncdef *curr = NULL;
%}

%union
{
	char character;
	char *str;
	appname *name;
	int ival;
	syncdef *syncdef;
}

%token <str> PREFIX
%token <name> NAME
%token <character> OPENLIST CLOSELIST
%token DELIMITER
%type <syncdef> group_list name_list name_elem name_elem_nonlist
%type <syncdef> old_list s_s_names
%type <name> opt_groupname
%type <str> opt_prefix

%%
s_s_names:
	old_list
		{
			syncdef *t = (syncdef*)malloc(sizeof(syncdef));
			t->type = TE_PRIORITY_LIST;
			t->name = NULL;
			t->quoted = 0;
			t->nsync = 1;
			t->elems = $1;
			t->next = NULL;
			defroot = $$ = t;
		}
	| group_list
		{
			defroot = $$ = $1;
		}
	;
old_list:
	name_elem_nonlist		{ $$ = $1; }
	| old_list DELIMITER name_elem_nonlist
		{
			syncdef *p = $1;
			while (p->next) p = p->next;
			p->next = $3;
		}
	;
group_list:
	opt_prefix OPENLIST name_list CLOSELIST opt_groupname
		{
			syncdef *t;
			char *p = $1;
			int n = atoi($1);
			if (n == 0)
			{
				yyerror("prefix number is 0 or non-integer");
				return 1;
			}
			if ($3->nest > 1)
			{
				yyerror("Up to 2 levels of nesting is supported");
				return 1;
			}
			for (t = $3 ; t ; t = t->next)
			{
				if (t->type == TE_HOSTNAME && t->next && strcmp(t->name, "*") == 0)
				{
					yyerror("\"*\" is allowed only at the end of priority list");
					return 1;
				}
			}
			if ($2 == '[' && $4 != ']' || $2 == '(' && $4 != ')')
			{
				yyerror("Unmatched group parentheses");
				return 1;
			}
			t = (syncdef*)malloc(sizeof(syncdef));
			t->type = ($2 == '[' ? TE_PRIORITY_LIST : TE_QUORUM_LIST);
			t->nsync = n;
			t->name = $5->str;
			t->quoted = $5->quoted;
			t->nest = $3->nest + 1;
			t->elems = $3;
			t->next = NULL;
			$$ = t;
		}
	;
opt_prefix:
	PREFIX			{ $$ = $1; }
	|			{ $$ = "1"; }
	;
opt_groupname:
	':' NAME		{ $$ = $2; }
	| /* EMPTY */
		{
			appname *name = (appname *)malloc(sizeof(name));
			name->str[0] = 0;
			name->quoted = 0;
			$$ = name;
		}
	;
name_list:
	name_elem		{ $$ = $1; }
	| name_list DELIMITER name_elem
		{
			syncdef *p = $1;
			if (p->nest < $3->nest) p->nest = $3->nest;
			while (p->next) p = p->next;
			p->next = $3;
			$$ = $1;
		}
	;
name_elem:
	name_elem_nonlist	{ $$ = $1; }
	| group_list		{ $$ = $1; }
	;
name_elem_nonlist:
	NAME
		{
			syncdef *t = (syncdef*)malloc(sizeof(syncdef));
			t->type = TE_HOSTNAME;
			t->nsync = 0;
			t->name = strdup($1->str);
			t->quoted = $1->quoted;
			t->nest = 0;
			t->elems = NULL;
			t->next = NULL;
			$$ = t;
		}
	;
%%
void
indent(int level)
{
	int i;
	for (i = 0 ; i < level * 2 ; i++)
		putc(' ', stdout);
}

void
dump_def(syncdef *def, int level)
{
	char *typelabel[] = {"HOSTNAME", "PRIO_LIST", "QUORUM_LIST"};
	syncdef *p;

	if (def == NULL) return;

	switch (def->type)
	{
	case TE_HOSTNAME:
		indent(level); puts("{");
		indent(level+1); printf("TYPE: %s\n", typelabel[def->type]);
		indent(level+1); printf("HOSTNAME: %s\n", def->name);
		indent(level+1); printf("QUOTED: %s\n", def->quoted ? "Yes" : "No");
		indent(level+1); printf("NEST:%d\n", def->nest);
		indent(level); puts("}");
		if (def->next) dump_def(def->next, level);
		break;
	case TE_PRIORITY_LIST:
	case TE_QUORUM_LIST:
		indent(level); printf("TYPE: %s\n", typelabel[def->type]);
		indent(level); printf("GROUPNAME: %s\n", def->name ? def->name : "<none>");
		indent(level); printf("NSYNC: %d\n", def->nsync);
		indent(level); printf("NEST: %d\n", def->nest);
		indent(level); puts("CHILDREN {");
		level++;
		dump_def(def->elems, level);
		level--;
		indent(level); puts("}");
		if (def->next) dump_def(def->next, level);
		break;
	default:
		fprintf(stderr, "Unknown type?\n");
		exit(1);
	}
	level--;
}

int
main(void)
{
	// yydebug = 1;
	if (!yyparse())
		dump_def(defroot, 0);
}

void
yyerror(const char* s)
{
	fprintf(stderr, "Error: %s\n", s);
}

#include "lex.yy.c"
Attached latest patch, which includes a documentation patch.

> When I changed s_s_names to 'hoge*' and reloaded the configuration file,
> the server crashed unexpectedly with the following error message.
> This is obviously a bug.

Fixed.

> - allows any byte except a double quote in double-quoted
> representation. A double-quote just after a delimiter can open
> quoted representation.

No, a double quote is also allowed in the double-quoted representation, written as two double quotes.
If s_s_names = '"node""hoge"', then the standby name will be 'node"hoge'.

> I have no problem with it. The attached new sample parser does
> so.
>
> By the way, your parser also complains for an example I've seen
> somewhere upthread "1[2,3,4]". This is because '2', '3' and '4'
> are regarded as INT, not NAME. Whether a sequence of digits is a
> prefix number of a list or a host name cannot be identified until
> reading some following characters. So my previous test.l defined
> NAME_OR_INTEGER and it is distinguished in the grammar side to
> resolve this problem.
>
> If you want them identified in the lexer side, it should do
> looking-forward as <NAME_OR_PREFIX>{prefix} in the attached
> test.l does. This makes the lexer a bit complex but in contrast
> test.y simpler. The test.l, test.y attached got refactored but .l
> gets a bit tricky..

I think that the lexer can pass both INT and NAME as char* to the parser, and the parser can then regard them as integer or char*. That would be simpler.
Thoughts?

Thank you for giving the lexer and parser example, but I'm not sure that it makes things easier. It seems to make things more complex.

The attached patch handles the parameter in a similar way to how postgres parses SQL.
Please have a look at it and give me feedback.

Regards,

--
Masahiko Sawada
Attachment
On Fri, Feb 26, 2016 at 1:23 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached latest patch includes document patch.
>
>> When I changed s_s_names to 'hoge*' and reloaded the configuration file,
>> the server crashed unexpectedly with the following error message.
>> This is obviously a bug.
>
> Fixed.
>
>> - allows any byte except a double quote in double-quoted
>> representation. A double-quote just after a delimiter can open
>> quoted representation.
>
> No. double quote is also allowed in double-quoted representation using
> by two double-quotes.
> if s_s_names = '"node""hoge"' then standby name will be 'node"hoge'.
>
>> I have no problem with it. The attached new sample parser does
>> so.
>>
>> By the way, your parser also complains for an example I've seen
>> somewhere upthread "1[2,3,4]". This is because '2', '3' and '4'
>> are regarded as INT, not NAME. Whether a sequence of digits is a
>> prefix number of a list or a host name cannot be identified until
>> reading some following characters. So my previous test.l defined
>> NAME_OR_INTEGER and it is distinguished in the grammar side to
>> resolve this problem.
>>
>> If you want them identified in the lexer side, it should do
>> looking-forward as <NAME_OR_PREFIX>{prefix} in the attached
>> test.l does. This makes the lexer a bit complex but in contrast
>> test.y simpler. The test.l, test.y attached got refactored but .l
>> gets a bit tricky..
>
> I think that lexer can pass both INT and NAME as char* to parser, and
> then parser regards them as integer or char*.
> It would be more simple.
> Thoughts?
>
> Thank you for giving lexer and parser example but I'm not sure that it
> makes thing more easier.
> It seems to make thing more complex.
>
> Attached patch handles parameter using similar way as postgres parses SQL.
> Please having a look it and give me feedbacks.

Previous patch could not parse one-character standby names correctly.
Attached latest patch.

Regards,

--
Masahiko Sawada
Attachment
Hello, thanks for the new patch.

At Fri, 26 Feb 2016 08:52:54 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAZKFVu8-MVhkJ3ywAiJmb=P-HSbJTGi=gK1La73KjS6Q@mail.gmail.com>
> Previous patch could not parse one character standby name correctly.
> Attached latest patch.

I haven't looked at it in detail, but it won't work as you expected. flex complains as follows for the v12 patch.

syncgroup_scanner.l:80: warning, rule cannot be matched
syncgroup_scanner.l:84: warning, rule cannot be matched

They are warnings about the patterns [1-9][0-9]* and {asterisk}, because they are matched by {node_name}+. The latter would do no harm (or the pattern is useless), but the former will make '1[a,b,c]' fail.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 26 Feb 2016 10:38:22 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160226.103822.12680005.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello, Thanks for the new patch.
>
> At Fri, 26 Feb 2016 08:52:54 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAZKFVu8-MVhkJ3ywAiJmb=P-HSbJTGi=gK1La73KjS6Q@mail.gmail.com>
> > Previous patch could not parse one character standby name correctly.
> > Attached latest patch.
>
> I haven't looked it in detail but it won't work as you
> expected. flex compains as the following for v12 patch.
>
> syncgroup_scanner.l:80: warning, rule cannot be matched
> syncgroup_scanner.l:84: warning, rule cannot be matched

Making it independent from the postgres body, then compiling it with -DYYDEBUG and setting yydebug = 1, would give you valuable information and make testing of the parser far easier.

| $ flex test2.l; bison -v test2.y; gcc -g -DYYDEBUG -o ltest2 test2.tab.c
| $ echo '1[aa,bb,cc]' | ./ltest2
| Starting parse
| Entering state 0
| Reading a token: Next token is token NAME ()
| Shifting token NAME ()
| ...
| Entering state 4
| Next token is token '[' ()
| syntax error at or near "[" in "(null)

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

%{
//#include "postgres.h"

/* No reason to constrain amount of data slurped */
#define YY_READ_BUF_SIZE 16777216

#define BUFSIZE 8192

/* Handles to the buffer that the lexer uses internally */
static YY_BUFFER_STATE scanbufhandle;

/* Functions for handling double quoted string */
static void init_xd_string(void);
static void addlit_xd_string(char *ytext, int yleng);
static void addlitchar_xd_string(unsigned char ychar);

char *scanbuf;
char *xd_string;
int xd_size;	/* actual size of xd_string */
int xd_len;	/* string length of xd_string */
%}

%option 8bit
/* %option never-interactive */
/* %option nounput */
/* %option noinput */
%option noyywrap
%option warn
/* %option prefix="syncgroup_yy" */

/*
 * <xd> delimited identifiers (double-quoted identifiers)
 */
%x xd

space		[ \t\n\r\f]
non_newline	[^\n\r]
whitespace	({space}+)
self		[\[\]\,]
asterisk	\*

/*
 * Basically all ascii characters except for {self} and {whitespace} are allowed
 * to be used for node name. These special characters could be used by double-quoting.
 */
/* excluding ' ', '\"', '*', ',', '[', ']' */
node_name	[\x21\x23-\x29\x29-\x2b\x2d-\x5a\x5c\x5e-\x7e]
/* excluding '\"' */
dquoted_name	[\x20\x21\x23-\x7e]

/* Double-quoted string */
dquote		\"
xdstart		{dquote}
xddouble	{dquote}{dquote}
xdstop		{dquote}
xdinside	{dquoted_name}+

%%
{whitespace}	{ /* ignore */ }
{xdstart}	{
		init_xd_string();
		BEGIN(xd);
	}
<xd>{xddouble}	{
		addlitchar_xd_string('\"');
	}
<xd>{xdinside}	{
		addlit_xd_string(yytext, yyleng);
	}
<xd>{xdstop}	{
		xd_string[xd_len] = '\0';
		yylval.str = xd_string;
		BEGIN(INITIAL);
		return NAME;
	}
{node_name}+	{
		yylval.str = strdup(yytext);
		return NAME;
	}
[1-9][0-9]*	{
		yylval.str = yytext;
		return NUM;
	}
{asterisk}	{
		yylval.str = strdup(yytext);
		return AST;
	}
{self}	{
		return yytext[0];
	}
.	{
//		ereport(ERROR,
//				(errcode(ERRCODE_SYNTAX_ERROR),
//				 errmsg("syntax error: unexpected character \"%s\"", yytext)));
		fprintf(stderr, "syntax error: unexpected character \"%s\"", yytext);
		exit(1);
	}
%%

void
yyerror(const char *message)
{
//	ereport(ERROR,
//			(errcode(ERRCODE_SYNTAX_ERROR),
//			 errmsg("%s at or near \"%s\" in \"%s\"", message,
//					yytext, scanbuf)));
	fprintf(stderr, "%s at or near \"%s\" in \"%s\"", message, yytext, scanbuf);
	exit(1);
}

void
syncgroup_scanner_init(const char *str)
{
	Size slen = strlen(str);

	/*
	 * Might be left over after ereport()
	 */
	if (YY_CURRENT_BUFFER)
		yy_delete_buffer(YY_CURRENT_BUFFER);

	/*
	 * Make a scan buffer with special termination needed by flex.
	 */
	scanbuf = (char *) palloc(slen + 2);
	memcpy(scanbuf, str, slen);
	scanbuf[slen] = scanbuf[slen + 1] = YY_END_OF_BUFFER_CHAR;
	scanbufhandle = yy_scan_buffer(scanbuf, slen + 2);
}

void
syncgroup_scanner_finish(void)
{
	yy_delete_buffer(scanbufhandle);
	scanbufhandle = NULL;
}

static void
init_xd_string()
{
	xd_string = palloc(sizeof(char) * BUFSIZE);
	xd_size = BUFSIZE;
	xd_len = 0;
}

static void
addlit_xd_string(char *ytext, int yleng)
{
	/* enlarge buffer if needed */
	if ((xd_len + yleng) > xd_size)
		xd_string = repalloc(xd_string, xd_size + BUFSIZE);

	memcpy(xd_string + xd_len, ytext, yleng);
	xd_len += yleng;
}

static void
addlitchar_xd_string(unsigned char ychar)
{
	/* enlarge buffer if needed */
	if ((xd_len + 1) > xd_size)
		xd_string = repalloc(xd_string, xd_size + BUFSIZE);

	xd_string[xd_len] = ychar;
	xd_len += 1;
}

%{
/*-------------------------------------------------------------------------
 *
 * syncgroup_gram.y		- Parser for synchronous replication group
 *
 * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 *
 * IDENTIFICATION
 *	  src/backend/replication/syncgroup_gram.y
 *
 *-------------------------------------------------------------------------
 */
//#include "postgres.h"
//#include "replication/syncrep.h"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define palloc malloc
#define repalloc realloc
#define pfree free

#define SYNC_REP_GROUP_MAIN		0x01
#define SYNC_REP_GROUP_NAME		0x02
#define SYNC_REP_GROUP_GROUP		0x04

#define SYNC_REP_METHOD_PRIORITY	0

struct SyncGroupNode;
typedef struct SyncGroupNode SyncGroupNode;

struct SyncGroupNode
{
	/* Common information */
	int	type;
	char	*name;
	SyncGroupNode	*next;		/* Same group, next name node */

	/* For group node */
	int	sync_method;		/* priority */
	int	wait_num;
	SyncGroupNode	*members;	/* member of its group */
};

static SyncGroupNode *create_name_node(char *name);
static SyncGroupNode *add_node(SyncGroupNode *node_list, SyncGroupNode *node);
static SyncGroupNode *create_group_node(char *wait_num, SyncGroupNode *node_list);
static void yyerror(const char *message);

typedef int Size;

/*
 * Bison doesn't allocate anything that needs to live across parser calls,
 * so we can easily have it use palloc instead of malloc. This prevents
 * memory leaks if we error out during parsing. Note this only works with
 * bison >= 2.0. However, in bison 1.875 the default is to use alloca()
 * if possible, so there's not really much problem anyhow, at least if
 * you're building with gcc.
 */
#define YYMALLOC palloc
#define YYFREE pfree

SyncGroupNode *SyncRepStandbys;
%}

%expect 0
/* %name-prefix="syncgroup_yy" */

%union
{
	char *str;
	SyncGroupNode *expr;
}

%token <str> NAME NUM
%token <str> AST

%type <expr> result sync_list sync_list_ast sync_element sync_element_ast sync_node_group sync_group_old sync_group

%start result

%%
result:
	sync_node_group				{ SyncRepStandbys = $1; }
	;
sync_node_group:
	sync_group_old				{ $$ = $1; }
	| sync_group				{ $$ = $1; }
	;
sync_group_old:
	sync_list				{ $$ = create_group_node("1", $1); }
	| sync_list_ast				{ $$ = create_group_node("1", $1); }
	;
sync_group:
	NUM '[' sync_list ']'			{ $$ = create_group_node($1, $3); }
	| NUM '[' sync_list_ast ']'		{ $$ = create_group_node($1, $3); }
	;
sync_list:
	sync_element				{ $$ = $1; }
	| sync_list ',' sync_element		{ $$ = add_node($1, $3); }
	;
sync_list_ast:
	sync_element_ast			{ $$ = $1; }
	| sync_list ',' sync_element_ast	{ $$ = add_node($1, $3); }
	;
sync_element:
	NAME					{ $$ = create_name_node($1); }
	| NUM					{ $$ = create_name_node($1); }
	;
sync_element_ast:
	AST					{ $$ = create_name_node($1); }
	;
%%

static SyncGroupNode *
create_name_node(char *name)
{
	SyncGroupNode *name_node = (SyncGroupNode *)malloc(sizeof(SyncGroupNode));

	/* Common information */
	name_node->type = SYNC_REP_GROUP_NAME;
	name_node->name = strdup(name);
	name_node->next = NULL;

	/* For GROUP node */
	name_node->sync_method = 0;
	name_node->wait_num = 0;
	name_node->members = NULL;
//	name_node->SyncRepGetSyncedLsnsFn = NULL;
//	name_node->SyncRepGetSyncStandbysFn = NULL;

	return name_node;
}

static SyncGroupNode *
create_group_node(char *wait_num, SyncGroupNode *node_list)
{
	SyncGroupNode *group_node = (SyncGroupNode *)malloc(sizeof(SyncGroupNode));

	/* For NAME node */
	group_node->type = SYNC_REP_GROUP_GROUP | SYNC_REP_GROUP_MAIN;
	group_node->name = "main";
	group_node->next = NULL;

	/* For GROUP node */
	group_node->sync_method = SYNC_REP_METHOD_PRIORITY;
	group_node->wait_num = atoi(wait_num);
	group_node->members = node_list;
//	group_node->SyncRepGetSyncedLsnsFn = SyncRepGetSyncedLsnsUsingPriority;
//	group_node->SyncRepGetSyncStandbysFn = SyncRepGetSyncStandbysUsingPriority;

	return group_node;
}

static SyncGroupNode *
add_node(SyncGroupNode *node_list, SyncGroupNode *node)
{
	SyncGroupNode *tmp = node_list;

	/* Add node to tail of node_list */
	while (tmp->next != NULL)
		tmp = tmp->next;
	tmp->next = node;

	return node_list;
}

void
indent(int level)
{
	int i;
	for (i = 0 ; i < level * 2 ; i++)
		putc(' ', stdout);
}

static void
dump_syncgroupnode(SyncGroupNode *def, int level)
{
	char *typelabel[] = {"MAIN", "NAME", "GROUP"};
	SyncGroupNode *p;

	if (def == NULL) return;

	switch (def->type)
	{
	case SYNC_REP_GROUP_NAME:
		indent(level); puts("{");
		indent(level+1); printf("NODE_TYPE: SYNC_REP_GROUP_NAME\n");
		indent(level+1); printf("NAME: %s\n", def->name);
		indent(level); puts("}");
		if (def->next) dump_syncgroupnode(def->next, level);
		break;
	case SYNC_REP_GROUP_GROUP | SYNC_REP_GROUP_MAIN:
		indent(level); puts("{");
		indent(level+1); printf("NODE_TYPE: SYNC_REP_GROUP_GROUP | SYNC_REP_GROUP_MAIN\n");
		indent(level+1); printf("NAME: %s\n", def->name);
		indent(level+1); printf("SYNC_METHOD: PRIORITY\n");
		indent(level+1); printf("WAIT_NUM: %d\n", def->wait_num);
		indent(level+1);
		if (def->members) dump_syncgroupnode(def->members, level+1);
		indent(level); puts("}");
		if (def->next) dump_syncgroupnode(def->next, level);
		break;
	default:
		fprintf(stderr, "ERR\n");
		exit(1);
	}
	level--;
}

int
main(void)
{
	yydebug = 1;
	yyparse();
	dump_syncgroupnode(SyncRepStandbys, 0);
}

//#include "syncgroup_scanner.c"
#include "lex.yy.c"
On Fri, Feb 26, 2016 at 10:53 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Fri, 26 Feb 2016 10:38:22 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160226.103822.12680005.horiguchi.kyotaro@lab.ntt.co.jp>
>> Hello, Thanks for the new patch.
>>
>> At Fri, 26 Feb 2016 08:52:54 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAZKFVu8-MVhkJ3ywAiJmb=P-HSbJTGi=gK1La73KjS6Q@mail.gmail.com>
>> > Previous patch could not parse one character standby name correctly.
>> > Attached latest patch.
>>
>> I haven't looked it in detail but it won't work as you
>> expected. flex compains as the following for v12 patch.
>>
>> syncgroup_scanner.l:80: warning, rule cannot be matched
>> syncgroup_scanner.l:84: warning, rule cannot be matched
>
> Making it independent from postgres body then compile it with
> -DYYDEBUG and set yydebug = 1 would give you valuable information
> and make testing of the parser far easier.

Thank you for your suggestion.

Attached latest version patch.
The changes from previous version are,
- Fix parser, lexer bugs.
- Add regression test patch based on patch Suraji submitted.

Please review it.

Regards,

--
Masahiko Sawada
Attachment
Sorry, I misread the previous patch. It actually worked.

At Sun, 28 Feb 2016 04:04:37 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoB69-tNLVzKRZ0Opzsr6LcLY36GJ2tHGohW33Btq3yRsw@mail.gmail.com>
> The changes from previous version are,
> - Fix parser, lexer bugs.
> - Add regression test patch based on patch Suraji submitted.

Thank you for the new patch. The parser almost looks to work as expected, but the following warnings were seen on build.

> In file included from syncgroup_gram.y:138:0:
> syncgroup_scanner.l:23:12: warning: ‘xd_size’ defined but not used [-Wunused-variable]
>  static int xd_size; /* actual size of xd_string */
>             ^
> syncgroup_scanner.l:24:12: warning: ‘xd_len’ defined but not used [-Wunused-variable]
>  static int xd_len; /* string length of xd_string */

Some random comments follow.

Comments for the lexer part.

===
> +node_name [^\ \,\[\]]

This accepts 'abc^Id' as a name, which is wrong behavior (but such application names are not allowed anyway; if you assume so, I'd like to see a comment for that). And the excessive escaping makes it a bit hard to read. The pattern can be written more precisely as follows (but I don't know whether it is generally easier to read..).

| node_name [\x20-\x7f]{-}[ \[\],]

===
The pattern name {node_name} gives me a bit of uneasiness. node_name_cont or name_chars would be preferable.

===
> [1-9][0-9]* {

I see no necessity to inhibit 0-prefixed integers as NUM. Would you mind allowing [0-9]+ there?

===
addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned char ychar) require different character types. Is there any reason for that?

===
I personally don't like the addlit*string() things for such a simple syntax, but that itself is acceptable enough for me. However it uses StringInfo to hold double-quoted names, which pallocs a 1024-byte memory chunk for every double-quoted name.
The chunks are finally stacked up, left uncollected, until the current memory context is deleted or reset (it is deleted just after config file processing finishes). In addition, setting s_s_names runs the parser twice. It seems too greedy to me, and it seems that a static char[NAMEDATALEN] would be enough, using the v12 way, without palloc/repalloc.

Comments for the parser part.

===
The rule "result" in syncgroup_gram.y sets a malloced chunk to SyncRepStandbys, ignoring the existing content, so repeatedly setting the GUC s_s_names causes a memory leak. Using SyncRepClearStandbyGroupList would be enough.

===
The meaning of SyncGroupNode.type seems obscure. The member seems to be referred to in order to decide how to treat the node, but the following code will break the assumption.

> group_node->type = SYNC_REP_GROUP_GROUP | SYNC_REP_GROUP_MAIN;

It seems to me that *_MAIN is an equivalent of *_GROUP && sync_method = *_PRIORITY. If so, *_MAIN is useless. The reader of SyncGroupNode need not see whether it came from the traditional s_s_names format or the new format.

===
Bare names in s_s_names are down-cased and double-quoted ones are not. The parser in this patch does neither.

===
xd_stringdup() doesn't make a copy of the string, contrary to its name. It's error-prone.

===
I found the name SyncGroupName.wait_num not intuitive. How about sync_num, sync_member_num or sync_standby_num? If the last is preferable, .members also should be .standbys .

Comment for the quorum commit body part.

===
I am quite uncomfortable with the existence of WalSnd.sync_standby_priority. It represented the priority in the old linear s_s_names format, but nested groups or even a single-level quorum list obviously doesn't fit it. Can we get rid of sync_standby_priority, even though we realize at most n-priority for now?

===
The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to have specific code for each prioritizing method (which are priority, quorum, nested and so on).
Is there any reason to use it as a callback of SyncGroupNode?

Others - random comments

===
SyncRepClearStandbyGroupList is defined in syncrep.c but the other related functions are defined in syncgroup_gram.y. It would be better to place them together.

===
SyncRepStandbys may be multilevel and the struct naturally allows it, but SyncRepClearStandbyGroupList assumes a single level. Make the function free multilevel structures, or explicitly inhibit multilevel structures using an assertion.

===
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));

The message doesn't contain a specific number of standbys, so just using the plural seems to be enough for me. And besides, the message should describe the situation more precisely. Word correction is left to anyone else:)

+ errdetail("The transaction has already committed locally, but might not have been replicated to some of the required standbys.")));

===
+ * Check whether specified standby is active, which means not only having
+ * pid but also having any priority.

"active" means not only a defined priority but also a reported WAL flush position.

+ * Check whether specified standby is active, which means not only having
+ * pid but also having any priority and valid flush position reported.

===
If there's no reason for SyncRepStandbyIsSync not to take a WalSnd directly, taking a walsnd is simpler.

static bool SyncRepStandbyIsSync(volatile WalSnd *walsnd);

===
> * Update the LSNs on each queue based upon our latest state. This
> * implements a simple policy of first-valid-standby-releases-waiter.
> *
> * Other policies are possible, which would change what we do here and what
> * perhaps also which information we store as well.
> */
> void
> SyncRepReleaseWaiters(void)

This comment looks wrong for the new code.
===
> * Select low priority standbys from walsnds array. If there are same
> * priority standbys, first defined standby is selected. It's possible
> * to have same priority different standbys, so we can not break loop
> * even when standby having target_prioirty priority is found.

"low priority" here seems to be a mistake for "high priority standbys" or "standbys with low priority values".

> * Returns the list of standbys in sync up to the number that
> * required to satisfy synchronous_standby_names. If there
> * are standbys with the same priority values, the first
> * defined ones are selected. It's possible for multiple
> * standbys to have a same priority value when multiple
> * walreceiver gives the same name, so we do not break the
> * inner loop just by finding a standby with the
> * target_priority.

===
> /* Got enough synchronous stnadby */

"stnadby" => "standbys"

===
This is a comment from the aspect of abstractness of objects. The callers of SyncRepGetSyncStandbysUsingPriority() need to care about the inside of SyncGroupNode, but what the function should return seems to be just a list of walsnds elements. The element number is useless when the SyncGroupNode nests.

> int
> SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)

This might need to expose 'volatile WalSnd*' (only the pointer type) outside of walsender. Or it should return a list of index numbers into *WalSndCtl->walsnds*.

===
The dependency definition seems to be wrong in the Makefile, so editing related files won't cause appropriate recompilation. syncgroup_gram.h and syncgroup_gram.c are generated at once from the .y file, and syncgroup_gram.o is generated from syncgroup_gram.c and syncgroup_scanner.c.
-syncgroup_gram.o: syncgroup_scanner.c
-
-syncgroup_gram.h: syncgroup_gram.c ;
+syncgroup_gram.o: syncgroup_scanner.c syncgroup_gram.c

===
In pg_stat_get_wal_senders, num_sync looks to have a chance of being used uninitialized, but I don't know why the compiler doesn't complain about it.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Sun, Feb 28, 2016 at 8:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached latest version patch.
>
> The changes from previous version are,
> - Fix parser, lexer bugs.
> - Add regression test patch based on patch Suraji submitted.
>
> Please review it.
>
> [000_multi_sync_replication_v13.patch]

Hi Masahiko,

I have a couple of small suggestions for the documentation and comments:

+        Specifies a standby names that can support <firstterm>synchronous replication</> using
+        either two types of syntax; comma-separated list or dedicated language, as
+        described in <xref linkend="synchronous-replication">.
+        Transcations waiting for commit will be allowed to proceed after the
+        specified number of standby servers confirms receipt of their data.

Suggestion: Specifies the standby names that can support
<firstterm>synchronous replication</> using either of two syntaxes: a
comma-separated list, or a more flexible syntax described in <xref
linkend="synchronous-replication">. Transactions waiting for commit will be
allowed to proceed after a configurable subset of standby servers confirms
receipt of their data. For the simple comma-separated list syntax, it is one
server.

+        If the current any of synchronous standbys disconnects for whatever reason,

s/the current any of/any of the current/

+        no mechanism to enforce uniqueness. For each specified standby name,
+        only the specified count of standbys will be chosen to be synchronous
+        standbys, though exactly which one is indeterminate, the rest will
+        represent potential synchronous standbys.

s/one/ones/
s/indeterminate, the/indeterminate. The/

+    made by a transcation have been transferred to one or more synchronous standby
+    server. This extends that standard levelof durability

s/transcation/transaction/
s/that standard levelof/the standard level of/

    offered by a transaction commit. This level of protection is referred to as
    2-safe replication in computer science theory.
Is this still called "2-safe" or does this patch make it "N-safe", "group-safe", or something else?

-   The minimum wait time is the roundtrip time between primary to standby.
+   The minimum wait time is the roundtrip time between primary to standbys.

Suggestion: The minimum wait time is the roundtrip time between the primary and the slowest synchronous standby.

+   Multiple synchronous replication is set up by setting <xref linkend="guc-synchronous-standby-names">
+   using dedicated language. The syntax of dedicated language is following.

Suggestion: Multiple synchronous replication is set up by setting <xref linkend="guc-synchronous-standby-names"> using the following syntax.

+   Using dedicated language, we can define a synchronous group with a number N.
+   synchronous group can have some members which are consdiered as synchronous standby using comma-separated list.
+   Any standby name is accepted at any position of its list, but '*' is accepted at only tailing of the standby list.
+   The leading N is a number which specifies that how many standbys the master server waits to commit for. This number
+   must be less than actual number of members of its group.
+   The listed standby are given highest priority from left defined starting with 1.

Suggestion: This syntax allows us to define a synchronous group that will wait for at least N standbys, and a comma-separated list of group members. The special value <literal>*</> is accepted at the tail of the member list, and matches any standby. The number N must not be greater than the number of members listed in the group, unless <literal>*</> is used. Priority is given to servers in the order that they appear in the list. The first named server has the highest priority.

+   All ASCII characters except for special characters(',', '"', '[', ']', ' ') are allowed as standby name.
+   When these special characters are used as standby name, whole standby name string need to be written in
+   double-quoted representation.

Suggestion: ...
are allowed in unquoted standby names. To use these special characters, the standby name should be enclosed in double quotes.

+ * In 9.5 we support the possibility to have multiple synchronous standbys,

s/9.5/9.6/

+ * as defined in synchronous_standby_names. Before on standby can become a

s/ on / a /

+ * Waiters will be released from the queue once the number of standbys
+ * specified in synchronous_standby_names have caught.

s/caught/processed the commit record/

+ * Check whether specified standby is active, which means not only having
+ * pid but also having any priority.

s/having any priority/having a non-zero priority (meaning it is configured as potential sync standby)./

-           announce_next_takeover = true;

By removing this, haven't we lost the ability to announce takeover more than once per walsender? I'm not sure exactly where this should go now, but the walsender needs to detect its own transition from potential to sync state. Also, that message, where it appears below, should probably be tweaked slightly s/the/a/, so "standby \"%s\" is now a synchronous standby with priority %u", not "... the synchronous standby ...".

 /*
+ * Return true if we have enough synchrononized standbys and the 'safe' written
+ * flushed LSNs, which are LSNs assured in all standbys considered should be
+ * synchronized.
+ */

Suggestion: Return true if we have enough synchronous standbys. If true, also store the 'safe' write and flush position in the output parameters write_pos and flush_pos, but only if the standby managed by this walsender is one of the standbys that has reached each safe position respectively.

+       /* Check whether each LSN has advanced to */

Suggestion: /* Check whether this standby has reached the safe positions. */

+/*
+ * Decide synced LSNs at this moment using priority method.
+ * If there are not active standbys enough to determine LSNs, return false.
s/not active standbys enough/not enough active standbys/

+/*
+ * Return the positions of the first group->wait_num synchronized standbys
+ * in group->member list into sync_list. sync_list is assumed to have enough
+ * space for at least group->wait_num elements.
+ */

s/Return/Write/
s/synchronized/synchronous/

Then add: "Return the number found."

+int
+SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, int *sync_list)
+{
+       int     target_priority = 1;    /* lowest priority is 1 */

1 is actually the *highest* priority standby.

+       /*
+        * Select low priority standbys from walsnds array. If there are same
+        * priority standbys, first defined standby is selected. It's possible
+        * to have same priority different standbys, so we can not break loop
+        * even when standby having target_prioirty priority is found.

s/target_prioirty/target_priority/

+                               /* Got enough synchronous stnadby */

s/stnadby/standbys/

+               ereport(ERROR,
+                               (errcode(ERRCODE_SYNTAX_ERROR),
+                               (errmsg_internal("The number of group memebers must be less than its group waits."))));

I'm not sure what the right error code is, but this isn't a syntax error. Maybe ERRCODE_CONFIG_FILE_ERROR or ERRCODE_INVALID_PARAMETER_VALUE? Suggestion for the message: "the configured number of synchronous standbys exceeds the length of the group of standby names: %d"

+       /*
+        * syncgroup_yyparse sets the global SyncRepStandbys as side effect.
+        * But this function is required to just check, so frees SyncRepStandbyNanes

s/SyncRepStandbyNanes/SyncRepStandbys/ ???

+               ereport(ERROR,
+                               (errcode(ERRCODE_SYNTAX_ERROR),
+                               (errmsg_internal("Invalid syntax. synchronous_standby_names parse returned %d",
+                                               parse_rc))));

Looking at other error messages I see that they always start with lower case and then put extra details after ':' rather than using a '.'. Maybe this could be "could not parse synchronous_standby_names: error code %d"?

+#define MAX_WALSENDER_NAME 8192

Seems to be unused.

Thanks!

--
Thomas Munro
http://www.enterprisedb.com
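[For readers skimming the archive, a hypothetical postgresql.conf fragment using the group syntax under review might look like the following. The node names are invented, and the exact grammar, including which bracket style selects the priority versus quorum method, was still in flux at this point in the thread:]

```
# Wait for 2 of the 3 listed standbys before a commit is acknowledged.
# The standby names are examples; each must match the application_name
# that the corresponding standby uses when it connects.
synchronous_standby_names = '2[tokyo, osaka, nagoya]'
```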
Hi,

Thank you so much for reviewing this patch!

All review comments regarding document and comment are fixed.
Attached latest v14 patch.

> This accepts 'abc^Id' as a name, which is wrong behavior (but
> such appliction names are not allowed anyway. If you assume so,
> I'd like to see a comment for that.).

'abc^Id' is accepted as application_name, no?

postgres(1)=# set application_name to 'abc^Id';
SET
postgres(1)=# show application_name ;
 application_name
------------------
 abc^Id
(1 row)

> addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned
> char ychar) requires differnt character types. Is there any reason
> for that?

Because addlit_xd_string() is for adding a string (char *) to xd_string, while addlit_xd_char() is for adding just one character to xd_string.

> I personally don't like addlit*string() things for such simple
> syntax but itself is acceptble enough for me. However it uses
> StringInfo to hold double-quoted names, which pallocs 1024 bytes
> of memory chunk for every double-quoted name. The chunks are
> finally stacked up left uncollected until the current
> memorycontext is deleted or reset (It is deleted just after
> finishing config file processing). Addition to that, setting
> s_s_names runs the parser twice. It seems to me too greedy and
> seems that static char [NAMEDATALEN] is enough using the v12 way
> without palloc/repalloc.

I thought that the length of a group name could be more than NAMEDATALEN, so I use StringInfo.
Is it not necessary?

> I found that the name SyncGroupName.wait_num is not
> instinctive. How about sync_num, sync_member_num or
> sync_standby_num? If the last is preferable, .members also should
> be .standbys .

Thanks, sync_num is preferable to me.

===
> I am quite uncomfortable with the existence of
> WanSnd.sync_standby_priority. It represented the pirority in the
> old linear s_s_names format but nested groups or even
> single-level quarum list obviously doesn't fit it.
> Can we get rid
> of sync_standby_priority, even though we realize atmost
> n-priority for now?

We could get rid of sync_standby_priority.
But if so, we will not be able to see the next sync standby in the pg_stat_replication system view.
Regarding each node's priority, I was thinking that standbys in a quorum list have the same priority, and in a nested group each standby is given a priority starting from 1.

===
> The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to
> have specific code for every prioritizing method (which are
> priority, quorum, nested and so). Is there any reson to use it as
> a callback of SyncGroupNode?

The reason why the current code is so is that the current code supports only the priority method.
In the first version of this feature, I'd like to keep the implementation simple.

Aside from this, of course I'm planning to have specific code for the nested design.
- The group can have some name nodes or group nodes.
- The group can use either of 2 types of method: priority or quorum.
- The group has SyncRepGetSyncedLsnsFn() and SyncRepGetStandbysFn().
- SyncRepGetSyncedLsnsFn() recursively determines the synced LSNs at that moment using the group's method.
- SyncRepGetStandbysFn() returns the standbys of its group which are considered as sync using the group's method.

For example, with s_s_name = '3(a, b, 2[c,d]::group1)', the SyncRepStandbys memory structure will be,

"main(quorum)" --- "a"
                |
                -- "b"
                |
                -- "group1(priority)" --- "c"
                                       |
                                       -- "d"

When determining synced LSNs, we need to compute group1's LSNs using the priority method first, and then we can determine main's LSNs using the quorum method with "a"'s LSNs, "b"'s LSNs and "group1"'s LSNs.
So the SyncRepGetSyncedLsnsUsingPriority() function would be,

bool
SyncRepGetSyncedLsnsUsingPriority(*group, *write_lsn, *flush_lsn)
{
    sync_num = group->SyncRepGetSyncStandbysFn(group, sync_list);

    if (sync_num < group->sync_num)
        return false;

    for (each member of sync_list)
    {
        if (member->type == group node)
            call SyncRepGetSyncedLsnsFn(member, w, f) and store w and f into lsn_list.
        else
            store the name node's LSNs into lsn_list.
    }

    Determine the synced LSNs of this group using lsn_list and the priority method.
    Store the synced LSNs into write_lsn and flush_lsn.
    return true;
}

> SyncRepClearStandbyGroupList is defined in syncrep.c but the
> other related functions are defined in syncgroup_gram.y. It would
> be better to place them together.

SyncRepClearStandbyGroupList() is used by check_synchronous_standby_names(), so I put this function in syncrep.c.

> SyncRepStandbys are to be in multilevel and the struct is
> naturally allowed to be so but SyncRepClearStandbyGroupList
> assumes it in single level.

Because I think that we don't need to fully support the nested style in the first version.
We have to carefully design this feature while considering expandability, but an overkill implementation could be a cause of crashes.
Considering the remaining time for 9.6, I feel we could implement the quorum method at best.

> This is a comment from the aspect of abstractness of objects.
> The callers of SyncRepGetSyncStandbysUsingPriority() need to care
> the inside of SyncGroupNode but what the function should just
> return seems to be the list of wansnds element. Element number is
> useless when the SyncGroupNode nests.
> > int
> > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)
> This might need to expose 'volatile WalSnd*' (only pointer type)
> outside of walsender.
> Or it should return the list of index number of
> *WalSndCtl->walsnds*.

SyncRepGetSyncStandbysUsingPriority() already returns the list of index numbers of "WalSndCtl->walsnd" as sync_list, no?
As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need to care about the inside of SyncGroupNode in my design.
Selecting sync nodes from its group doesn't depend on the type of node.
What SyncRepGetSyncStandbysFn() should do is to select sync nodes from *its* group.

Regards,

--
Masahiko Sawada
Attachment
On Thu, Mar 3, 2016 at 11:30 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Hi, > > Thank you so much for reviewing this patch! > > All review comments regarding document and comment are fixed. > Attached latest v14 patch. > >> This accepts 'abc^Id' as a name, which is wrong behavior (but >> such appliction names are not allowed anyway. If you assume so, >> I'd like to see a comment for that.). > > 'abc^Id' is accepted as application_name, no? > postgres(1)=# set application_name to 'abc^Id'; > SET > postgres(1)=# show application_name ; > application_name > ------------------ > abc^Id > (1 row) > >> addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned >> char ychar) requires differnt character types. Is there any reason >> for that? > > Because addlit_xd_string() is for adding string(char *) to xd_string, > OTOH addlit_xd_char() is for adding just one character to xd_string. > >> I personally don't like addlit*string() things for such simple >> syntax but itself is acceptble enough for me. However it uses >> StringInfo to hold double-quoted names, which pallocs 1024 bytes >> of memory chunk for every double-quoted name. The chunks are >> finally stacked up left uncollected until the current >> memorycontext is deleted or reset (It is deleted just after >> finishing config file processing). Addition to that, setting >> s_s_names runs the parser twice. It seems to me too greedy and >> seems that static char [NAMEDATALEN] is enough using the v12 way >> without palloc/repalloc. > > I though that length of group name could be more than NAMEDATALEN, so > I use StringInfo. > Is it not necessary? > >> I found that the name SyncGroupName.wait_num is not >> instinctive. How about sync_num, sync_member_num or >> sync_standby_num? If the last is preferable, .members also should >> be .standbys . > > Thanks, sync_num is preferable to me. > > === >> I am quite uncomfortable with the existence of >> WanSnd.sync_standby_priority. 
It represented the pirority in the >> old linear s_s_names format but nested groups or even >> single-level quarum list obviously doesn't fit it. Can we get rid >> of sync_standby_priority, even though we realize atmost >> n-priority for now? > > We could get rid of sync_standby_priority. > But if so, we will not be able to see the next sync standby in > pg_stat_replication system view. > Regarding each node priority, I was thinking that standbys in quorum > list have same priority, and in nested group each standbys are given > the priority starting from 1. > > === >> The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to >> have specific code for every prioritizing method (which are >> priority, quorum, nested and so). Is there any reson to use it as >> a callback of SyncGroupNode? > > The reason why the current code is so is that current code is for only > priority method supporting. > At first version of this feature, I'd like to implement it more simple. > > Aside from this, of course I'm planning to have specific code for nested design. > - The group can have some name nodes or group nodes. > - The group can use either 2 types of method: priority or quorum. > - The group has SyncRepGetSyncedLsnFn() and SyncRepGetStandbysFn() > - SyncRepGetSyncedLsnsFn() function recursively determine synced LSN > at that moment using group's method. > - SyncRepGetStandbysFn() function returns standbys of its group, > which are considered as sync using group's method. > > For example, s_s_name = '3(a, b, 2[c,d]::group1)', SyncRepStandbys > memory structure will be, > > "main(quorum)" --- "a" > | > -- "b" > | > -- "group1(priority)" --- "c" > | > -- "d" > > When determine synced LSNs, we need to consider group1's LSN using by > priority method at first, and then we can determine main's LSN using > by quorum method with "a" LSNs, "b" LSNs and "group1" LSNs. 
> So SyncRepGetSyncedLsnsUsingPriority() function would be, > > bool > SyncRepGetSyncedLsnsUsingPriority(*group, *write_lsn, *flush_lsn) > { > sync_num = group->SynRepGetSyncstandbysFn(group, sync_list); > > if (sync_num < group->sync_num) > return false; > > for (each member of sync_list) > { > if (member->type == group node) > call SyncRepGetSyncedLsnsFn(member, w, f) and store w and > f into lsn_list. > else > Store name node LSNs into lsn_list. > } > > Determine synced LSNs of this group using lsn_list and priority method. > Store synced LSNs into write_lsn and flush_lsn. > return true; > } > >> SyncRepClearStandbyGroupList is defined in syncrep.c but the >> other related functions are defined in syncgroup_gram.y. It would >> be better to place them together. > > SyncRepClearStandbyGroupList() is used by > check_synchronous_standby_names(), so I put this function syncrep.c. > >> SyncRepStandbys are to be in multilevel and the struct is >> naturally allowed to be so but SyncRepClearStandbyGroupList >> assumes it in single level. > > Because I think that we don't need to implement to fully support > nested style at first version. > We have to carefully design this feature while considering > expandability, but overkill implementation could be cause of crash. > Consider remaining time for 9.6, I feel we could implement quorum > method at best. > >> This is a comment from the aspect of abstractness of objects. >> The callers of SyncRepGetSyncStandbysUsingPriority() need to care >> the inside of SyncGroupNode but what the function should just >> return seems to be the list of wansnds element. Element number is >> useless when the SyncGroupNode nests. >> > int >> > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list) >> This might need to expose 'volatile WalSnd*' (only pointer type) >> outside of walsender. >> Or it should return the list of index number of >> *WalSndCtl->walsnds*. 
> > SyncRepGetSyncStandbysUsingPriority() already returns the list of > index number of "WalSndCtl->walsnd" as sync_list, no? > As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need care the > inside of SyncGroupNode in my design. > Selecting sync nodes from its group doesn't depend on the type of node. > What SyncRepGetSyncStandbyFn() should do is to select sync node from > *its* group. > Previous patch has bug around GUC parameter handling. Attached updated version. Regards, -- Masahiko Sawada
Attachment
On Fri, Mar 4, 2016 at 7:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Previous patch has bug around GUC parameter handling.
> Attached updated version.

I spotted a couple of typos:

+     used. Priority is given to servers in the order that the appear in the list.

s/the appear/they appear/

-     The minimum wait time is the roundtrip time between primary to standby.
+     The minimum wait time is the roundtrip time between the primary and the
+     almost synchronous standby.

s/almost/slowest/

--
Thomas Munro
http://www.enterprisedb.com
Hello,

Sorry for the long, hard-to-read writing in advance..

At Thu, 3 Mar 2016 23:30:49 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoD3XGZtuvgc5uKJdvcoJP5S0rvGQQCJLRL4rLsruRch5Q@mail.gmail.com>
> Hi,
>
> Thank you so much for reviewing this patch!
>
> All review comments regarding document and comment are fixed.
> Attached latest v14 patch.
>
> > This accepts 'abc^Id' as a name, which is wrong behavior (but
> > such appliction names are not allowed anyway. If you assume so,
> > I'd like to see a comment for that.).
>
> 'abc^Id' is accepted as application_name, no?
> postgres(1)=# set application_name to 'abc^Id';
> SET
> postgres(1)=# show application_name ;
>  application_name
> ------------------
>  abc^Id
> (1 row)

Sorry, I implicitly used "^" in the meaning of the "ctrl key". So "^I" is so-called Ctrl-I, that is, a horizontal tab or 0x09. So the following in psql shows that.

=# set application_name to E'abc\td';
=# show application_name ;
 application_name
------------------
 ab?d
(1 row)

The <tab> is replaced with '?' (literally) at the time of GUC assignment.

> > addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned
> > char ychar) requires differnt character types. Is there any reason
> > for that?
>
> Because addlit_xd_string() is for adding string(char *) to xd_string,
> OTOH addlit_xd_char() is for adding just one character to xd_string.

Umm. My question might have been a bit beside the point.

addlitchar_xd_string(str, unsigned char c) does appendStringInfoChar(, c). On the other hand, the signature of the stringinfo function is the following.

AppendStringInfoChar(StringInfo str, char ch);

Of course "char" is equivalent to "signed char" by default. addlitchar_xd_string assigns the given character in "unsigned char" to the "signed char" parameter of AppendStringInfoChar. These two are incompatible types.
Imagine the following codelet,

#include <stdio.h>

void hoge(signed char c)
{
    int ch = c;
    fprintf(stderr, "char = %d\n", ch);
}

int main(void)
{
    unsigned char u;

    u = 200;
    hoge(u);
    return 0;
}

The result is -56. So we should generally avoid such a mixture of signedness when there is no particular reason for it.

In this case, the domain of the variable is 0x20-0x7e, so the problem will not actually materialize, but there's also no reason for the signedness mixture.

> > I personally don't like addlit*string() things for such simple
> > syntax but itself is acceptble enough for me. However it uses
> > StringInfo to hold double-quoted names, which pallocs 1024 bytes
> > of memory chunk for every double-quoted name. The chunks are
> > finally stacked up left uncollected until the current
> > memorycontext is deleted or reset (It is deleted just after
> > finishing config file processing). Addition to that, setting
> > s_s_names runs the parser twice. It seems to me too greedy and
> > seems that static char [NAMEDATALEN] is enough using the v12 way
> > without palloc/repalloc.
>
> I though that length of group name could be more than NAMEDATALEN, so
> I use StringInfo.
> Is it not necessary?

Such long names don't seem to be necessary. Too long identifiers no longer act as identifiers for human eyeballs. We are limiting the length of identifiers of the whole database system to NAMEDATALEN-1, which seems to have been enough, so I don't see any reason to have a group name longer than that.

> > I found that the name SyncGroupName.wait_num is not
> > instinctive. How about sync_num, sync_member_num or
> > sync_standby_num? If the last is preferable, .members also should
> > be .standbys .
>
> Thanks, sync_num is preferable to me.
>
> ===
> > I am quite uncomfortable with the existence of
> > WanSnd.sync_standby_priority. It represented the pirority in the
> > old linear s_s_names format but nested groups or even
> > single-level quarum list obviously doesn't fit it. Can we get rid
> > of sync_standby_priority, even though we realize atmost
> > n-priority for now?
>
> We could get rid of sync_standby_priority.
> But if so, we will not be able to see the next sync standby in
> pg_stat_replication system view.
> Regarding each node priority, I was thinking that standbys in quorum
> list have same priority, and in nested group each standbys are given
> the priority starting from 1.

As far as I can see, the variable is referred to as a boolean to indicate whether a walsender is connected to a candidate synchronous standby. So the value is totally useless, at least for now. However, SyncRepReleaseWaiters uses the value to check if the synced LSNs can be advanced by a walsender, so the variable is useful as a boolean.

In the previous versions, the reason why WalSnd had the priority value is that a pair of synchronized LSNs is determined by only one walsender, which has the highest priority among the active walsenders. So even if a walsender receives a response from the walreceiver, it doesn't need to do anything if it is not at the highest priority. It's a simple world.

In the quorum commit world, in contrast, what SyncRepGetSyncStandbysFn should do is return certain private information to be used to calculate a pair of safe/synced LSNs in SyncRepGetSyncedLsnsFn, looking into the WalSndCtl->walsnds list. The latter passes a pair of safe/synced LSNs to the upper-level list, or to SyncRepSyncedLsnAdvancedTo as the topmost caller. There's no room for sync_standby_priority to work toward its original objective.

Even if we assign the value in the explained way, the values are always 1 for the quorum method and duplicate values for the multiple-priority method. What do you want to show to users by the value?

> ===
> > The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to
> > have specific code for every prioritizing method (which are
> > priority, quorum, nested and so). Is there any reson to use it as
> > a callback of SyncGroupNode?
>
> The reason why the current code is so is that current code is for only
> priority method supporting.
> At first version of this feature, I'd like to implement it more simple.
>
> Aside from this, of course I'm planning to have specific code for nested design.
> - The group can have some name nodes or group nodes.
> - The group can use either 2 types of method: priority or quorum.
> - The group has SyncRepGetSyncedLsnFn() and SyncRepGetStandbysFn()
> - SyncRepGetSyncedLsnsFn() function recursively determine synced LSN
> at that moment using group's method.
> - SyncRepGetStandbysFn() function returns standbys of its group,
> which are considered as sync using group's method.
>
> For example, s_s_name = '3(a, b, 2[c,d]::group1)', SyncRepStandbys
> memory structure will be,
>
> "main(quorum)" --- "a"
>                 |
>                 -- "b"
>                 |
>                 -- "group1(priority)" --- "c"
>                                        |
>                                        -- "d"
>
> When determine synced LSNs, we need to consider group1's LSN using by
> priority method at first, and then we can determine main's LSN using
> by quorum method with "a" LSNs, "b" LSNs and "group1" LSNs.
> So SyncRepGetSyncedLsnsUsingPriority() function would be,

Thank you for the explanation. I *recalled* that.

> > SyncRepClearStandbyGroupList is defined in syncrep.c but the
> > other related functions are defined in syncgroup_gram.y. It would
> > be better to place them together.
>
> SyncRepClearStandbyGroupList() is used by
> check_synchronous_standby_names(), so I put this function syncrep.c.

Thanks.

> > SyncRepStandbys are to be in multilevel and the struct is
> > naturally allowed to be so but SyncRepClearStandbyGroupList
> > assumes it in single level.
>
> Because I think that we don't need to implement to fully support
> nested style at first version.
> We have to carefully design this feature while considering
> expandability, but overkill implementation could be cause of crash.
> Consider remaining time for 9.6, I feel we could implement quorum
> method at best.

Yes, so I proposed to add Assert() in the function.

> > This is a comment from the aspect of abstractness of objects.
> > The callers of SyncRepGetSyncStandbysUsingPriority() need to care
> > the inside of SyncGroupNode but what the function should just
> > return seems to be the list of wansnds element. Element number is
> > useless when the SyncGroupNode nests.
> > > int
> > > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)
> > This might need to expose 'volatile WalSnd*' (only pointer type)
> > outside of walsender.
> > Or it should return the list of index number of
> > *WalSndCtl->walsnds*.
>
> SyncRepGetSyncStandbysUsingPriority() already returns the list of
> index number of "WalSndCtl->walsnd" as sync_list, no?

Yes, I myself don't understand what I tried to say by this :(
Maybe I mistook what sync_list returns as an index list of SyncGroupNode. Anyway, sorry for the noise.

> As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need care the
> inside of SyncGroupNode in my design.
> Selecting sync nodes from its group doesn't depend on the type of node.
> What SyncRepGetSyncStandbyFn() should do is to select sync node from
> *its* group.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Reply to multiple hackers. Thank you for reviewing this patch. > + used. Priority is given to servers in the order that the appear > in the list. > > s/the appear/they appear/ > > - The minimum wait time is the roundtrip time between primary to standby. > + The minimum wait time is the roundtrip time between the primary and the > + almost synchronous standby. > > s/almost/slowest/ Will fix this typo. Thanks! On Fri, Mar 4, 2016 at 5:22 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > Sorry for long, hard-to-read writings in advance.. > > At Thu, 3 Mar 2016 23:30:49 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoD3XGZtuvgc5uKJdvcoJP5S0rvGQQCJLRL4rLsruRch5Q@mail.gmail.com> >> Hi, >> >> Thank you so much for reviewing this patch! >> >> All review comments regarding document and comment are fixed. >> Attached latest v14 patch. >> >> > This accepts 'abc^Id' as a name, which is wrong behavior (but >> > such appliction names are not allowed anyway. If you assume so, >> > I'd like to see a comment for that.). >> >> 'abc^Id' is accepted as application_name, no? >> postgres(1)=# set application_name to 'abc^Id'; >> SET >> postgres(1)=# show application_name ; >> application_name >> ------------------ >> abc^Id >> (1 row) > > Sorry, I implicitly used "^" in the meaning of "ctrl key". So > "^I" is so-called Ctrl-I, that is horizontal tab or 0x09. So the > following in psql shows that. > > =# set application_name to E'abc\td'; > =# show application_name ; > application_name > ------------------ > ab?d > (1 row) > > The <tab> is replaced with '?' (literally) at the time of > guc assinment. Oh, I see. I will comment for that. >> > addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned >> > char ychar) requires differnt character types. Is there any reason >> > for that? >> >> Because addlit_xd_string() is for adding string(char *) to xd_string, >> OTOH addlit_xd_char() is for adding just one character to xd_string. > > Umm. 
My qustion might have been a bit out of the point. > > The addlitchar_xd_string(str,unsigned char c) does > appendStringInfoChar(, c). On the other hand, the signature of > the function of stringinfo is the following. > > AppendStringInfoChar(StringInfo str, char ch); > > Of course "char" is equivalent of "signed char" as > default. addlitchar_xd_string assigns the given character in > "unsigned char" to the parameter of AppendStringInfoChar of > "signed char". > > These two are incompatible types. Imagine the > following codelet, > > #include <stdio.h> > > void hoge(signed char c){ > int ch = c; > fprintf(stderr, "char = %d\n", ch); > } > > int main(void) > { > unsigned char u; > > u = 200; > hoge(u); > return 0; > } > > The result is -56. So we generally should get rid of such type of > mixture of signedness for no particular reason. > > In this case, the domain of the variable is 0x20-0x7e so no > problem won't be actualized but also there's no reason for the > signedness mixture. Thank you for explanation. I will fix this. >> > I personally don't like addlit*string() things for such simple >> > syntax but itself is acceptble enough for me. However it uses >> > StringInfo to hold double-quoted names, which pallocs 1024 bytes >> > of memory chunk for every double-quoted name. The chunks are >> > finally stacked up left uncollected until the current >> > memorycontext is deleted or reset (It is deleted just after >> > finishing config file processing). Addition to that, setting >> > s_s_names runs the parser twice. It seems to me too greedy and >> > seems that static char [NAMEDATALEN] is enough using the v12 way >> > without palloc/repalloc. >> >> I though that length of group name could be more than NAMEDATALEN, so >> I use StringInfo. >> Is it not necessary? > > Such long names doesn't seem to necessary. Too long identifiers > no longer act as identifier for human eyeballs. 
We are limiting > the length of identifiers of the whole database system to > NAMEDATALEN-1, which seems to have been enough so I don't see any > reason to have a group name longer than that. > I see. I will fix this. >> > I found that the name SyncGroupName.wait_num is not >> > instinctive. How about sync_num, sync_member_num or >> > sync_standby_num? If the last is preferable, .members also should >> > be .standbys . >> >> Thanks, sync_num is preferable to me. >> >> === >> > I am quite uncomfortable with the existence of >> > WanSnd.sync_standby_priority. It represented the pirority in the >> > old linear s_s_names format but nested groups or even >> > single-level quarum list obviously doesn't fit it. Can we get rid >> > of sync_standby_priority, even though we realize atmost >> > n-priority for now? >> >> We could get rid of sync_standby_priority. >> But if so, we will not be able to see the next sync standby in >> pg_stat_replication system view. >> Regarding each node priority, I was thinking that standbys in quorum >> list have same priority, and in nested group each standbys are given >> the priority starting from 1. > > As far as I can see the varialbe is referred to as a boolean to > indicate whether a walsernder is connected to a candidate > synchronous standby. So the value is totally useless, at least > for now. However, SyncRepRelaseWaiters uses the value to check if > the synced LSNs can be advaned by a walsender so the variable is > useful as a boolean. > > In the previous versions, the reason why WanSnd had the priority > value is that a pair of synchronized LSNs is determined only by > one wansender, which has the highest priority among active > wansenders. So even if a walsender receives a response from > walreceiver, it doesn't need to do nothing if it is not at the > highest priority. It's a simple world. 
> In the quorum commit world, in contrast, what
> SyncRepGetSyncStandbysFn should do is return certain private
> information to be used to calculate a pair of safe/synced LSNs
> in SyncRepGetSyncedLsnsFn, looking into the WalSndCtl->walsnds
> list. The latter passes a pair of safe/synced LSNs to the upper
> level list, or to SyncRepSyncedLsnAdvancedTo as the topmost
> caller. There's no room for sync_standby_priority to work for its
> original objective.
>
> Even if we assign the value in the way explained, the values are
> always 1 for the quorum method and duplicate values for the
> priority method. What do you want to show to users by the value?

I agree with you.
When we implement the nested style of multiple synchronous
replication, it would be tough to show this to users via
sync_standby_priority. But for our current first goal
(implementing the 1-nest style), it doesn't seem necessary to get
rid of sync_standby_priority from WalSnd so far, no?

Towards the multiple nested style, I'm roughly planning to define
a new system view as follows.
- The new system view shows information on all groups and nodes.
- Move sync_state from pg_stat_replication to the new system view.
- Get rid of sync_priority from pg_stat_replication.
- Add a new sync_state 'quorum' that indicates candidate sync
  standbys of a group using the quorum method.
- If the parent group's state is potential, a 'potential:' prefix
  is added to the child standby's sync_state.
* s_s_names = '2[a, 1(b,c):group1, 1[d,e]:group2]'

   name   | sync_method |      member       | sync_num |     sync_state      | parent_group
----------+-------------+-------------------+----------+---------------------+--------------
 main     | priority    | {a,group1,group2} |        2 |                     |
 a        |             |                   |          | sync                | main
 group1   | quorum      | {b,c}             |        1 | sync                | main
 b        |             |                   |          | sync                | group1
 c        |             |                   |          | potential           | group1
 group2   | priority    | {d,e}             |        1 | potential           | main
 d        |             |                   |          | potential:sync      | group2
 e        |             |                   |          | potential:potential | group2
(8 rows)

* s_s_names = '2(a, 1[b,c]:group1, 1(d,e):group2)'

   name   | sync_method |      member       | sync_num | sync_state | parent_group
----------+-------------+-------------------+----------+------------+--------------
 main     | quorum      | {a,group1,group2} |        2 |            |
 a        |             |                   |          | quorum     | main
 group1   | priority    | {b,c}             |        1 | quorum     | main
 b        |             |                   |          | sync       | group1
 c        |             |                   |          | potential  | group1
 group2   | quorum      | {d,e}             |        1 | quorum     | main
 d        |             |                   |          | quorum     | group2
 e        |             |                   |          | quorum     | group2
(8 rows)

>> > SyncRepStandbys are to be multilevel, and the struct naturally
>> > allows that, but SyncRepClearStandbyGroupList assumes it is
>> > single-level.
>>
>> Because I think that we don't need to fully support the nested
>> style in the first version.
>> We have to carefully design this feature while considering
>> expandability, but an overkill implementation could be a cause of
>> crashes. Considering the remaining time for 9.6, I feel we could
>> implement the quorum method at best.
>
> Yes, so I proposed to add an Assert() in the function.

Will add it.

Regards,

--
Masahiko Sawada
On Fri, Mar 4, 2016 at 3:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Thu, Mar 3, 2016 at 11:30 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Hi, >> >> Thank you so much for reviewing this patch! >> >> All review comments regarding document and comment are fixed. >> Attached latest v14 patch. >> >>> This accepts 'abc^Id' as a name, which is wrong behavior (but >>> such appliction names are not allowed anyway. If you assume so, >>> I'd like to see a comment for that.). >> >> 'abc^Id' is accepted as application_name, no? >> postgres(1)=# set application_name to 'abc^Id'; >> SET >> postgres(1)=# show application_name ; >> application_name >> ------------------ >> abc^Id >> (1 row) >> >>> addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned >>> char ychar) requires differnt character types. Is there any reason >>> for that? >> >> Because addlit_xd_string() is for adding string(char *) to xd_string, >> OTOH addlit_xd_char() is for adding just one character to xd_string. >> >>> I personally don't like addlit*string() things for such simple >>> syntax but itself is acceptble enough for me. However it uses >>> StringInfo to hold double-quoted names, which pallocs 1024 bytes >>> of memory chunk for every double-quoted name. The chunks are >>> finally stacked up left uncollected until the current >>> memorycontext is deleted or reset (It is deleted just after >>> finishing config file processing). Addition to that, setting >>> s_s_names runs the parser twice. It seems to me too greedy and >>> seems that static char [NAMEDATALEN] is enough using the v12 way >>> without palloc/repalloc. >> >> I though that length of group name could be more than NAMEDATALEN, so >> I use StringInfo. >> Is it not necessary? >> >>> I found that the name SyncGroupName.wait_num is not >>> instinctive. How about sync_num, sync_member_num or >>> sync_standby_num? If the last is preferable, .members also should >>> be .standbys . 
>> >> Thanks, sync_num is preferable to me. >> >> === >>> I am quite uncomfortable with the existence of >>> WanSnd.sync_standby_priority. It represented the pirority in the >>> old linear s_s_names format but nested groups or even >>> single-level quarum list obviously doesn't fit it. Can we get rid >>> of sync_standby_priority, even though we realize atmost >>> n-priority for now? >> >> We could get rid of sync_standby_priority. >> But if so, we will not be able to see the next sync standby in >> pg_stat_replication system view. >> Regarding each node priority, I was thinking that standbys in quorum >> list have same priority, and in nested group each standbys are given >> the priority starting from 1. >> >> === >>> The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to >>> have specific code for every prioritizing method (which are >>> priority, quorum, nested and so). Is there any reson to use it as >>> a callback of SyncGroupNode? >> >> The reason why the current code is so is that current code is for only >> priority method supporting. >> At first version of this feature, I'd like to implement it more simple. >> >> Aside from this, of course I'm planning to have specific code for nested design. >> - The group can have some name nodes or group nodes. >> - The group can use either 2 types of method: priority or quorum. >> - The group has SyncRepGetSyncedLsnFn() and SyncRepGetStandbysFn() >> - SyncRepGetSyncedLsnsFn() function recursively determine synced LSN >> at that moment using group's method. >> - SyncRepGetStandbysFn() function returns standbys of its group, >> which are considered as sync using group's method. 
>>
>> For example, with s_s_names = '3(a, b, 2[c,d]::group1)', the
>> SyncRepStandbys memory structure will be:
>>
>> "main(quorum)" --- "a"
>>                 |
>>                 -- "b"
>>                 |
>>                 -- "group1(priority)" --- "c"
>>                                        |
>>                                        -- "d"
>>
>> When determining synced LSNs, we need to consider group1's LSN
>> using the priority method first, and then we can determine main's
>> LSN using the quorum method with "a"'s LSNs, "b"'s LSNs and
>> "group1"'s LSNs.
>> So the SyncRepGetSyncedLsnsUsingPriority() function would be:
>>
>> bool
>> SyncRepGetSyncedLsnsUsingPriority(*group, *write_lsn, *flush_lsn)
>> {
>>     sync_num = group->SyncRepGetSyncStandbysFn(group, sync_list);
>>
>>     if (sync_num < group->sync_num)
>>         return false;
>>
>>     for (each member of sync_list)
>>     {
>>         if (member->type == group node)
>>             call SyncRepGetSyncedLsnsFn(member, w, f) and store
>>             w and f into lsn_list.
>>         else
>>             store the name node's LSNs into lsn_list.
>>     }
>>
>>     Determine the synced LSNs of this group using lsn_list and
>>     the priority method.
>>     Store the synced LSNs into write_lsn and flush_lsn.
>>     return true;
>> }
>>
>>> SyncRepClearStandbyGroupList is defined in syncrep.c but the
>>> other related functions are defined in syncgroup_gram.y. It would
>>> be better to place them together.
>>
>> SyncRepClearStandbyGroupList() is used by
>> check_synchronous_standby_names(), so I put this function in
>> syncrep.c.
>>
>>> SyncRepStandbys are to be multilevel and the struct naturally
>>> allows that, but SyncRepClearStandbyGroupList assumes it is
>>> single-level.
>>
>> Because I think that we don't need to fully support the nested
>> style in the first version.
>> We have to carefully design this feature while considering
>> expandability, but an overkill implementation could be a cause of
>> crashes. Considering the remaining time for 9.6, I feel we could
>> implement the quorum method at best.
>>
>>> This is a comment from the aspect of abstractness of objects.
>>> The callers of SyncRepGetSyncStandbysUsingPriority() need to care
>>> about the inside of SyncGroupNode, but what the function should
>>> return seems to just be a list of walsnds elements. The element
>>> number is useless when the SyncGroupNode nests.
>>> > int
>>> > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)
>>> This might need to expose 'volatile WalSnd *' (only the pointer
>>> type) outside of walsender.
>>> Or it should return a list of index numbers into
>>> WalSndCtl->walsnds.
>>
>> SyncRepGetSyncStandbysUsingPriority() already returns the list of
>> index numbers into "WalSndCtl->walsnds" as sync_list, no?
>> As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need to
>> care about the inside of SyncGroupNode in my design.
>> Selecting sync nodes from a group doesn't depend on the type of
>> node. What SyncRepGetSyncStandbysFn() should do is select the
>> sync nodes from *its* group.
>>
>
> Previous patch has a bug around GUC parameter handling.
> Attached updated version.

Thanks for updating the patch!

Now I'm fixing some problems (e.g., the current patch doesn't work
in an EXEC_BACKEND environment) and revising the patch.
I will post the revised version this weekend or in the first half
of next week.

Regards,

--
Fujii Masao
    <para>
     Synchronous replication offers the ability to confirm that all changes
-    made by a transaction have been transferred to one synchronous standby
-    server. This extends the standard level of durability
+    made by a transaction have been transferred to one or more synchronous standby
+    server. This extends that standard level of durability
     offered by a transaction commit. This level of protection is referred
-    to as 2-safe replication in computer science theory.
+    to as group-safe replication in computer science theory.
    </para>

A message on the -general list today pointed me to some earlier
discussion[1] which quoted and referenced definitions of these
academic terms[2]. I think the above documentation should say:

"This level of protection is referred to as 2-safe replication in
computer science literature when <variable>synchronous_commit</> is
set to <literal>on</>, and group-1-safe (group-safe and 1-safe) when
<variable>synchronous_commit</> is set to <literal>remote_write</>."

By my reading, the situation doesn't actually change with this patch.
It doesn't matter whether you need 1 or 42 synchronous standbys to
make a quorum: 2-safe means durable (fsync) on all of them, and
group-1-safe means durable on one server and received (implied by
remote_write) by all of them.

I think we should be using those definitions because Gray's earlier
definition of 2-safe from Transaction Processing 12.6.3 doesn't really
fit: it can optionally mean remote receipt or remote durable storage,
but it doesn't wait if the 'backup' is down, so it's not the same type
of guarantee. (He also has 'very safe', which might describe our
syncrep; I'm not sure.)
[1] http://www.postgresql.org/message-id/603c8f070812132142n5408e7ddk899e83cddd4cb0b2@mail.gmail.com
[2] http://infoscience.epfl.ch/record/33053/files/EPFL_TH2577.pdf page 76

--
Thomas Munro
http://www.enterprisedb.com
It seems to me a matter of the definition of "available replicas".

At Wed, 16 Mar 2016 14:13:48 +1300, Thomas Munro
<thomas.munro@enterprisedb.com> wrote in
<CAEepm=3Ye+Ax_5=MZeHMkx9DFn25QoRzs362sQGNvGcVWx+18w@mail.gmail.com>
>     <para>
>      Synchronous replication offers the ability to confirm that all changes
> -    made by a transaction have been transferred to one synchronous standby
> -    server. This extends the standard level of durability
> +    made by a transaction have been transferred to one or more
> synchronous standby
> +    server. This extends that standard level of durability
>      offered by a transaction commit. This level of protection is referred
> -    to as 2-safe replication in computer science theory.
> +    to as group-safe replication in computer science theory.
>     </para>
>
> A message on the -general list today pointed me to some earlier
> discussion[1] which quoted and referenced definitions of these
> academic terms[2]. I think the above documentation should say:
>
> "This level of protection is referred to as 2-safe replication in
> computer science literature when <variable>synchronous_commit</> is
> set to <literal>on</>, and group-1-safe (group-safe and 1-safe) when
> <variable>synchronous_commit</> is set to <literal>remote_write</>."

I suppose that the "available replica" in the paper is equivalent
to "the one chosen synchronous server" at the top of the queue of
living standbys specified by s_s_names. The original description
is true based on this interpretation.

> By my reading, the situation doesn't actually change with this patch.
> It doesn't matter whether you need 1 or 42 synchronous standbys to
> make a quorum: 2-safe means durable (fsync) on all of them,
> group-1-safe means durable on one server and received (implied by
> remote_write) by all of them.

Likewise, "the first two of the living standbys" (2[r01, ..r42])
plus the master is translated to "three replicas". So it keeps
2-safe in that case.

> I think we should be using those definitions because Gray's earlier
> definition of 2-safe from Transaction Processing 12.6.3 doesn't really
> fit: It can optionally mean remote receipt or remote durable storage,
> but it doesn't wait if the 'backup' is down, so it's not the same type
> of guarantee. (He also has 'very safe' which might describe our
> syncrep, I'm not sure.)

If the discussion above is true, the description doesn't seem to
need to be amended from the viewpoint of the safe-criteria.

>     <para>
>      Synchronous replication offers the ability to confirm that all changes
> -    made by a transaction have been transferred to one synchronous standby
> -    server. This extends the standard level of durability
> +    made by a transaction have been transferred to one or more synchronous standby
> +    server. This extends that standard level of durability
>      offered by a transaction commit. This level of protection is referred
>      to as 2-safe replication in computer science theory.
>     </para>

But some additional explanation might be needed. For true quorum
commit, a client will be notified when the master and any n of all
the standbys have committed. This won't fit exactly into the
criteria in the paper.

Regarding Gray's definition, "2-safe" looks to be PG's syncrep
with an automatic release mechanism, such as what pgsql-RA offers.
And "high availability" doesn't seem to fit PostgreSQL's behavior,
because the master virtually commits a transaction before reaching
an agreement to commit among all the replicas.

# I'm reading it in Japanese so some words may be incorrect.

Thoughts?
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Mar 10, 2016 at 7:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 3:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Previous patch has bug around GUC parameter handling.
>> Attached updated version.
>
> Thanks for updating the patch!
>
> Now I'm fixing some problems (e.g., current patch doesn't work
> with EXEC_BACKEND environment) and revising the patch.

Sorry for the delay... Here is the revised version of the patch.
Please review and test this version!

BTW, I've not revised the documentation or the regression tests yet.
I will do that during the review and testing of the patch.

Regards,

--
Fujii Masao
Attachment
Thank you for the revised patch. At Tue, 22 Mar 2016 16:02:39 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwGnvuX8wR-FYH+TrNi_TWunZzU=nJFMdXkO6O8M4GbNvQ@mail.gmail.com> > On Thu, Mar 10, 2016 at 7:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > Sorry for the delay... Here is the revised version of the patch. > Please review and test this version! > BTW, I've not revised the documentation and regression test yet. > I will do that during the review and test of the patch. This version looks to focus on the n-priority method. The stuff for the other methods, like n-quorum, has been removed. That is okay for me, so using WalSnd->sync_standby_priority is reasonable. SyncRepGetSyncStandbys seems to work as expected, that is, it collects n standbys in order of priority; standbys at the same priority are taken in (pseudo) random order, not LSN order. This is the difference from a true quorum method. About the announcement of takeover, > if (announce_next_takeover && am_sync) > { > announce_next_takeover = false; > ereport(LOG, > (errmsg("standby \"%s\" is now the synchronous standby with priority %u", > application_name, MyWalSnd->sync_standby_priority))); This can announce the seemingly same standby successively if standbys with the same application_name keep coming in and going out. But this is the same as the current behavior. Otherwise, as far as I can see, SyncRepReleaseWaiters seems to work correctly. SyncRepInitConfig parses s_s_names and then prioritizes all walsenders based on the result. It is run at the start of a walsender and at config reload. Walsenders that have ended are excluded when collecting sync standbys. All of this seems to work properly (as before). The parser became far simpler by getting rid of the stuff for future expansion. It accepts only '<n>[name, ...]' and the old s_s_names format. 
Using StringInfo for double-quoted names seems to me to be overkill, since it allocates a 1024-byte block for every such name. A static buffer seems enough for this usage, as I said. The parser is called not only on SIGHUP but also at the start of every walsender. The latter is not necessary, but that is a matter of trade-off between simplicity and effectiveness. The same can be said for check_synchronous_standby_names(). regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Mar 22, 2016 at 9:58 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Thank you for the revised patch. Thanks for reviewing the patch! > This version looks to focus on n-priority method. Stuffs for the > other methods like n-quorum has been removed. It is okay for me. I don't think it's so difficult to extend this version so that it supports also quorum commit. > StringInfo for double-quoted names seems to me to be overkill, > since it allocates 1024 byte block for every such name. A static > buffer seems enough for the usage as I said. So, what about changing the scanner code as follows? <xd>{xdstop} { yylval.str = pstrdup(xdbuf.data); pfree(xdbuf.data); BEGIN(INITIAL); return NAME; > The parser is called for not only for SIGHUP, but also for > starting of every walsender. The latter is not necessary but it > is the matter of trade-off between simplisity and > effectiveness. Could you elaborate why you think that's not necessary? BTW, in previous patch, s_s_names is parsed by postmaster during the server startup. A child process takes over the internal data struct for the parsed s_s_names when it's forked by the postmaster. This is what the previous patch was expecting. However, this doesn't work in EXEC_BACKEND environment. In that environment, the data struct should be passed to a child process via the special file (like write_nondefault_variables() does), or it should be constructed during walsender startup (like latest version of the patch does). IMO the latter is simpler. Regards, -- Fujii Masao
On Tue, Mar 22, 2016 at 11:08 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Tue, Mar 22, 2016 at 9:58 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> Thank you for the revised patch. > > Thanks for reviewing the patch! > >> This version looks to focus on n-priority method. Stuffs for the >> other methods like n-quorum has been removed. It is okay for me. > > I don't think it's so difficult to extend this version so that > it supports also quorum commit. Yeah, 1-nest level implementation would not so difficult. >> StringInfo for double-quoted names seems to me to be overkill, >> since it allocates 1024 byte block for every such name. A static >> buffer seems enough for the usage as I said. > > So, what about changing the scanner code as follows? > > <xd>{xdstop} { > yylval.str = pstrdup(xdbuf.data); > pfree(xdbuf.data); > BEGIN(INITIAL); > return NAME; >> The parser is called for not only for SIGHUP, but also for >> starting of every walsender. The latter is not necessary but it >> is the matter of trade-off between simplisity and >> effectiveness. > > Could you elaborate why you think that's not necessary? > > BTW, in previous patch, s_s_names is parsed by postmaster during the server > startup. A child process takes over the internal data struct for the parsed > s_s_names when it's forked by the postmaster. This is what the previous > patch was expecting. However, this doesn't work in EXEC_BACKEND environment. > In that environment, the data struct should be passed to a child process via > the special file (like write_nondefault_variables() does), or it should > be constructed during walsender startup (like latest version of the patch > does). IMO the latter is simpler. Thank you for updating patch. Followings are random review comments. == + for (cell = list_head(pending); cell; cell = next) Can we use foreach() instead? 
== + pending = list_delete_cell(pending, cell, prev); + + if (list_length(pending) == 0) + { + list_free(pending); + return result; /* Exit if pending list is empty */ + } If the pending list becomes empty after deleting an element, we can return immediately. It's a small optimisation. == If num_sync is greater than the number of members of the sync standby list, we'd rather return an error message immediately. Thoughts? == I got an assertion failure when the master server is set up with an empty s_s_names, because the current patch always tries to parse s_s_names and use it regardless of the value of the parameter. The attached patch incorporates the above comments. Please find it. Regards, -- Masahiko Sawada
Attachment
On Wed, Mar 23, 2016 at 2:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Tue, Mar 22, 2016 at 11:08 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Tue, Mar 22, 2016 at 9:58 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>> Thank you for the revised patch. >> >> Thanks for reviewing the patch! >> >>> This version looks to focus on n-priority method. Stuffs for the >>> other methods like n-quorum has been removed. It is okay for me. >> >> I don't think it's so difficult to extend this version so that >> it supports also quorum commit. > > Yeah, 1-nest level implementation would not so difficult. > >>> StringInfo for double-quoted names seems to me to be overkill, >>> since it allocates 1024 byte block for every such name. A static >>> buffer seems enough for the usage as I said. >> >> So, what about changing the scanner code as follows? >> >> <xd>{xdstop} { >> yylval.str = pstrdup(xdbuf.data); >> pfree(xdbuf.data); >> BEGIN(INITIAL); >> return NAME; I applied this change to the latest version of the patch. Please check that. Also I changed syncrep.c so that it uses list_free_deep() to free the list of the parsed s_s_names. Because the data in the list is palloc'd by syncrep_scanner.l. >>> The parser is called for not only for SIGHUP, but also for >>> starting of every walsender. The latter is not necessary but it >>> is the matter of trade-off between simplisity and >>> effectiveness. >> >> Could you elaborate why you think that's not necessary? >> >> BTW, in previous patch, s_s_names is parsed by postmaster during the server >> startup. A child process takes over the internal data struct for the parsed >> s_s_names when it's forked by the postmaster. This is what the previous >> patch was expecting. However, this doesn't work in EXEC_BACKEND environment. 
>> In that environment, the data struct should be passed to a child process via >> the special file (like write_nondefault_variables() does), or it should >> be constructed during walsender startup (like latest version of the patch >> does). IMO the latter is simpler. > > Thank you for updating patch. > > Followings are random review comments. > > == > + for (cell = list_head(pending); cell; cell = next) > > Can we use foreach() instead? Yes. > == > + pending = list_delete_cell(pending, cell, prev); > + > + if (list_length(pending) == 0) > + { > + list_free(pending); > + return result; /* > Exit if pending list is empty */ > + } > > If pending list become empty after deleting element, we can return. > It's a small optimisation. I don't think this is necessary because we can already get out of the loop immediately after that deletion. But I found a bug in the calculation of the next highest priority, which could cause an extra unnecessary loop. I fixed that in the latest version of the patch. > == > If num_sync is greater than the number of members of sync standby > list, we'd rather return error message immediately. > Thoughts? No. For example, please imagine the case where s_s_names is set to '*' and more than one sync standby is connecting to the master. That's a valid setting. > == > I got assertion error when master server is set up with empty s_s_names. > Because current patch always tries to parse s_s_names and use it > regardless value of parameter. Yeah, you're right. > > Attached patch incorporates above comments. > Please find it. Attached is the latest version of the patch based on your patch. Regards, -- Fujii Masao
Attachment
On Wed, Mar 23, 2016 at 1:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Mar 23, 2016 at 2:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Attached patch incorporates above comments. >> Please find it. > > Attached is the latest version of the patch based on your patch. I haven't really looked at the core patch yet... + my $result = $node_master->psql('postgres', "SELECT application_name, sync_priority, sync_state FROM pg_stat_replication;"); + print "$result \n"; Adding ORDER BY application_name would be good for those queries, and the result outputs could be made more consistent as a result. + # Change the s_s_names = '2[standby1,standby2,standby3]' and check sync state + $node_master->psql('postgres', "ALTER SYSTEM SET synchronous_standby_names = '2[standby1,standby2,standby3]';"); + $node_master->psql('postgres', "SELECT pg_reload_conf();"); Let's add a reload routine in PostgresNode.pm; this patch is not the only one that would use it. --- b/src/test/recovery/t/006_multisync_rep.pl *************** *** 0 **** --- 1,106 ---- + use strict; + use warnings; You may want to add a small description for this test as a header. $postgres->AddFiles('src/backend/replication', 'repl_scanner.l', 'repl_gram.y'); + $postgres->AddFiles('src/backend/replication', 'syncrep_scanner.l', + 'syncrep_gram.y'); There is no need for a new routine call here; you can just append the new files to the existing call. -- Michael
Hello, At Tue, 22 Mar 2016 23:08:36 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwFYG829=2r4mxV0ULeBNaUuG0ek_10yymx8Cu-gLYcLng@mail.gmail.com> > On Tue, Mar 22, 2016 at 9:58 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Thank you for the revised patch. > > Thanks for reviewing the patch! > > > This version looks to focus on n-priority method. Stuffs for the > > other methods like n-quorum has been removed. It is okay for me. > > I don't think it's so difficult to extend this version so that > it supports also quorum commit. Mmm. I think I understand this just now. As Sawada-san said before, since all standbys in a single-level quorum set have the same sync_standby_priority, the current algorithm works as it is. It is also true for the case where some quorum sets are in a priority set. What about some priority sets in a quorum set? > > StringInfo for double-quoted names seems to me to be overkill, > > since it allocates 1024 byte block for every such name. A static > > buffer seems enough for the usage as I said. > > So, what about changing the scanner code as follows? > > <xd>{xdstop} { > yylval.str = pstrdup(xdbuf.data); > pfree(xdbuf.data); > BEGIN(INITIAL); > return NAME; > > > The parser is called for not only for SIGHUP, but also for > > starting of every walsender. The latter is not necessary but it > > is the matter of trade-off between simplisity and > > effectiveness. > > Could you elaborate why you think that's not necessary? Sorry, walsender startup is not such a large problem; the 1024 bytes of memory are just abandoned once. SIGHUP is rather a problem. The part is called under two kinds of memory context: "config file processing" and then "Replication command context". 
The former is deleted just after reading the config file, so no harm done, but the latter is a quite long-lasting context, and every reload bloats it with abandoned memory blocks. The memory needs to be pfree'd, or we should use a memory context with a shorter lifetime, or static storage of 64-byte length, even though the bloat becomes visible only after very many config reloads. > BTW, in previous patch, s_s_names is parsed by postmaster during the server > startup. A child process takes over the internal data struct for the parsed > s_s_names when it's forked by the postmaster. This is what the previous > patch was expecting. However, this doesn't work in EXEC_BACKEND environment. > In that environment, the data struct should be passed to a child process via > the special file (like write_nondefault_variables() does), or it should > be constructed during walsender startup (like latest version of the patch > does). IMO the latter is simpler. Ah, I hadn't noticed that, but I agree with it. As per my previous comment, syncrep_scanner.l doesn't reject some (nonprintable and multibyte) characters in a name, which would be silently replaced with '?' in application_name. It would not be a problem for almost all of us, but it might need to be documented if we don't change the behavior to match application_name. By the way, the following documentation fix mentioned by Thomas, - to as 2-safe replication in computer science theory. + to as group-safe replication in computer science theory. 
On Wed, Mar 23, 2016 at 5:32 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > At Tue, 22 Mar 2016 23:08:36 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwFYG829=2r4mxV0ULeBNaUuG0ek_10yymx8Cu-gLYcLng@mail.gmail.com> >> On Tue, Mar 22, 2016 at 9:58 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > Thank you for the revised patch. >> >> Thanks for reviewing the patch! >> >> > This version looks to focus on n-priority method. Stuffs for the >> > other methods like n-quorum has been removed. It is okay for me. >> >> I don't think it's so difficult to extend this version so that >> it supports also quorum commit. > > Mmm. I think I understand this just now. As Sawada-san said > before, all standbys in a single-level quorum set having the same > sync_standby_prioirity, the current algorithm works as it is. It > also true for the case that some quorum sets are in a priority > set. > > What about some priority sets in a quorum set? > >> > StringInfo for double-quoted names seems to me to be overkill, >> > since it allocates 1024 byte block for every such name. A static >> > buffer seems enough for the usage as I said. >> >> So, what about changing the scanner code as follows? >> >> <xd>{xdstop} { >> yylval.str = pstrdup(xdbuf.data); >> pfree(xdbuf.data); >> BEGIN(INITIAL); >> return NAME; >> >> > The parser is called for not only for SIGHUP, but also for >> > starting of every walsender. The latter is not necessary but it >> > is the matter of trade-off between simplisity and >> > effectiveness. >> >> Could you elaborate why you think that's not necessary? > > Sorry, starting of walsender is not so large problem, 1024 bytes > memory is just abandoned once. SIGHUP is rather a problem. > > The part is called under two kinds of memory context, "config > file processing" then "Replication command context". 
The former > is deleted just after reading the config file so no harm but the > latter is a quite long-lasting context and every reloading bloats > the context with abandoned memory blocks. It is needed to be > pfreed or to use a memory context with shorter lifetime, or use > static storage of 64 byte-length, even though the bloat become > visible after very many times of conf reloads. SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that in the patch. Or am I missing something? >> BTW, in previous patch, s_s_names is parsed by postmaster during the server >> startup. A child process takes over the internal data struct for the parsed >> s_s_names when it's forked by the postmaster. This is what the previous >> patch was expecting. However, this doesn't work in EXEC_BACKEND environment. >> In that environment, the data struct should be passed to a child process via >> the special file (like write_nondefault_variables() does), or it should >> be constructed during walsender startup (like latest version of the patch >> does). IMO the latter is simpler. > > Ah, I haven't notice that but I agree with it. > > > As per my previous comment, syncrep_scanner.l doesn't reject some > (nonprintable and multibyte) characters in a name, which is to be > silently replaced with '?' for application_name. It would not be > a problem for almost all of us but might be needed to be > documented if we won't change the behavior to be the same as > application_name. There are three options: 1. Replace nonprintable and non-ASCII characters in s_s_names with ? 2. Emit an error if s_s_names contains nonprintable and non-ASCII characters 3. Do nothing (9.5 or before behave in this way) You implied that we should choose #1 or #2? > By the way, the following documentation fix mentioned by Thomas, > > - to as 2-safe replication in computer science theory. > + to as group-safe replication in computer science theory. 
> > should be restored if the discussion in the following message is > true. And some supplemental description would be needed. > > http://www.postgresql.org/message-id/20160316.164833.188624159.horiguchi.kyotaro@lab.ntt.co.jp Yeah, the document needs to be updated. Regards, -- Fujii Masao
On Thu, Mar 24, 2016 at 11:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Mar 23, 2016 at 5:32 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> Hello, >> >> At Tue, 22 Mar 2016 23:08:36 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwFYG829=2r4mxV0ULeBNaUuG0ek_10yymx8Cu-gLYcLng@mail.gmail.com> >>> On Tue, Mar 22, 2016 at 9:58 PM, Kyotaro HORIGUCHI >>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>> > Thank you for the revised patch. >>> >>> Thanks for reviewing the patch! >>> >>> > This version looks to focus on n-priority method. Stuffs for the >>> > other methods like n-quorum has been removed. It is okay for me. >>> >>> I don't think it's so difficult to extend this version so that >>> it supports also quorum commit. >> >> Mmm. I think I understand this just now. As Sawada-san said >> before, all standbys in a single-level quorum set having the same >> sync_standby_prioirity, the current algorithm works as it is. It >> also true for the case that some quorum sets are in a priority >> set. >> >> What about some priority sets in a quorum set? We should surely consider that when we support more than one level of nesting. IMO, we can have another piece of information which indicates the current sync standbys, instead of sync_priority. For now, we aren't trying to support even the quorum method, so we could consider it after we can support both the priority method and the quorum method without incident. >>> > StringInfo for double-quoted names seems to me to be overkill, >>> > since it allocates 1024 byte block for every such name. A static >>> > buffer seems enough for the usage as I said. >>> >>> So, what about changing the scanner code as follows? >>> >>> <xd>{xdstop} { >>> yylval.str = pstrdup(xdbuf.data); >>> pfree(xdbuf.data); >>> BEGIN(INITIAL); >>> return NAME; >>> >>> > The parser is called for not only for SIGHUP, but also for >>> > starting of every walsender. 
The latter is not necessary but it >>> > is the matter of trade-off between simplisity and >>> > effectiveness. >>> >>> Could you elaborate why you think that's not necessary? >> >> Sorry, starting of walsender is not so large problem, 1024 bytes >> memory is just abandoned once. SIGHUP is rather a problem. >> >> The part is called under two kinds of memory context, "config >> file processing" then "Replication command context". The former >> is deleted just after reading the config file so no harm but the >> latter is a quite long-lasting context and every reloading bloats >> the context with abandoned memory blocks. It is needed to be >> pfreed or to use a memory context with shorter lifetime, or use >> static storage of 64 byte-length, even though the bloat become >> visible after very many times of conf reloads. > > SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that > in the patch. Or am I missing something? > >>> BTW, in previous patch, s_s_names is parsed by postmaster during the server >>> startup. A child process takes over the internal data struct for the parsed >>> s_s_names when it's forked by the postmaster. This is what the previous >>> patch was expecting. However, this doesn't work in EXEC_BACKEND environment. >>> In that environment, the data struct should be passed to a child process via >>> the special file (like write_nondefault_variables() does), or it should >>> be constructed during walsender startup (like latest version of the patch >>> does). IMO the latter is simpler. >> >> Ah, I haven't notice that but I agree with it. >> >> >> As per my previous comment, syncrep_scanner.l doesn't reject some >> (nonprintable and multibyte) characters in a name, which is to be >> silently replaced with '?' for application_name. It would not be >> a problem for almost all of us but might be needed to be >> documented if we won't change the behavior to be the same as >> application_name. > > There are three options: > > 1. 
Replace nonprintable and non-ASCII characters in s_s_names with ? > 2. Emit an error if s_s_names contains nonprintable and non-ASCII characters > 3. Do nothing (9.5 or before behave in this way) > > You implied that we should choose #1 or #2? The previous (9.5 or before) s_s_names also accepts non-ASCII and non-printable characters, and can show them without replacing these characters with '?'. From a backward-compatibility perspective, we should not choose #1 or #2. The difference in behaviour between the previous and current s_s_names is that the previous s_s_names doesn't accept a node name containing the sort of whitespace character for which isspace() returns true, while the current s_s_names allows us to specify such a node name. I guess that changing that behaviour is enough to fix this issue. Thoughts? > >> By the way, the following documentation fix mentioned by Thomas, >> >> - to as 2-safe replication in computer science theory. >> + to as group-safe replication in computer science theory. >> >> should be restored if the discussion in the following message is >> true. And some supplemental description would be needed. >> >> http://www.postgresql.org/message-id/20160316.164833.188624159.horiguchi.kyotaro@lab.ntt.co.jp > > Yeah, the document needs to be updated. I will do that. Regards, -- Masahiko Sawada
Hello, At Thu, 24 Mar 2016 13:04:49 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBVn3_5qC_CKeKSXTu963mM=n9-GxzF7KCPreTTMS+JGQ@mail.gmail.com> > On Thu, Mar 24, 2016 at 11:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > > On Wed, Mar 23, 2016 at 5:32 PM, Kyotaro HORIGUCHI > > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > >>> I don't think it's so difficult to extend this version so that > >>> it supports also quorum commit. > >> > >> Mmm. I think I understand this just now. As Sawada-san said > >> before, all standbys in a single-level quorum set having the same > >> sync_standby_prioirity, the current algorithm works as it is. It > >> also true for the case that some quorum sets are in a priority > >> set. > >> > >> What about some priority sets in a quorum set? > > We should surely consider it that when we support more than 1 nest > level configuration. > IMO, we can have another information which indicates current sync > standbys instead of sync_priority. > For now, we are'nt trying to support even quorum method, so we could > consider it after we can support both priority method and quorum > method without incident. Fine with me. > >>> > StringInfo for double-quoted names seems to me to be overkill, > >>> > since it allocates 1024 byte block for every such name. A static > >>> > buffer seems enough for the usage as I said. > >>> > >>> So, what about changing the scanner code as follows? > >>> > >>> <xd>{xdstop} { > >>> yylval.str = pstrdup(xdbuf.data); > >>> pfree(xdbuf.data); > >>> BEGIN(INITIAL); > >>> return NAME; > >>> > >>> > The parser is called for not only for SIGHUP, but also for > >>> > starting of every walsender. The latter is not necessary but it > >>> > is the matter of trade-off between simplisity and > >>> > effectiveness. > >>> > >>> Could you elaborate why you think that's not necessary? > >> > >> Sorry, starting of walsender is not so large problem, 1024 bytes > >> memory is just abandoned once. SIGHUP is rather a problem. 
> >> > >> The part is called under two kinds of memory context, "config > >> file processing" then "Replication command context". The former > >> is deleted just after reading the config file so no harm but the > >> latter is a quite long-lasting context and every reloading bloats > >> the context with abandoned memory blocks. It is needed to be > >> pfreed or to use a memory context with shorter lifetime, or use > >> static storage of 64 byte-length, even though the bloat become > >> visible after very many times of conf reloads. > > > > SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that > > in the patch. Or am I missing something? Sorry; instead, the memory from strdup() will be abandoned at the upper level. (Thinking for some time..) Ah, I found that the problem should be here: > SyncRepFreeConfig(SyncRepConfigData *config) > { ... > list_free(config->members); > pfree(config); > } list_free() *doesn't* free the memory blocks pointed to by lfirst(cell), which have been pstrdup'ed. It should be list_free_deep(config->members) instead, to free everything. > >>> BTW, in previous patch, s_s_names is parsed by postmaster during the server > >>> startup. A child process takes over the internal data struct for the parsed > >>> s_s_names when it's forked by the postmaster. This is what the previous > >>> patch was expecting. However, this doesn't work in EXEC_BACKEND environment. 
It would not be > >> a problem for almost all of us but might be needed to be > >> documented if we won't change the behavior to be the same as > >> application_name. > > > > There are three options: > > > > 1. Replace nonprintable and non-ASCII characters in s_s_names with ? > > 2. Emit an error if s_s_names contains nonprintable and non-ASCII characters > > 3. Do nothing (9.5 or before behave in this way) > > > > You implied that we should choose #1 or #2? > > Previous(9.5 or before) s_s_names also accepts non-ASCII character and > non-printable character, and can show it without replacing these > character to '?'. Thank you for pointing it out (it was completely out of my mind..). I have no objection to keeping the previous behavior. > From backward compatibility perspective, we should not choose #1 or #2. > Different behaviour between previous and current s_s_names is that > previous s_s_names doesn't accept the node name having the sort of > white-space character that isspace() returns true with. > But current s_s_names allows us to specify such a node name. > I guess that changing such behaviour is enough for fixing this issue. > Thoughts? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Mar 24, 2016 at 2:26 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > At Thu, 24 Mar 2016 13:04:49 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBVn3_5qC_CKeKSXTu963mM=n9-GxzF7KCPreTTMS+JGQ@mail.gmail.com> >> On Thu, Mar 24, 2016 at 11:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> > On Wed, Mar 23, 2016 at 5:32 PM, Kyotaro HORIGUCHI >> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> >>> I don't think it's so difficult to extend this version so that >> >>> it supports also quorum commit. >> >> >> >> Mmm. I think I understand this just now. As Sawada-san said >> >> before, all standbys in a single-level quorum set having the same >> >> sync_standby_prioirity, the current algorithm works as it is. It >> >> also true for the case that some quorum sets are in a priority >> >> set. >> >> >> >> What about some priority sets in a quorum set? >> >> We should surely consider it that when we support more than 1 nest >> level configuration. >> IMO, we can have another information which indicates current sync >> standbys instead of sync_priority. >> For now, we are'nt trying to support even quorum method, so we could >> consider it after we can support both priority method and quorum >> method without incident. > > Fine with me. > >> >>> > StringInfo for double-quoted names seems to me to be overkill, >> >>> > since it allocates 1024 byte block for every such name. A static >> >>> > buffer seems enough for the usage as I said. >> >>> >> >>> So, what about changing the scanner code as follows? >> >>> >> >>> <xd>{xdstop} { >> >>> yylval.str = pstrdup(xdbuf.data); >> >>> pfree(xdbuf.data); >> >>> BEGIN(INITIAL); >> >>> return NAME; >> >>> >> >>> > The parser is called for not only for SIGHUP, but also for >> >>> > starting of every walsender. The latter is not necessary but it >> >>> > is the matter of trade-off between simplisity and >> >>> > effectiveness. 
>> >>> >> >>> Could you elaborate why you think that's not necessary? >> >> >> >> Sorry, starting of walsender is not so large problem, 1024 bytes >> >> memory is just abandoned once. SIGHUP is rather a problem. >> >> >> >> The part is called under two kinds of memory context, "config >> >> file processing" then "Replication command context". The former >> >> is deleted just after reading the config file so no harm but the >> >> latter is a quite long-lasting context and every reloading bloats >> >> the context with abandoned memory blocks. It is needed to be >> >> pfreed or to use a memory context with shorter lifetime, or use >> >> static storage of 64 byte-length, even though the bloat become >> >> visible after very many times of conf reloads. >> > >> > SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that >> > in the patch. Or am I missing something? > > Sorry, instead, the memory from strdup() will be abandoned in > upper level. (Thinking for some time..) Ah, I found that the > problem should be here. > > > SyncRepFreeConfig(SyncRepConfigData *config) > > { > ... > !> list_free(config->members); > > pfree(config); > > } > > The list_free *doesn't* free the memory blocks pointed by > lfirst(cell), which has been pstrdup'ed. It should be > list_free_deep(config->members) instead to free it completely. Yep, but SyncRepFreeConfig() already uses list_free_deep in the latest patch. Could you read the latest version that I posted upthread. Regards, -- Fujii Masao
On Thu, Mar 24, 2016 at 2:26 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > At Thu, 24 Mar 2016 13:04:49 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBVn3_5qC_CKeKSXTu963mM=n9-GxzF7KCPreTTMS+JGQ@mail.gmail.com> >> On Thu, Mar 24, 2016 at 11:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> > On Wed, Mar 23, 2016 at 5:32 PM, Kyotaro HORIGUCHI >> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> >>> I don't think it's so difficult to extend this version so that >> >>> it supports also quorum commit. >> >> >> >> Mmm. I think I understand this just now. As Sawada-san said >> >> before, all standbys in a single-level quorum set having the same >> >> sync_standby_prioirity, the current algorithm works as it is. It >> >> also true for the case that some quorum sets are in a priority >> >> set. >> >> >> >> What about some priority sets in a quorum set? >> >> We should surely consider it that when we support more than 1 nest >> level configuration. >> IMO, we can have another information which indicates current sync >> standbys instead of sync_priority. >> For now, we are'nt trying to support even quorum method, so we could >> consider it after we can support both priority method and quorum >> method without incident. > > Fine with me. > >> >>> > StringInfo for double-quoted names seems to me to be overkill, >> >>> > since it allocates 1024 byte block for every such name. A static >> >>> > buffer seems enough for the usage as I said. >> >>> >> >>> So, what about changing the scanner code as follows? >> >>> >> >>> <xd>{xdstop} { >> >>> yylval.str = pstrdup(xdbuf.data); >> >>> pfree(xdbuf.data); >> >>> BEGIN(INITIAL); >> >>> return NAME; >> >>> >> >>> > The parser is called for not only for SIGHUP, but also for >> >>> > starting of every walsender. The latter is not necessary but it >> >>> > is the matter of trade-off between simplisity and >> >>> > effectiveness. 
>> >>> >> >>> Could you elaborate why you think that's not necessary? >> >> >> >> Sorry, starting of walsender is not so large problem, 1024 bytes >> >> memory is just abandoned once. SIGHUP is rather a problem. >> >> >> >> The part is called under two kinds of memory context, "config >> >> file processing" then "Replication command context". The former >> >> is deleted just after reading the config file so no harm but the >> >> latter is a quite long-lasting context and every reloading bloats >> >> the context with abandoned memory blocks. It is needed to be >> >> pfreed or to use a memory context with shorter lifetime, or use >> >> static storage of 64 byte-length, even though the bloat become >> >> visible after very many times of conf reloads. >> > >> > SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that >> > in the patch. Or am I missing something? > > Sorry, instead, the memory from strdup() will be abandoned in > upper level. (Thinking for some time..) Ah, I found that the > problem should be here. > > > SyncRepFreeConfig(SyncRepConfigData *config) > > { > ... > !> list_free(config->members); > > pfree(config); > > } > > The list_free *doesn't* free the memory blocks pointed by > lfirst(cell), which has been pstrdup'ed. It should be > list_free_deep(config->members) instead to free it completely. >> >>> BTW, in previous patch, s_s_names is parsed by postmaster during the server >> >>> startup. A child process takes over the internal data struct for the parsed >> >>> s_s_names when it's forked by the postmaster. This is what the previous >> >>> patch was expecting. However, this doesn't work in EXEC_BACKEND environment. >> >>> In that environment, the data struct should be passed to a child process via >> >>> the special file (like write_nondefault_variables() does), or it should >> >>> be constructed during walsender startup (like latest version of the patch >> >>> does). IMO the latter is simpler. 
>> >> >> >> Ah, I haven't notice that but I agree with it. >> >> >> >> >> >> As per my previous comment, syncrep_scanner.l doesn't reject some >> >> (nonprintable and multibyte) characters in a name, which is to be >> >> silently replaced with '?' for application_name. It would not be >> >> a problem for almost all of us but might be needed to be >> >> documented if we won't change the behavior to be the same as >> >> application_name. >> > >> > There are three options: >> > >> > 1. Replace nonprintable and non-ASCII characters in s_s_names with ? >> > 2. Emit an error if s_s_names contains nonprintable and non-ASCII characters >> > 3. Do nothing (9.5 or before behave in this way) >> > >> > You implied that we should choose #1 or #2? >> >> Previous(9.5 or before) s_s_names also accepts non-ASCII character and >> non-printable character, and can show it without replacing these >> character to '?'. > > Thank you for pointint it out (it was completely out of my > mind..). I have no objection to keep the previous behavior. > >> From backward compatibility perspective, we should not choose #1 or #2. >> Different behaviour between previous and current s_s_names is that >> previous s_s_names doesn't accept the node name having the sort of >> white-space character that isspace() returns true with. >> But current s_s_names allows us to specify such a node name. >> I guess that changing such behaviour is enough for fixing this issue. >> Thoughts? > Attached latest patch incorporating all review comments so far. Aside from the review comments, I did following changes; - Add logic to avoid fatal exit in yy_fatal_error(). - Improve regression test cases. Also I felt a sense of discomfort regarding using [ and ] as a special character for priority method. Because (, ) and [, ] are a little similar each other, so it would easily make many syntax errors when nested style is supported. 
And the synopsis of that in the documentation is odd; synchronous_standby_names = 'N [ node_name [, ...] ]' This topic has already been discussed before, but we might want to change it to other characters such as < and >? Regards, -- Masahiko Sawada
Attachment
On Thu, Mar 24, 2016 at 9:29 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Also I felt a sense of discomfort regarding using [ and ] as a special > character for priority method. > Because (, ) and [, ] are a little similar each other, so it would > easily make many syntax errors when nested style is supported. > And the synopsis of that in documentation is odd; > synchronous_standby_names = 'N [ node_name [, ...] ]' > > This topic has been already discussed before but, we might want to > change it to other characters such as < and >? I personally would recommend against <>. Those should mean less-than and greater-than, not grouping. I think you could use parentheses, (). There's nothing saying that has to mean any particular thing, so you may as well use it for the first thing implemented, perhaps. Or you could use [] or {}. It *is* important that you don't create confusing syntax summaries, but I don't think that's a reason to pick a nonstandard syntax for grouping. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 25, 2016 at 9:20 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Mar 24, 2016 at 9:29 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Also I felt a sense of discomfort regarding using [ and ] as a special >> character for priority method. >> Because (, ) and [, ] are a little similar each other, so it would >> easily make many syntax errors when nested style is supported. >> And the synopsis of that in documentation is odd; >> synchronous_standby_names = 'N [ node_name [, ...] ]' >> >> This topic has been already discussed before but, we might want to >> change it to other characters such as < and >? > > I personally would recommend against <>. Those should mean less-than > and greater-than, not grouping. I think you could use parentheses, > (). There's nothing saying that has to mean any particular thing, so > you may as well use it for the first thing implemented, perhaps. Or > you could use [] or {}. It *is* important that you don't create > confusing syntax summaries, but I don't think that's a reason to pick > a nonstandard syntax for grouping. > I agree with you. I've changed it to use parentheses. Regards, -- Masahiko Sawada
Attachment
Thank you for the new patch. Sorry to have overlooked some versions. I'm looking at the v19 patch now. make complains about an unused variable. | syncrep.c: In function ‘SyncRepGetSyncStandbys’: | syncrep.c:601:13: warning: variable ‘next’ set but not used [-Wunused-but-set-variable] | ListCell *next; At Thu, 24 Mar 2016 22:29:01 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCxwezOTf9kLQRhuf2y=1c_fGjCormqJfqHOmQW8EgaDg@mail.gmail.com> > >> > SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that > >> > in the patch. Or am I missing something? > > > > Sorry, instead, the memory from strdup() will be abandoned in > > upper level. (Thinking for some time..) Ah, I found that the > > problem should be here. > > > > > SyncRepFreeConfig(SyncRepConfigData *config) > > > { > > ... > > !> list_free(config->members); > > > pfree(config); > > > } > > > > The list_free *doesn't* free the memory blocks pointed by > > lfirst(cell), which has been pstrdup'ed. It should be > > list_free_deep(config->members) instead to free it completely. Fujii> Yep, but SyncRepFreeConfig() already uses list_free_deep Fujii> in the latest patch. Could you read the latest version Fujii> that I posted upthread. Sorry for overlooking that version. Every pair of parse (or SyncRepUpdateConfig) and SyncRepFreeConfig runs in the same memory context so it seems safe (but might be fragile since it relies on the caller doing so). > >> Previous(9.5 or before) s_s_names also accepts non-ASCII character and > >> non-printable character, and can show it without replacing these > >> character to '?'. > > > > Thank you for pointint it out (it was completely out of my > > mind..). I have no objection to keep the previous behavior. > > > >> From backward compatibility perspective, we should not choose #1 or #2. 
> >> Different behaviour between previous and current s_s_names is that > >> previous s_s_names doesn't accept the node name having the sort of > >> white-space character that isspace() returns true with. > >> But current s_s_names allows us to specify such a node name. > >> I guess that changing such behaviour is enough for fixing this issue. > >> Thoughts? > > > > Attached latest patch incorporating all review comments so far. > > Aside from the review comments, I did following changes; > - Add logic to avoid fatal exit in yy_fatal_error(). Maybe good catch, but.. > syncrep_scanstr(const char *str) .. > * Regain control after a fatal, internal flex error. It may have > * corrupted parser state. Consequently, abandon the file, but trust ~~~~~~~~~~~~~~~~ > * that the state remains sane enough for syncrep_yy_delete_buffer(). ~~~~~~~~~~~~~~~~~~~~~~~~ guc-file.l actually abandons the config file but syncrep_scanner reads only the value of one item in it. And, the latter is eventually true but a bit hard to understand. The patch will emit a mysterious error message like this. > invalid value for parameter "synchronous_standby_names": "2[a,b,c]" > configuration file ".../postgresql.conf" contains errors This is utterly wrong. A bit related to that, it seems to me that syncrep_scan.l doesn't need the same mechanism as guc-file.l. The nature of the modification would be making call_*_check_hook tri-state instead of boolean. So just catching errors in call_*_check_hook and ereport()'ing as the SQL parser does seems enough, but either will do for me. > - Improve regression test cases. I forgot to mention that, but an additional ORDER BY makes the test robust. I doubt the validity of the behavior in the following test. > # Change the synchronous_standby_names = '2[standby1,*,standby2]' and check sync_state Is this regarded as a correct value for it? > Also I felt a sense of discomfort regarding using [ and ] as a special > character for priority method. 
> Because (, ) and [, ] are a little similar each other, so it would > easily make many syntax errors when nested style is supported. > And the synopsis of that in documentation is odd; > synchronous_standby_names = 'N [ node_name [, ...] ]' > > This topic has been already discussed before but, we might want to > change it to other characters such as < and >? I don't mind either, but as Robert said, it is true that the characters conventionally used to enclose something should be preferred over other characters. Distinguishability of glyphs has less significance, perhaps. # LISPers don't hesitate to dive into the Sea of Parens. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On 2016/03/28 17:50, Kyotaro HORIGUCHI wrote: > > # LISPers don't hesitate to dive into Sea of Parens. Sorry in advance to be off-topic: https://xkcd.com/297 :) Thanks, Amit
On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Thank you for the new patch. Sorry to have overlooked some > versions. I'm looking the v19 patch now. > > make complains for an unused variable. > > | syncrep.c: In function ‘SyncRepGetSyncStandbys’: > | syncrep.c:601:13: warning: variable ‘next’ set but not used [-Wunused-but-set-variable] > | ListCell *next; > > > At Thu, 24 Mar 2016 22:29:01 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCxwezOTf9kLQRhuf2y=1c_fGjCormqJfqHOmQW8EgaDg@mail.gmail.com> >> >> > SyncRepInitConfig()->SyncRepFreeConfig() has already pfree'd that >> >> > in the patch. Or am I missing something? >> > >> > Sorry, instead, the memory from strdup() will be abandoned in >> > upper level. (Thinking for some time..) Ah, I found that the >> > problem should be here. >> > >> > > SyncRepFreeConfig(SyncRepConfigData *config) >> > > { >> > ... >> > !> list_free(config->members); >> > > pfree(config); >> > > } >> > >> > The list_free *doesn't* free the memory blocks pointed by >> > lfirst(cell), which has been pstrdup'ed. It should be >> > list_free_deep(config->members) instead to free it completely. > > Fujii> Yep, but SyncRepFreeConfig() already uses list_free_deep > Fujii> in the latest patch. Could you read the latest version > Fujii> that I posted upthread. > > Sorry for overlooked the version. Every pair of parse(or > SyncRepUpdateConfig) and SyncRepFreeConfig is on the same memory > context so it seems safe (but might be fragile since it relies on > that the caller does so.). > >> >> Previous(9.5 or before) s_s_names also accepts non-ASCII character and >> >> non-printable character, and can show it without replacing these >> >> character to '?'. >> > >> > Thank you for pointint it out (it was completely out of my >> > mind..). I have no objection to keep the previous behavior. >> > >> >> From backward compatibility perspective, we should not choose #1 or #2. 
>> >> Different behaviour between previous and current s_s_names is that >> >> previous s_s_names doesn't accept the node name having the sort of >> >> white-space character that isspace() returns true with. >> >> But current s_s_names allows us to specify such a node name. >> >> I guess that changing such behaviour is enough for fixing this issue. >> >> Thoughts? >> > >> >> Attached latest patch incorporating all review comments so far. >> >> Aside from the review comments, I did following changes; >> - Add logic to avoid fatal exit in yy_fatal_error(). > > Maybe good catch, but.. > >> syncrep_scanstr(const char *str) > .. >> * Regain control after a fatal, internal flex error. It may have >> * corrupted parser state. Consequently, abandon the file, but trust > ~~~~~~~~~~~~~~~~ >> * that the state remains sane enough for syncrep_yy_delete_buffer(). > ~~~~~~~~~~~~~~~~~~~~~~~~ > > guc-file.l actually abandones the config file but syncrep_scanner > reads only a value of an item in it. And, the latter is > eventually true but a bit hard to understand. > > The patch will emit a mysterious error message like this. > >> invalid value for parameter "synchronous_standby_names": "2[a,b,c]" >> configuration file ".../postgresql.conf" contains errors > > This is utterly wrong. A bit related to that, it seems to me that > syncrep_scan.l doesn't need the same mechanism with > guc-file.l. The nature of the modification would be making > call_*_check_hook to be tri-state instead of boolean. So just > cathing errors in call_*_check_hook and ereport()'ing as SQL > parser does seems enough, but either will do for me. Well, I think that call_*_check_hook cannot catch such a fatal error, because if yy_fatal_error() is called without such preventive logic while reloading the configuration file, the postmaster process will exit abnormally and immediately, as will the walsender process. > >> - Improve regression test cases. > > I forgot to mention that, but additionalORDER BY makes the test > robust. 
> > I doubt the validity of the behavior in the following test. > >> # Change the synchronous_standby_names = '2[standby1,*,standby2]' and check sync_state > > Is this regarded as a correct as a value for it? Since previous s_s_names (9.5 or before) can accept this value, I didn't change behaviour. And I added this test case for checking backward compatibility more finely. Regards, -- Masahiko Sawada
Hello, At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com> sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Thank you for the new patch. Sorry to have overlooked some > > versions. I'm looking the v19 patch now. > > > > make complains for an unused variable. Thank you. I'll have a closer look on it a bit later. > >> Attached latest patch incorporating all review comments so far. > >> > >> Aside from the review comments, I did following changes; > >> - Add logic to avoid fatal exit in yy_fatal_error(). > > > > Maybe good catch, but.. > > > >> syncrep_scanstr(const char *str) > > .. > >> * Regain control after a fatal, internal flex error. It may have > >> * corrupted parser state. Consequently, abandon the file, but trust > > ~~~~~~~~~~~~~~~~ > >> * that the state remains sane enough for syncrep_yy_delete_buffer(). > > ~~~~~~~~~~~~~~~~~~~~~~~~ > > > > guc-file.l actually abandones the config file but syncrep_scanner > > reads only a value of an item in it. And, the latter is > > eventually true but a bit hard to understand. > > > > The patch will emit a mysterious error message like this. > > > >> invalid value for parameter "synchronous_standby_names": "2[a,b,c]" > >> configuration file ".../postgresql.conf" contains errors > > > > This is utterly wrong. A bit related to that, it seems to me that > > syncrep_scan.l doesn't need the same mechanism with > > guc-file.l. The nature of the modification would be making > > call_*_check_hook to be tri-state instead of boolean. So just > > cathing errors in call_*_check_hook and ereport()'ing as SQL > > parser does seems enough, but either will do for me. > > Well, I think that call_*_check_hook can not catch such a fatal error. 
As mentioned in my comment, the SQL parser converts yy_fatal_error into ereport(ERROR), which can be caught by the upper PG_TRY (by #define'ing fprintf). So it is doable if you mind exit(). > Because if yy_fatal_error() is called without preventing logic when > reloading configuration file, postmaster process will abnormal exit > immediately as well as wal sender process. > >> - Improve regression test cases. > > > > I forgot to mention that, but additionalORDER BY makes the test > > robust. > > > > I doubt the validity of the behavior in the following test. > > > >> # Change the synchronous_standby_names = '2[standby1,*,standby2]' and check sync_state > > > > Is this regarded as a correct as a value for it? > > Since previous s_s_names (9.5 or before) can accept this value, I > didn't change behaviour. > And I added this test case for checking backward compatibility more finely. I understand that and it's fine. But we need an explanation for the reason above in the test case or somewhere else. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com> > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > Thank you for the new patch. Sorry to have overlooked some >> > versions. I'm looking the v19 patch now. >> > >> > make complains for an unused variable. > > Thank you. I'll have a closer look on it a bit later. > >> >> Attached latest patch incorporating all review comments so far. >> >> >> >> Aside from the review comments, I did following changes; >> >> - Add logic to avoid fatal exit in yy_fatal_error(). >> > >> > Maybe good catch, but.. >> > >> >> syncrep_scanstr(const char *str) >> > .. >> >> * Regain control after a fatal, internal flex error. It may have >> >> * corrupted parser state. Consequently, abandon the file, but trust >> > ~~~~~~~~~~~~~~~~ >> >> * that the state remains sane enough for syncrep_yy_delete_buffer(). >> > ~~~~~~~~~~~~~~~~~~~~~~~~ >> > >> > guc-file.l actually abandones the config file but syncrep_scanner >> > reads only a value of an item in it. And, the latter is >> > eventually true but a bit hard to understand. >> > >> > The patch will emit a mysterious error message like this. >> > >> >> invalid value for parameter "synchronous_standby_names": "2[a,b,c]" >> >> configuration file ".../postgresql.conf" contains errors >> > >> > This is utterly wrong. A bit related to that, it seems to me that >> > syncrep_scan.l doesn't need the same mechanism with >> > guc-file.l. The nature of the modification would be making >> > call_*_check_hook to be tri-state instead of boolean. So just >> > cathing errors in call_*_check_hook and ereport()'ing as SQL >> > parser does seems enough, but either will do for me. 
>> Well, I think that call_*_check_hook can not catch such a fatal error. > > As mentioned in my comment, SQL parser converts yy_fatal_error > into ereport(ERROR), which can be caught by the upper PG_TRY (by > #define'ing fprintf). So it is doable if you mind exit(). I'm afraid that your idea doesn't work in the postmaster, because ereport(ERROR) is implicitly promoted to ereport(FATAL) in the postmaster. IOW, when an internal flex fatal error occurs, the postmaster just exits instead of jumping out of the parser. ISTM that, when an internal flex fatal error occurs, it's better to elog(FATAL) and terminate the problematic process. This might lead to a server crash (e.g., if the postmaster emits a FATAL error, it and all its child processes will exit soon). But probably we can live with this because such a fatal error rarely happens. OTOH, if we make the process keep running even after it gets an internal fatal error (like Sawada's patch or your idea do), this might cause a more serious problem. Please imagine the case where one walsender gets the fatal error (e.g., because of OOM), abandons the new setting value of synchronous_standby_names, and keeps running with the previous setting value. OTOH, the other walsender processes successfully parse the setting and keep running with the new setting. In this case, each walsender ends up operating on a different setting, which will completely mess up synchronous replication. Therefore, I think that it's better to make the problematic process exit with a FATAL error rather than ignore the error and keep it running. Regards, -- Fujii Masao
I personally don't think it needs such a survival measure. It is a very small syntax and the parser reads very short text. If the parser fails in such a mode, something more serious must have occurred. At Tue, 29 Mar 2016 16:51:02 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwFth8pnYhaLBx0nF8o4qmwctdzEOcWRqEu7HOwgdJGa3g@mail.gmail.com> > On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello, > > > > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com> > > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI > >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > As mentioned in my comment, SQL parser converts yy_fatal_error > > into ereport(ERROR), which can be caught by the upper PG_TRY (by > > #define'ing fprintf). So it is doable if you mind exit(). > > I'm afraid that your idea doesn't work in postmaster. Because ereport(ERROR) is > implicitly promoted to ereport(FATAL) in postmaster. IOW, when an internal > flex fatal error occurs, postmaster just exits instead of jumping out of parser. The ERROR may be LOG or DEBUG2 instead, if we think the parser fatal errors are recoverable. guc-file.l is doing so. > ISTM that, when an internal flex fatal error occurs, it's > better to elog(FATAL) and terminate the problematic > process. This might lead to the server crash (e.g., if > postmaster emits a FATAL error, it and its all child processes > will exit soon). But probably we can live with this because the > fatal error basically rarely happens. I agree with this. > OTOH, if we make the process keep running even after it gets an internal > fatal error (like Sawada's patch or your idea do), this might cause more > serious problem. 
Please imagine the case where one walsender gets the fatal > error (e.g., because of OOM), abandon new setting value of > synchronous_standby_names, and keep running with the previous setting value. > OTOH, the other walsender processes successfully parse the setting and > keep running with new setting. In this case, the inconsistency of the setting > which each walsender is based on happens. This completely will mess up the > synchronous replication. On the other hand, guc-file.l seems to ignore parser errors under normal operation, even though it may cause similar inconsistency, if any.. | LOG: received SIGHUP, reloading configuration files | LOG: input in flex scanner failed at file "/home/horiguti/data/data_work/postgresql.conf" line 1 | LOG: configuration file "/home/horiguti/data/data_work/postgresql.conf" contains errors; no changes were applied > Therefore, I think that it's better to make the problematic process exit > with FATAL error rather than ignore the error and keep it running. +1. Restarting the walsender would be far less harmful than keeping it running in a doubtful state. Should I wait for the next version or have a look at the latest? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Mar 29, 2016 at 5:36 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > I personally don't think it needs such a survive measure. It is > very small syntax and the parser reads very short text. If the > parser failes in such mode, something more serious should have > occurred. > > At Tue, 29 Mar 2016 16:51:02 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwFth8pnYhaLBx0nF8o4qmwctdzEOcWRqEu7HOwgdJGa3g@mail.gmail.com> >> On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > Hello, >> > >> > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com> >> > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI >> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > As mentioned in my comment, SQL parser converts yy_fatal_error >> > into ereport(ERROR), which can be caught by the upper PG_TRY (by >> > #define'ing fprintf). So it is doable if you mind exit(). >> >> I'm afraid that your idea doesn't work in postmaster. Because ereport(ERROR) is >> implicitly promoted to ereport(FATAL) in postmaster. IOW, when an internal >> flex fatal error occurs, postmaster just exits instead of jumping out of parser. > > If The ERROR may be LOG or DEBUG2 either, if we think the parser > fatal erros are recoverable. guc-file.l is doing so. > >> ISTM that, when an internal flex fatal error occurs, it's >> better to elog(FATAL) and terminate the problematic >> process. This might lead to the server crash (e.g., if >> postmaster emits a FATAL error, it and its all child processes >> will exit soon). But probably we can live with this because the >> fatal error basically rarely happens. > > I agree to this > >> OTOH, if we make the process keep running even after it gets an internal >> fatal error (like Sawada's patch or your idea do), this might cause more >> serious problem. 
Please imagine the case where one walsender gets the fatal >> error (e.g., because of OOM), abandon new setting value of >> synchronous_standby_names, and keep running with the previous setting value. >> OTOH, the other walsender processes successfully parse the setting and >> keep running with new setting. In this case, the inconsistency of the setting >> which each walsender is based on happens. This completely will mess up the >> synchronous replication. > > On the other hand, guc-file.l seems ignoring parser errors under > normal operation, even though it may cause similar inconsistency, > if any.. > > | LOG: received SIGHUP, reloading configuration files > | LOG: input in flex scanner failed at file "/home/horiguti/data/data_work/postgresql.conf" line 1 > | LOG: configuration file "/home/horiguti/data/data_work/postgresql.conf" contains errors; no changes were applied > >> Therefore, I think that it's better to make the problematic process exit >> with FATAL error rather than ignore the error and keep it running. > > +1. Restarting walsender would be far less harmful than keeping > it running in doubtful state. > > Sould I wait for the next version or have a look on the latest? > Attached latest patch incorporate some review comments so far, and is rebased against current HEAD. Regards, -- Masahiko Sawada
Attachment
On Wed, Mar 30, 2016 at 11:43 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Tue, Mar 29, 2016 at 5:36 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> I personally don't think it needs such a survive measure. It is >> very small syntax and the parser reads very short text. If the >> parser failes in such mode, something more serious should have >> occurred. >> >> At Tue, 29 Mar 2016 16:51:02 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwFth8pnYhaLBx0nF8o4qmwctdzEOcWRqEu7HOwgdJGa3g@mail.gmail.com> >>> On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI >>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>> > Hello, >>> > >>> > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com> >>> > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI >>> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>> > As mentioned in my comment, SQL parser converts yy_fatal_error >>> > into ereport(ERROR), which can be caught by the upper PG_TRY (by >>> > #define'ing fprintf). So it is doable if you mind exit(). >>> >>> I'm afraid that your idea doesn't work in postmaster. Because ereport(ERROR) is >>> implicitly promoted to ereport(FATAL) in postmaster. IOW, when an internal >>> flex fatal error occurs, postmaster just exits instead of jumping out of parser. >> >> If The ERROR may be LOG or DEBUG2 either, if we think the parser >> fatal erros are recoverable. guc-file.l is doing so. >> >>> ISTM that, when an internal flex fatal error occurs, it's >>> better to elog(FATAL) and terminate the problematic >>> process. This might lead to the server crash (e.g., if >>> postmaster emits a FATAL error, it and its all child processes >>> will exit soon). But probably we can live with this because the >>> fatal error basically rarely happens. 
>> >> I agree to this >> >>> OTOH, if we make the process keep running even after it gets an internal >>> fatal error (like Sawada's patch or your idea do), this might cause more >>> serious problem. Please imagine the case where one walsender gets the fatal >>> error (e.g., because of OOM), abandon new setting value of >>> synchronous_standby_names, and keep running with the previous setting value. >>> OTOH, the other walsender processes successfully parse the setting and >>> keep running with new setting. In this case, the inconsistency of the setting >>> which each walsender is based on happens. This completely will mess up the >>> synchronous replication. >> >> On the other hand, guc-file.l seems ignoring parser errors under >> normal operation, even though it may cause similar inconsistency, >> if any.. >> >> | LOG: received SIGHUP, reloading configuration files >> | LOG: input in flex scanner failed at file "/home/horiguti/data/data_work/postgresql.conf" line 1 >> | LOG: configuration file "/home/horiguti/data/data_work/postgresql.conf" contains errors; no changes were applied >> >>> Therefore, I think that it's better to make the problematic process exit >>> with FATAL error rather than ignore the error and keep it running. >> >> +1. Restarting walsender would be far less harmful than keeping >> it running in doubtful state. >> >> Sould I wait for the next version or have a look on the latest? >> > > Attached latest patch incorporate some review comments so far, and is > rebased against current HEAD. > Sorry I attached wrong patch. Attached patch is correct patch. Regards, -- Masahiko Sawada
Attachment
On Thu, Mar 31, 2016 at 3:55 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Mar 30, 2016 at 11:43 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Tue, Mar 29, 2016 at 5:36 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>> I personally don't think it needs such a survive measure. It is >>> very small syntax and the parser reads very short text. If the >>> parser failes in such mode, something more serious should have >>> occurred. >>> >>> At Tue, 29 Mar 2016 16:51:02 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwFth8pnYhaLBx0nF8o4qmwctdzEOcWRqEu7HOwgdJGa3g@mail.gmail.com> >>>> On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI >>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>> > Hello, >>>> > >>>> > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com> >>>> > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI >>>> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>> > As mentioned in my comment, SQL parser converts yy_fatal_error >>>> > into ereport(ERROR), which can be caught by the upper PG_TRY (by >>>> > #define'ing fprintf). So it is doable if you mind exit(). >>>> >>>> I'm afraid that your idea doesn't work in postmaster. Because ereport(ERROR) is >>>> implicitly promoted to ereport(FATAL) in postmaster. IOW, when an internal >>>> flex fatal error occurs, postmaster just exits instead of jumping out of parser. >>> >>> If The ERROR may be LOG or DEBUG2 either, if we think the parser >>> fatal erros are recoverable. guc-file.l is doing so. >>> >>>> ISTM that, when an internal flex fatal error occurs, it's >>>> better to elog(FATAL) and terminate the problematic >>>> process. This might lead to the server crash (e.g., if >>>> postmaster emits a FATAL error, it and its all child processes >>>> will exit soon). But probably we can live with this because the >>>> fatal error basically rarely happens. 
>>> >>> I agree to this >>> >>>> OTOH, if we make the process keep running even after it gets an internal >>>> fatal error (like Sawada's patch or your idea do), this might cause more >>>> serious problem. Please imagine the case where one walsender gets the fatal >>>> error (e.g., because of OOM), abandon new setting value of >>>> synchronous_standby_names, and keep running with the previous setting value. >>>> OTOH, the other walsender processes successfully parse the setting and >>>> keep running with new setting. In this case, the inconsistency of the setting >>>> which each walsender is based on happens. This completely will mess up the >>>> synchronous replication. >>> >>> On the other hand, guc-file.l seems ignoring parser errors under >>> normal operation, even though it may cause similar inconsistency, >>> if any.. >>> >>> | LOG: received SIGHUP, reloading configuration files >>> | LOG: input in flex scanner failed at file "/home/horiguti/data/data_work/postgresql.conf" line 1 >>> | LOG: configuration file "/home/horiguti/data/data_work/postgresql.conf" contains errors; no changes were applied >>> >>>> Therefore, I think that it's better to make the problematic process exit >>>> with FATAL error rather than ignore the error and keep it running. >>> >>> +1. Restarting walsender would be far less harmful than keeping >>> it running in doubtful state. >>> >>> Sould I wait for the next version or have a look on the latest? >>> >> >> Attached latest patch incorporate some review comments so far, and is >> rebased against current HEAD. >> > > Sorry I attached wrong patch. > Attached patch is correct patch. > > [mulit_sync_replication_v21.patch] Here are some TPS numbers from some quick tests I ran on a set of Amazon EC2 m3.large instances ("2 vCPU" virtual machines) configured as primary + 3 standbys, to try out different combinations of synchronous_commit levels and synchronous_standby_names numbers. 
They were run for a short time only and these are of course systems with
limited and perhaps uneven IO and CPU, but they still give some idea of
the trends. And reassuringly, the trends are travelling in the expected
directions.

All default settings except shared_buffers = 1GB, and the GUCs required
for replication.

pgbench postgres -j2 -c2 -N bench2 -T 600

               1(*)   2(*)   3(*)
               ====   ====   ====
off          = 4056   4096   4092
local        = 1323   1299   1312
remote_write = 1130   1046    958
on           =  860    744    701
remote_apply =  785    725    604

pgbench postgres -j16 -c16 -N bench2 -T 600

               1(*)   2(*)   3(*)
               ====   ====   ====
off          = 3952   3943   3933
local        = 2964   2984   3026
remote_write = 2790   2724   2675
on           = 2731   2627   2523
remote_apply = 2627   2501   2432

One thing I noticed is that there are LOG messages telling me when a
standby becomes a synchronous standby, but nothing to tell me if a
standby stops being a standby (i.e. because a higher priority one has
taken its place in the quorum). Would that be interesting?

Also, I spotted some tiny mistakes:

+ <indexterm zone="high-availability">
+ <primary>Dedicated language for multiple synchornous replication</primary>
+ </indexterm>

s/synchornous/synchronous/

+ /*
+ * If we are managing the sync standby, though we weren't
+ * prior to this, then announce we are now the sync standby.
+ */

s/ the / a / (two occurrences)

+ ereport(LOG,
+ (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
+ application_name, MyWalSnd->sync_standby_priority)));

s/ the / a /

  offered by a transaction commit. This level of protection is referred
- to as 2-safe replication in computer science theory.
+ to as 2-safe replication in computer science theory, and group-1-safe
+ (group-safe and 1-safe) when <varname>synchronous_commit</> is set to
+ more than <literal>remote_write</>.

Why "more than"? I think those two words should be changed to "at
least", or removed.
+ <para>
+ This syntax allows us to define a synchronous group that will wait for at
+ least N standbys of them, and a comma-separated list of group members that are surrounded by
+ parantheses. The special value <literal>*</> for server name matches any standby.
+ By surrounding list of group members using parantheses, synchronous standbys are chosen from
+ that group using priority method.
+ </para>

s/parantheses/parentheses/ (two occurrences)

+ <sect2 id="dedicated-language-for-multi-sync-replication-priority">
+ <title>Prioirty Method</title>

s/Prioirty Method/Priority Method/

--
Thomas Munro
http://www.enterprisedb.com
On Thu, Mar 31, 2016 at 5:11 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Mar 31, 2016 at 3:55 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Wed, Mar 30, 2016 at 11:43 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Tue, Mar 29, 2016 at 5:36 PM, Kyotaro HORIGUCHI >>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>> I personally don't think it needs such a survive measure. It is >>>> very small syntax and the parser reads very short text. If the >>>> parser failes in such mode, something more serious should have >>>> occurred. >>>> >>>> At Tue, 29 Mar 2016 16:51:02 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwFth8pnYhaLBx0nF8o4qmwctdzEOcWRqEu7HOwgdJGa3g@mail.gmail.com> >>>>> On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI >>>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>>> > Hello, >>>>> > >>>>> > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com> >>>>> > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI >>>>> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>>> > As mentioned in my comment, SQL parser converts yy_fatal_error >>>>> > into ereport(ERROR), which can be caught by the upper PG_TRY (by >>>>> > #define'ing fprintf). So it is doable if you mind exit(). >>>>> >>>>> I'm afraid that your idea doesn't work in postmaster. Because ereport(ERROR) is >>>>> implicitly promoted to ereport(FATAL) in postmaster. IOW, when an internal >>>>> flex fatal error occurs, postmaster just exits instead of jumping out of parser. >>>> >>>> If The ERROR may be LOG or DEBUG2 either, if we think the parser >>>> fatal erros are recoverable. guc-file.l is doing so. >>>> >>>>> ISTM that, when an internal flex fatal error occurs, it's >>>>> better to elog(FATAL) and terminate the problematic >>>>> process. 
This might lead to the server crash (e.g., if >>>>> postmaster emits a FATAL error, it and its all child processes >>>>> will exit soon). But probably we can live with this because the >>>>> fatal error basically rarely happens. >>>> >>>> I agree to this >>>> >>>>> OTOH, if we make the process keep running even after it gets an internal >>>>> fatal error (like Sawada's patch or your idea do), this might cause more >>>>> serious problem. Please imagine the case where one walsender gets the fatal >>>>> error (e.g., because of OOM), abandon new setting value of >>>>> synchronous_standby_names, and keep running with the previous setting value. >>>>> OTOH, the other walsender processes successfully parse the setting and >>>>> keep running with new setting. In this case, the inconsistency of the setting >>>>> which each walsender is based on happens. This completely will mess up the >>>>> synchronous replication. >>>> >>>> On the other hand, guc-file.l seems ignoring parser errors under >>>> normal operation, even though it may cause similar inconsistency, >>>> if any.. >>>> >>>> | LOG: received SIGHUP, reloading configuration files >>>> | LOG: input in flex scanner failed at file "/home/horiguti/data/data_work/postgresql.conf" line 1 >>>> | LOG: configuration file "/home/horiguti/data/data_work/postgresql.conf" contains errors; no changes were applied >>>> >>>>> Therefore, I think that it's better to make the problematic process exit >>>>> with FATAL error rather than ignore the error and keep it running. >>>> >>>> +1. Restarting walsender would be far less harmful than keeping >>>> it running in doubtful state. >>>> >>>> Sould I wait for the next version or have a look on the latest? >>>> >>> >>> Attached latest patch incorporate some review comments so far, and is >>> rebased against current HEAD. >>> >> >> Sorry I attached wrong patch. >> Attached patch is correct patch. 
>> >> [mulit_sync_replication_v21.patch] > > Here are some TPS numbers from some quick tests I ran on a set of > Amazon EC2 m3.large instances ("2 vCPU" virtual machines) configured > as primary + 3 standbys, to try out different combinations of > synchronous_commit levels and synchronous_standby_names numbers. They > were run for a short time only and these are of course systems with > limited and perhaps uneven IO and CPU, but they still give some idea > of the trends. And reassuringly, the trends are travelling in the > expected directions. > > All default settings except shared_buffers = 1GB, and the GUCs > required for replication. > > pgbench postgres -j2 -c2 -N bench2 -T 600 > > 1(*) 2(*) 3(*) > ==== ==== ==== > off = 4056 4096 4092 > local = 1323 1299 1312 > remote_write = 1130 1046 958 > on = 860 744 701 > remote_apply = 785 725 604 > > pgbench postgres -j16 -c16 -N bench2 -T 600 > > 1(*) 2(*) 3(*) > ==== ==== ==== > off = 3952 3943 3933 > local = 2964 2984 3026 > remote_write = 2790 2724 2675 > on = 2731 2627 2523 > remote_apply = 2627 2501 2432 > > One thing I noticed is that there are LOG messages telling me when a > standby becomes a synchronous standby, but nothing to tell me if a > standby stops being a standby (ie because a higher priority one has > taken its place in the quorum). Would that be interesting? > > Also, I spotted some tiny mistakes: > > + <indexterm zone="high-availability"> > + <primary>Dedicated language for multiple synchornous replication</primary> > + </indexterm> > > s/synchornous/synchronous/ > > + /* > + * If we are managing the sync standby, though we weren't > + * prior to this, then announce we are now the sync standby. > + */ > > s/ the / a / (two occurrences) > > + ereport(LOG, > + (errmsg("standby \"%s\" is now the synchronous standby with priority %u", > + application_name, MyWalSnd->sync_standby_priority))); > > s/ the / a / > > offered by a transaction commit. 
> This level of protection is referred
> - to as 2-safe replication in computer science theory.
> + to as 2-safe replication in computer science theory, and group-1-safe
> + (group-safe and 1-safe) when <varname>synchronous_commit</> is set to
> + more than <literal>remote_write</>.
>
> Why "more than"? I think those two words should be changed to "at
> least", or removed.
>
> + <para>
> + This syntax allows us to define a synchronous group that will wait for at
> + least N standbys of them, and a comma-separated list of group members that are surrounded by
> + parantheses. The special value <literal>*</> for server name matches any standby.
> + By surrounding list of group members using parantheses, synchronous standbys are chosen from
> + that group using priority method.
> + </para>
>
> s/parantheses/parentheses/ (two occurrences)
>
> + <sect2 id="dedicated-language-for-multi-sync-replication-priority">
> + <title>Prioirty Method</title>
>
> s/Prioirty Method/Priority Method/

A couple more comments:

  /*
- * If we aren't managing the highest priority standby then just leave.
+ * If the number of sync standbys is less than requested or we aren't
+ * managing the sync standby then just leave.
  */
- if (syncWalSnd != MyWalSnd)
+ if (!got_oldest || !am_sync)

s/ the sync / a sync /

+ /*
+ * Consider all pending standbys as sync if the number of them plus
+ * already-found sync ones is lower than the configuration requests.
+ */
+ if (list_length(result) + list_length(pending) <= SyncRepConfig->num_sync)
+     return list_concat(result, pending);

The cells from 'pending' will be attached to 'result', and 'result'
will be freed by the caller. But won't the List header object from
'pending' be leaked?
+ result = lappend_int(result, i);
+ if (list_length(result) == SyncRepConfig->num_sync)
+ {
+     list_free(pending);
+     return result;      /* Exit if got enough sync standbys */
+ }

If we didn't take the early return in the list-not-long-enough case
mentioned above, we should *always* exit via this return statement,
right? Since we know that the pending list had enough elements to
reach num_sync. I think that is worth a comment, and also a "not
reached" comment at the bottom of the function, if it is true.

As a future improvement, I wonder if we could avoid recomputing the
current set of sync standbys in every walsender every time we call
SyncRepReleaseWaiters, perhaps by maintaining that set incrementally
in shmem when walsender states change etc.

I don't have any other comments, other than to say: thank you to all
the people who have contributed to this feature so far and I really
really hope it goes into 9.6!

--
Thomas Munro
http://www.enterprisedb.com
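The selection logic these review comments walk through — take every candidate when the list is short enough, otherwise stop as soon as num_sync standbys are chosen, which makes the bottom of the function unreachable — can be sketched with plain arrays in place of the backend's List type. This is a simplified stand-in under stated assumptions (candidates are already ordered by priority, best first; the names choose_sync_standbys, cand and out are invented), not the patch's SyncRepGetSyncStandbys() itself.

```c
/*
 * Pick up to num_sync standbys from cand[] (already ordered by
 * priority, best first) into out[].  Returns the number chosen.
 */
int choose_sync_standbys(const int *cand, int ncand, int num_sync, int *out)
{
    int chosen = 0;

    /* All candidates fit within the requested count: take them all. */
    if (ncand <= num_sync)
    {
        for (int i = 0; i < ncand; i++)
            out[chosen++] = cand[i];
        return chosen;
    }

    /*
     * Otherwise ncand > num_sync, so the loop below is guaranteed to
     * hit num_sync before running out of candidates.
     */
    for (int i = 0; i < ncand; i++)
    {
        out[chosen++] = cand[i];
        if (chosen == num_sync)
            return chosen;      /* exit as soon as we have enough */
    }

    /* not reached: the early return above always fires in this branch */
    return chosen;
}
```

The structure makes Munro's point concrete: once the "short list" case has returned, the second loop can only leave via the in-loop return, so a "not reached" marker (Assert(false) in the committed version) documents the invariant.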
On Sat, Apr 2, 2016 at 10:20 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Mar 31, 2016 at 5:11 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Thu, Mar 31, 2016 at 3:55 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Wed, Mar 30, 2016 at 11:43 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Tue, Mar 29, 2016 at 5:36 PM, Kyotaro HORIGUCHI >>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>>> I personally don't think it needs such a survive measure. It is >>>>> very small syntax and the parser reads very short text. If the >>>>> parser failes in such mode, something more serious should have >>>>> occurred. >>>>> >>>>> At Tue, 29 Mar 2016 16:51:02 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwFth8pnYhaLBx0nF8o4qmwctdzEOcWRqEu7HOwgdJGa3g@mail.gmail.com> >>>>>> On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI >>>>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>>>> > Hello, >>>>>> > >>>>>> > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q@mail.gmail.com> >>>>>> > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI >>>>>> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>>>> > As mentioned in my comment, SQL parser converts yy_fatal_error >>>>>> > into ereport(ERROR), which can be caught by the upper PG_TRY (by >>>>>> > #define'ing fprintf). So it is doable if you mind exit(). >>>>>> >>>>>> I'm afraid that your idea doesn't work in postmaster. Because ereport(ERROR) is >>>>>> implicitly promoted to ereport(FATAL) in postmaster. IOW, when an internal >>>>>> flex fatal error occurs, postmaster just exits instead of jumping out of parser. >>>>> >>>>> If The ERROR may be LOG or DEBUG2 either, if we think the parser >>>>> fatal erros are recoverable. guc-file.l is doing so. 
>>>>> >>>>>> ISTM that, when an internal flex fatal error occurs, it's >>>>>> better to elog(FATAL) and terminate the problematic >>>>>> process. This might lead to the server crash (e.g., if >>>>>> postmaster emits a FATAL error, it and its all child processes >>>>>> will exit soon). But probably we can live with this because the >>>>>> fatal error basically rarely happens. >>>>> >>>>> I agree to this >>>>> >>>>>> OTOH, if we make the process keep running even after it gets an internal >>>>>> fatal error (like Sawada's patch or your idea do), this might cause more >>>>>> serious problem. Please imagine the case where one walsender gets the fatal >>>>>> error (e.g., because of OOM), abandon new setting value of >>>>>> synchronous_standby_names, and keep running with the previous setting value. >>>>>> OTOH, the other walsender processes successfully parse the setting and >>>>>> keep running with new setting. In this case, the inconsistency of the setting >>>>>> which each walsender is based on happens. This completely will mess up the >>>>>> synchronous replication. >>>>> >>>>> On the other hand, guc-file.l seems ignoring parser errors under >>>>> normal operation, even though it may cause similar inconsistency, >>>>> if any.. >>>>> >>>>> | LOG: received SIGHUP, reloading configuration files >>>>> | LOG: input in flex scanner failed at file "/home/horiguti/data/data_work/postgresql.conf" line 1 >>>>> | LOG: configuration file "/home/horiguti/data/data_work/postgresql.conf" contains errors; no changes were applied >>>>> >>>>>> Therefore, I think that it's better to make the problematic process exit >>>>>> with FATAL error rather than ignore the error and keep it running. >>>>> >>>>> +1. Restarting walsender would be far less harmful than keeping >>>>> it running in doubtful state. >>>>> >>>>> Sould I wait for the next version or have a look on the latest? 
>>>>>
>>>>
>>>> Attached latest patch incorporate some review comments so far, and is
>>>> rebased against current HEAD.
>>>>
>>>
>>> Sorry I attached wrong patch.
>>> Attached patch is correct patch.

Thanks for updating the patch! I applied the following changes to the
patch. Attached is the revised version of the patch.

- Changed syncrep_flex_fatal() so that it just calls ereport(FATAL),
  based on the recent discussion with Horiguchi-san.
- Improved the documentation.
- Fixed some bugs.
- Removed the changes for recovery testing framework. I'd like to commit
  those changes later separately from the main patch of multiple sync rep.

Barring any objections, I'll commit this patch.

>> One thing I noticed is that there are LOG messages telling me when a
>> standby becomes a synchronous standby, but nothing to tell me if a
>> standby stops being a standby (ie because a higher priority one has
>> taken its place in the quorum). Would that be interesting?

+1

>> Also, I spotted some tiny mistakes:
>>
>> + <indexterm zone="high-availability">
>> + <primary>Dedicated language for multiple synchornous replication</primary>
>> + </indexterm>
>>
>> s/synchornous/synchronous/

Confirmed that there is no typo "synchornous" in the latest patch.

>> + /*
>> + * If we are managing the sync standby, though we weren't
>> + * prior to this, then announce we are now the sync standby.
>> + */
>>
>> s/ the / a / (two occurrences)

Fixed.

>> + ereport(LOG,
>> + (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
>> + application_name, MyWalSnd->sync_standby_priority)));
>>
>> s/ the / a /

I have no objection to this change itself. But we have used this message
in 9.5 or before, so if we apply this change, probably we need
back-patching.

>> offered by a transaction commit. This level of protection is referred
>> - to as 2-safe replication in computer science theory.
>> + to as 2-safe replication in computer science theory, and group-1-safe
>> + (group-safe and 1-safe) when <varname>synchronous_commit</> is set to
>> + more than <literal>remote_write</>.
>>
>> Why "more than"? I think those two words should be changed to "at
>> least", or removed.

Removed.

>> + <para>
>> + This syntax allows us to define a synchronous group that will wait for at
>> + least N standbys of them, and a comma-separated list of group members that are surrounded by
>> + parantheses. The special value <literal>*</> for server name matches any standby.
>> + By surrounding list of group members using parantheses, synchronous standbys are chosen from
>> + that group using priority method.
>> + </para>
>>
>> s/parantheses/parentheses/ (two occurrences)

Confirmed that this typo doesn't exist in the latest patch.

>> + <sect2 id="dedicated-language-for-multi-sync-replication-priority">
>> + <title>Prioirty Method</title>
>>
>> s/Prioirty Method/Priority Method/

Confirmed that this typo doesn't exist in the latest patch.

> A couple more comments:
>
>   /*
> - * If we aren't managing the highest priority standby then just leave.
> + * If the number of sync standbys is less than requested or we aren't
> + * managing the sync standby then just leave.
>   */
> - if (syncWalSnd != MyWalSnd)
> + if (!got_oldest || !am_sync)
>
> s/ the sync / a sync /

Fixed.

> + /*
> + * Consider all pending standbys as sync if the number of them plus
> + * already-found sync ones is lower than the configuration requests.
> + */
> + if (list_length(result) + list_length(pending) <= SyncRepConfig->num_sync)
> +     return list_concat(result, pending);
>
> The cells from 'pending' will be attached to 'result', and 'result'
> will be freed by the caller. But won't the List header object from
> 'pending' be leaked?

Yes, if 'result' is not NIL. I added pfree(pending) for that case.
> + result = lappend_int(result, i);
> + if (list_length(result) == SyncRepConfig->num_sync)
> + {
> +     list_free(pending);
> +     return result;      /* Exit if got enough sync standbys */
> + }
>
> If we didn't take the early return in the list-not-long-enough case
> mentioned above, we should *always* exit via this return statement,
> right? Since we know that the pending list had enough elements to
> reach num_sync. I think that is worth a comment, and also a "not
> reached" comment at the bottom of the function, if it is true.

Good catch! I added the comments. Also added Assert(false) at the bottom
of the function.

> As a future improvement, I wonder if we could avoid recomputing the
> current set of sync standbys in every walsender every time we call
> SyncRepReleaseWaiters, perhaps by maintaining that set incrementally
> in shmem when walsender states change etc.

+1

> I don't have any other comments, other than to say: thank you to all
> the people who have contributed to this feature so far and I really
> really hope it goes into 9.6!

+1000

Regards,

--
Fujii Masao
Attachment
At 2016-04-04 17:28:07 +0900, masao.fujii@gmail.com wrote: > > Barring any objections, I'll commit this patch. No objections, just a minor wording tweak: doc/src/sgml/config.sgml: "The synchronous standbys will be the standbys that their names appear early in this list" should be "The synchronous standbys will be those whose names appear earlier in this list". doc/src/sgml/high-availability.sgml: "The standbys that their names appear early in this list are given higher priority and will be considered as synchronous" should be "The standbys whose names appear earlier in the list are given higher priority and will be considered as synchronous". "The standbys that their names appear early in the list will be used as the synchronous standby" should be "The standbys whose names appear earlier in the list will be used as synchronous standbys". You may prefer to reword this in some other way, but the current "that their names appear" wording should be changed. -- Abhijit
Hello, thank you for testing. At Sat, 2 Apr 2016 14:20:55 +1300, Thomas Munro <thomas.munro@enterprisedb.com> wrote in <CAEepm=2sdDL2hs3XbWb5FORegNHBObLJ-8C2=aaeG-riZTd2Rw@mail.gmail.com> > >>> Attached latest patch incorporate some review comments so far, and is > >>> rebased against current HEAD. > >>> > >> > >> Sorry I attached wrong patch. > >> Attached patch is correct patch. > >> > >> [mulit_sync_replication_v21.patch] > > > > Here are some TPS numbers from some quick tests I ran on a set of > > Amazon EC2 m3.large instances ("2 vCPU" virtual machines) configured > > as primary + 3 standbys, to try out different combinations of > > synchronous_commit levels and synchronous_standby_names numbers. They > > were run for a short time only and these are of course systems with > > limited and perhaps uneven IO and CPU, but they still give some idea > > of the trends. And reassuringly, the trends are travelling in the > > expected directions. > > > > All default settings except shared_buffers = 1GB, and the GUCs > > required for replication. > > > > pgbench postgres -j2 -c2 -N bench2 -T 600 > > > > 1(*) 2(*) 3(*) > > ==== ==== ==== > > off = 4056 4096 4092 > > local = 1323 1299 1312 > > remote_write = 1130 1046 958 > > on = 860 744 701 > > remote_apply = 785 725 604 > > > > pgbench postgres -j16 -c16 -N bench2 -T 600 > > > > 1(*) 2(*) 3(*) > > ==== ==== ==== > > off = 3952 3943 3933 > > local = 2964 2984 3026 > > remote_write = 2790 2724 2675 > > on = 2731 2627 2523 > > remote_apply = 2627 2501 2432 > > > > One thing I noticed is that there are LOG messages telling me when a > > standby becomes a synchronous standby, but nothing to tell me if a > > standby stops being a standby (ie because a higher priority one has > > taken its place in the quorum). Would that be interesting? A walsender exits by proc_exit() for any operational termination so wrapping proc_exit() should work. 
(Attached file 1) For the setting "2(Sby1, Sby2, Sby3)", the master says
that all of the standbys are sync standbys, and no message is emitted on
the failure of Sby1, which should cause a promotion of Sby3.

> standby "Sby3" is now the synchronous standby with priority 3
> standby "Sby2" is now the synchronous standby with priority 2
> standby "Sby1" is now the synchronous standby with priority 1
..<Sby 1 failure>
> standby "Sby3" is now the synchronous standby with priority 3

Sby3 becomes a sync standby twice :p This was a behavior taken over from
the single-sync-rep era, but it is confusing under the new sync-rep
selection mechanism. The second attached diff changes this to the
following.

> 17:48:21.969 LOG: standby "Sby3" is now a synchronous standby with priority 3
> 17:48:23.087 LOG: standby "Sby2" is now a synchronous standby with priority 2
> 17:48:25.617 LOG: standby "Sby1" is now a synchronous standby with priority 1
> 17:48:31.990 LOG: standby "Sby3" is now a potential synchronous standby with priority 3
> 17:48:43.905 LOG: standby "Sby3" is now a synchronous standby with priority 3
> 17:49:10.262 LOG: standby "Sby1" is now a synchronous standby with priority 1
> 17:49:13.865 LOG: standby "Sby3" is now a potential synchronous standby with priority 3

Since this status check takes place for every reply from the standbys,
the message about downgrading to "potential" may be deferred or may even
fail to appear, but that should be no problem. With both of the above
patches applied, the messages look like the following.
> 17:54:08.367 LOG: standby "Sby3" is now a synchronous standby with priority 3
> 17:54:08.564 LOG: standby "Sby1" is now a synchronous standby with priority 1
> 17:54:08.565 LOG: standby "Sby2" is now a synchronous standby with priority 2
> 17:54:18.387 LOG: standby "Sby3" is now a potential synchronous standby with priority 3
> 17:54:28.887 LOG: synchronous standby "Sby1" with priority 1 exited
> 17:54:31.359 LOG: standby "Sby3" is now a synchronous standby with priority 3
> 17:54:39.008 LOG: standby "Sby1" is now a synchronous standby with priority 1
> 17:54:41.382 LOG: standby "Sby3" is now a potential synchronous standby with priority 3

Does this make sense?

By the way, Sawada-san, you have changed the parentheses for the priority method from '[]' to '()'. And I mistakenly defined s_s_names as '2[Sby1, Sby2, Sby3]' and got wrong behavior, that is, only Sby2 is registered as a mandatory synchronous standby.

In this case, the three members of SyncRepConfig are '2[Sby1,', 'Sby2', 'Sby3]'. This syntax is valid under the current specification but will surely get a different meaning with future changes. We should refuse this known-to-be-wrong-in-future syntax from now on.

Also, this error was very hard to notice. pg_settings only shows the string itself:

=# select name, setting from pg_settings where name = 'synchronous_standby_names';
           name            |       setting
---------------------------+---------------------
 synchronous_standby_names | 2[Sby1, Sby2, Sby3]
(1 row)

Since the syntax is no longer so simple, we may need some means to see the current standby-group setting clearly, but it won't be needed if we refuse the known-to-be-wrong-in-future syntax now.
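The misparse described above is easy to reproduce: a splitter that only knows about commas leaves the bracket characters glued to the neighboring names. The function below is a hypothetical stand-in for illustration, not the actual GUC list parser (SplitIdentifierString() also handles quoting and downcasing).

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Sketch of a comma-only list splitter: it knows nothing about '[' or
 * '(', so bracket characters simply stay attached to the neighboring
 * member names.  '2[Sby1, Sby2, Sby3]' thus yields three "names", of
 * which only Sby2 matches a real standby.
 */
static int
split_on_commas(const char *raw, char items[][64], int max_items)
{
    int         n = 0;
    const char *p = raw;

    while (*p != '\0' && n < max_items)
    {
        const char *end;
        size_t      len;

        while (isspace((unsigned char) *p))     /* skip leading spaces */
            p++;
        end = strchr(p, ',');
        if (end == NULL)
            end = p + strlen(p);
        len = end - p;
        while (len > 0 && isspace((unsigned char) p[len - 1]))
            len--;                              /* trim trailing spaces */
        memcpy(items[n], p, len);
        items[n][len] = '\0';
        n++;
        p = (*end == ',') ? end + 1 : end;
    }
    return n;
}
```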
> > Also, I spotted some tiny mistakes:
> >
> > +   <indexterm zone="high-availability">
> > +    <primary>Dedicated language for multiple synchornous replication</primary>
> > +   </indexterm>
> >
> > s/synchornous/synchronous/
> >
> > +    /*
> > +     * If we are managing the sync standby, though we weren't
> > +     * prior to this, then announce we are now the sync standby.
> > +     */
> >
> > s/ the / a / (two occurrences)
> >
> > +      ereport(LOG,
> > +        (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
> > +            application_name, MyWalSnd->sync_standby_priority)));
> >
> > s/ the / a /
> >
> >     offered by a transaction commit.  This level of protection is referred
> > -   to as 2-safe replication in computer science theory.
> > +   to as 2-safe replication in computer science theory, and group-1-safe
> > +   (group-safe and 1-safe) when <varname>synchronous_commit</> is set to
> > +   more than <literal>remote_write</>.
> >
> > Why "more than"?  I think those two words should be changed to "at
> > least", or removed.
> >
> > +   <para>
> > +    This syntax allows us to define a synchronous group that will wait for at
> > +    least N standbys of them, and a comma-separated list of group
> > members that are surrounded by
> > +    parantheses.  The special value <literal>*</> for server name
> > matches any standby.
> > +    By surrounding list of group members using parantheses,
> > synchronous standbys are chosen from
> > +    that group using priority method.
> > +   </para>
> >
> > s/parantheses/parentheses/ (two occurrences)
> >
> > +  <sect2 id="dedicated-language-for-multi-sync-replication-priority">
> > +   <title>Prioirty Method</title>
> >
> > s/Prioirty Method/Priority Method/
>
> A couple more comments:
>
>  /*
> - * If we aren't managing the highest priority standby then just leave.
> + * If the number of sync standbys is less than requested or we aren't
> + * managing the sync standby then just leave.
>  */
> - if (syncWalSnd != MyWalSnd)
> + if (!got_oldest || !am_sync)
>
> s/ the sync / a sync /
>
> + /*
> +  * Consider all pending standbys as sync if the number of them plus
> +  * already-found sync ones is lower than the configuration requests.
> +  */
> + if (list_length(result) + list_length(pending) <= SyncRepConfig->num_sync)
> +     return list_concat(result, pending);
>
> The cells from 'pending' will be attached to 'result', and 'result'
> will be freed by the caller.  But won't the List header object from
> 'pending' be leaked?
>
> + result = lappend_int(result, i);
> + if (list_length(result) == SyncRepConfig->num_sync)
> + {
> +     list_free(pending);
> +     return result; /* Exit if got enough sync standbys */
> + }
>
> If we didn't take the early return in the list-not-long-enough case
> mentioned above, we should *always* exit via this return statement,
> right?  Since we know that the pending list had enough elements to
> reach num_sync.  I think that is worth a comment, and also a "not
> reached" comment at the bottom of the function, if it is true.
>
> As a future improvement, I wonder if we could avoid recomputing the
> current set of sync standbys in every walsender every time we call
> SyncRepReleaseWaiters, perhaps by maintaining that set incrementally
> in shmem when walsender states change etc.
>
> I don't have any other comments, other than to say: thank you to all
> the people who have contributed to this feature so far and I really
> really hope it goes into 9.6!
regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0867cc4..77d24f5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -184,6 +184,8 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+static void walsnd_proc_exit(int code);
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -242,6 +244,23 @@ InitWalSender(void)
 	SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
 }
 
+static void
+walsnd_proc_exit(int code)
+{
+	WalSnd	   *walsnd = MyWalSnd;
+	int			mypriority = 0;
+
+	SpinLockAcquire(&walsnd->mutex);
+	mypriority = walsnd->sync_standby_priority;
+	SpinLockRelease(&walsnd->mutex);
+
+	if (mypriority > 0)
+		ereport(LOG,
+				(errmsg("synchronous standby \"%s\" with priority %d exited",
+						application_name, mypriority)));
+	proc_exit(code);
+}
+
 /*
  * Clean up after an error.
  *
@@ -266,7 +285,7 @@ WalSndErrorCleanup(void)
 	replication_active = false;
 
 	if (walsender_ready_to_stop)
-		proc_exit(0);
+		walsnd_proc_exit(0);
 
 	/* Revert back to startup state */
 	WalSndSetState(WALSNDSTATE_STARTUP);
@@ -285,7 +304,7 @@ WalSndShutdown(void)
 	if (whereToSendOutput == DestRemote)
 		whereToSendOutput = DestNone;
 
-	proc_exit(0);
+	walsnd_proc_exit(0);
 	abort();					/* keep the compiler quiet */
 }
 
@@ -673,7 +692,7 @@ StartReplication(StartReplicationCmd *cmd)
 	replication_active = false;
 
 	if (walsender_ready_to_stop)
-		proc_exit(0);
+		walsnd_proc_exit(0);
 	WalSndSetState(WALSNDSTATE_STARTUP);
 
 	Assert(streamingDoneSending && streamingDoneReceiving);
@@ -1027,7 +1046,7 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	replication_active = false;
 
 	if (walsender_ready_to_stop)
-		proc_exit(0);
+		walsnd_proc_exit(0);
 	WalSndSetState(WALSNDSTATE_STARTUP);
 
 	/* Get out of COPY mode (CommandComplete). */
@@ -1391,7 +1410,7 @@ ProcessRepliesIfAny(void)
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
-			proc_exit(0);
+			walsnd_proc_exit(0);
 		}
 		if (r == 0)
 		{
@@ -1407,7 +1426,7 @@ ProcessRepliesIfAny(void)
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
-			proc_exit(0);
+			walsnd_proc_exit(0);
 		}
 
 		/*
@@ -1453,7 +1472,7 @@ ProcessRepliesIfAny(void)
 				 * 'X' means that the standby is closing down the socket.
 				 */
 			case 'X':
-				proc_exit(0);
+				walsnd_proc_exit(0);
 
 			default:
 				ereport(FATAL,
@@ -1500,7 +1519,7 @@ ProcessStandbyMessage(void)
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected message type \"%c\"", msgtype)));
-			proc_exit(0);
+			walsnd_proc_exit(0);
 	}
 }
 
@@ -2501,7 +2520,7 @@ WalSndDone(WalSndSendDataCallback send_data)
 		EndCommand("COPY 0", DestRemote);
 		pq_flush();
 
-		proc_exit(0);
+		walsnd_proc_exit(0);
 	}
 	if (!waiting_for_ping_response)
 		WalSndKeepalive(true);

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 6692027..6e120f3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -64,7 +64,13 @@ char	   *SyncRepStandbyNames;
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
-static bool announce_next_takeover = true;
+typedef enum syncrep_state {
+	SSTATE_NONE,
+	SSTATE_POTENTIAL,
+	SSTATE_SYNC,
+} sync_state;
+
+static sync_state syncrep_state = SSTATE_NONE;
 
 SyncRepConfigData *SyncRepConfig;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
@@ -416,22 +422,26 @@ SyncRepReleaseWaiters(void)
 	 * If we are managing the sync standby, though we weren't prior to
 	 * this, then announce we are now the sync standby.
 	 */
-	if (announce_next_takeover && am_sync)
+	if ((syncrep_state != SSTATE_POTENTIAL &&
+		 !am_sync && MyWalSnd->sync_standby_priority > 0) ||
+		(syncrep_state != SSTATE_SYNC && am_sync))
 	{
-		announce_next_takeover = false;
 		ereport(LOG,
-				(errmsg("standby \"%s\" is now the synchronous standby with priority %u",
-						application_name, MyWalSnd->sync_standby_priority)));
+				(errmsg("standby \"%s\" is now a %ssynchronous standby with priority %u",
+						application_name,
+						am_sync ? "" : "potential ",
+						MyWalSnd->sync_standby_priority)));
+		syncrep_state = (am_sync ? SSTATE_SYNC : SSTATE_POTENTIAL);
 	}
 
 	/*
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing the sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_oldest || MyWalSnd->sync_standby_priority == 0)
 	{
 		LWLockRelease(SyncRepLock);
-		announce_next_takeover = !am_sync;
+		syncrep_state = SSTATE_NONE;
 		return;
 	}
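The syncrep.c diff above essentially replaces the announce_next_takeover boolean with a three-state tracker. Condensed into a pure function (a sketch of the announcement rule only, not the exact patch control flow or locking), the logic is:

```c
#include <assert.h>

/*
 * Sketch of the three-state announcement logic: given the walsender's
 * previous state, whether it is currently chosen as sync, and whether it
 * has a nonzero priority (i.e. is listed in synchronous_standby_names),
 * decide the next state and whether a LOG message would be emitted.
 */
typedef enum { SSTATE_NONE, SSTATE_POTENTIAL, SSTATE_SYNC } sync_state;

static int                      /* returns 1 if a LOG message is emitted */
announce(sync_state *state, int am_sync, int priority)
{
    if (priority == 0)          /* not listed: never announce */
    {
        *state = SSTATE_NONE;
        return 0;
    }
    if (am_sync && *state != SSTATE_SYNC)
    {
        *state = SSTATE_SYNC;
        return 1;               /* "... is now a synchronous standby" */
    }
    if (!am_sync && *state != SSTATE_POTENTIAL)
    {
        *state = SSTATE_POTENTIAL;
        return 1;               /* "... is now a potential synchronous standby" */
    }
    return 0;                   /* state unchanged, stay quiet */
}
```

This reproduces the Sby3 sync/potential/sync sequence from the quoted log: each transition is announced once, and repeated checks in the same state are silent.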
Barring any objections, I'll commit this patch.
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-04-04 10:35:34 +0100, Simon Riggs wrote: > On 4 April 2016 at 09:28, Fujii Masao <masao.fujii@gmail.com> wrote: > > Barring any objections, I'll commit this patch. No objection here either, just one question: Has anybody thought about the ability to extend this to do per-database syncrep? Logical decoding works on a database level, and that can cause some problems with global configuration. > That sounds good. > > May I have one more day to review this? Actually more like 3-4 hours. > I have no comments on an initial read, so I'm hopeful of having nothing at > all to say on it. Simon, perhaps you could hold the above question in your mind while looking through this? Thanks, Andres
On Mon, Apr 4, 2016 at 6:03 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, thank you for testing. > > At Sat, 2 Apr 2016 14:20:55 +1300, Thomas Munro <thomas.munro@enterprisedb.com> wrote in <CAEepm=2sdDL2hs3XbWb5FORegNHBObLJ-8C2=aaeG-riZTd2Rw@mail.gmail.com> >> >>> Attached latest patch incorporate some review comments so far, and is >> >>> rebased against current HEAD. >> >>> >> >> >> >> Sorry I attached wrong patch. >> >> Attached patch is correct patch. >> >> >> >> [mulit_sync_replication_v21.patch] >> > >> > Here are some TPS numbers from some quick tests I ran on a set of >> > Amazon EC2 m3.large instances ("2 vCPU" virtual machines) configured >> > as primary + 3 standbys, to try out different combinations of >> > synchronous_commit levels and synchronous_standby_names numbers. They >> > were run for a short time only and these are of course systems with >> > limited and perhaps uneven IO and CPU, but they still give some idea >> > of the trends. And reassuringly, the trends are travelling in the >> > expected directions. >> > >> > All default settings except shared_buffers = 1GB, and the GUCs >> > required for replication. >> > >> > pgbench postgres -j2 -c2 -N bench2 -T 600 >> > >> > 1(*) 2(*) 3(*) >> > ==== ==== ==== >> > off = 4056 4096 4092 >> > local = 1323 1299 1312 >> > remote_write = 1130 1046 958 >> > on = 860 744 701 >> > remote_apply = 785 725 604 >> > >> > pgbench postgres -j16 -c16 -N bench2 -T 600 >> > >> > 1(*) 2(*) 3(*) >> > ==== ==== ==== >> > off = 3952 3943 3933 >> > local = 2964 2984 3026 >> > remote_write = 2790 2724 2675 >> > on = 2731 2627 2523 >> > remote_apply = 2627 2501 2432 >> > >> > One thing I noticed is that there are LOG messages telling me when a >> > standby becomes a synchronous standby, but nothing to tell me if a >> > standby stops being a standby (ie because a higher priority one has >> > taken its place in the quorum). Would that be interesting? 
> > A walsender exits by proc_exit() for any operational > termination so wrapping proc_exit() should work. (Attached file 1) > > For the setting "2(Sby1, Sby2, Sby3)", the master says that all > of the standbys are sync-standbys and no message is emited on > failure of Sby1, which should cause a promotion of Sby3. > >> standby "Sby3" is now the synchronous standby with priority 3 >> standby "Sby2" is now the synchronous standby with priority 2 >> standby "Sby1" is now the synchronous standby with priority 1 > ..<Sby 1 failure> >> standby "Sby3" is now the synchronous standby with priority 3 > > Sby3 becomes sync standby twice:p > > This was a behavior taken over from the single-sync-rep era but > it should be confusing for the new sync-rep selection mechanism. > The second attached diff makes this as the following. > > >> 17:48:21.969 LOG: standby "Sby3" is now a synchronous standby with priority 3 >> 17:48:23.087 LOG: standby "Sby2" is now a synchronous standby with priority 2 >> 17:48:25.617 LOG: standby "Sby1" is now a synchronous standby with priority 1 >> 17:48:31.990 LOG: standby "Sby3" is now a potential synchronous standby with priority 3 >> 17:48:43.905 LOG: standby "Sby3" is now a synchronous standby with priority 3 >> 17:49:10.262 LOG: standby "Sby1" is now a synchronous standby with priority 1 >> 17:49:13.865 LOG: standby "Sby3" is now a potential synchronous standby with priority 3 > > Since this status check is taken place for every reply from > stanbys, the message of downgrading to "potential" may be > diferred or even fail to occur but it should be no problem. > > Applying the both of the above patches, the message would be like > the following. 
>
>> 17:54:08.367 LOG: standby "Sby3" is now a synchronous standby with priority 3
>> 17:54:08.564 LOG: standby "Sby1" is now a synchronous standby with priority 1
>> 17:54:08.565 LOG: standby "Sby2" is now a synchronous standby with priority 2
>> 17:54:18.387 LOG: standby "Sby3" is now a potential synchronous standby with priority 3
>> 17:54:28.887 LOG: synchronous standby "Sby1" with priority 1 exited
>> 17:54:31.359 LOG: standby "Sby3" is now a synchronous standby with priority 3
>> 17:54:39.008 LOG: standby "Sby1" is now a synchronous standby with priority 1
>> 17:54:41.382 LOG: standby "Sby3" is now a potential synchronous standby with priority 3
>
> Does this make sense?
>
> By the way, Sawada-san, you have changed the parentheses for the
> priority method from '[]' to '()'. And I mistakenly defined
> s_s_names as '2[Sby1, Sby2, Sby3]' and got wrong behavior, that
> is, only Sby2 is registered as a mandatory synchronous standby.
>
> For this case, the three members of SyncRepConfig are '2[Sby1,',
> 'Sby2', 'Sby3]'. This syntax is valid for the current
> specification but will surely get a different meaning with future
> changes. We should refuse this known-to-be-wrong-in-future syntax
> from now.
>

I have no objection to the current version of the patch.

But one optimization idea I came up with is to return false, before the calculation of the lowest LSN from the sync standbys, if MyWalSnd is not listed in sync_standbys. For example, in SyncRepGetOldestSyncRecPtr():

==
sync_standbys = SyncRepGetSyncStandbys();

if (list_length(sync_standbys) < SyncRepConfig->num_sync)
{
    (snip)
}

/* Here, if MyWalSnd is not listed in sync_standbys, quick exit. */
if (!list_member_int(sync_standbys, MyWalSnd->slotno))
    return false;

foreach(cell, sync_standbys)
{
    (snip)
}
==

> For this case, the three members of SyncRepConfig are '2[Sby1,',
> 'Sby2', 'Sby3]'. This syntax is valid for the current
> specification but will surely get a different meaning with future
> changes. We should refuse this known-to-be-wrong-in-future syntax
> from now.

I couldn't get your point. Why will the above syntax get a meaning different from the current one through a future change? I thought that another method would use another kind of parentheses.

Regards,

--
Masahiko Sawada
On 4 April 2016 at 09:28, Fujii Masao <masao.fujii@gmail.com> wrote:
> Barring any objections, I'll commit this patch.

That sounds good.

May I have one more day to review this? Actually more like 3-4 hours.
I am in favour of committing something for 9.6, though I do have some objective comments
1. Header comments in syncrep.c need changes, not just additions.
2. We need tests to ensure that k >=1 and k<=N
3. There should be a WARNING if k == N to say that we don't yet provide a level to give Apply consistency (I mean if we specify 2 (n1, n2) or 3 (n1, n2, n3), etc.)
4. How does it work?
It's pretty strange, but that isn't documented anywhere. It took me a while to figure it out even though I know that code. My thought is that it's a lot slower than before, which is a concern when we know by definition that k >= 2 for the new feature. I was going to mention the fact that this code only needs to be executed by standbys mentioned in s_s_n, so we can avoid overhead and contention for async standbys (but Masahiko just mentioned that as well).
5. Timing – k > 1 will be slower by definition and more complex to configure, yet there is no timing facility to measure the effect of this, even though we have a new timing facility in 9.6. It would be useful to have a track_syncrep option to keep track of typical response times from nodes.
6. Meaning of k (n1, n2, n3) with N servers
It's clearly documented that this means k replies IN SEQUENCE. I believe the typical meaning would be “any k out of N”, which would be faster than what we have, e.g.
3 (n1, n2, n3) would release as soon as (n1, n2) or (n2, n3) or (n1, n3) acknowledge.
The “any k” option is not currently possible, but could be fairly easily. The syntax should also be easily extensible.
I would call what we have now “first” semantics, and we could have both of these...
* first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
* any k (n1, n2, n3) – would release waiters as soon as we have the responses from k out of N standbys. “any k” would be faster, so is desirable for performance and resilience
So I am suggesting we put an extra keyword in front of the “k”, to explain how the k responses should be gathered, as an extension to the syntax. I also think implementing “any k” is actually fairly trivial and could be done for 9.6 (rather than just "first k").
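The difference between the two release rules can be sketched over the standbys' acknowledged LSNs (plain integers standing in for XLogRecPtr values here; this is an illustration of the semantics, not PostgreSQL code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t lsn_t;

/* "first k": wait for the k highest-priority standbys (array order =
 * priority order); commits can be released only up to the minimum LSN
 * acknowledged by exactly those k standbys. */
static lsn_t
release_lsn_first_k(const lsn_t *ack, int n, int k)
{
    lsn_t min = ack[0];
    for (int i = 1; i < k && i < n; i++)
        if (ack[i] < min)
            min = ack[i];
    return min;
}

static int
cmp_desc(const void *a, const void *b)
{
    lsn_t x = *(const lsn_t *) a, y = *(const lsn_t *) b;
    return (x < y) - (x > y);
}

/* "any k": release up to the k-th largest acknowledged LSN, from
 * whichever standbys happen to provide it.  Sorts the input array
 * in place; fine for a sketch. */
static lsn_t
release_lsn_any_k(lsn_t *ack, int n, int k)
{
    qsort(ack, n, sizeof(lsn_t), cmp_desc);
    return ack[k - 1];
}
```

With acks {500, 300, 400} for (n1, n2, n3) and k = 2, "first k" is held back to 300 by the lagging second-priority standby n2, while "any k" releases up to 400 because n1 and n3 together already satisfy the quorum. That is the performance and resilience argument for "any k" in a nutshell.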
Future thoughts that relate to syntax choices now, not for 9.6
Eventually I would want to be able to specify this…
2 ( any (london1, london2), any (nyc1, nyc2))
meaning I want a response from at least 1 London server and at least one NYC server, but whichever one responds first doesn't matter.
And I also want to be able to specify node groups in there. So elsewhere we would specify London node group as (London1, London2) and NYC node group as (NYC1, NYC2) and then specify
any 2 (London, NYC, Tokyo).
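Evaluating such nested groups reduces to a small recursive check. The structure below is a hypothetical illustration of the proposed semantics only, not a suggested parser or catalog representation:

```c
#include <assert.h>

/* A node is either a leaf standby (acked or not) or a group that is
 * satisfied when at least k of its children are satisfied.  This mirrors
 * "2 ( any (london1, london2), any (nyc1, nyc2) )": the root needs both
 * groups, each group needs any one member. */
typedef struct node
{
    int         is_leaf;
    int         acked;          /* leaf: has this standby acknowledged?  */
    int         k;              /* group: how many children must pass    */
    int         nchildren;
    const struct node *children[4];
} node;

static int
satisfied(const node *n)
{
    int ok = 0;

    if (n->is_leaf)
        return n->acked;
    for (int i = 0; i < n->nchildren; i++)
        ok += satisfied(n->children[i]);
    return ok >= n->k;
}
```

For example, with london1 and nyc2 acknowledged, the root "2 of (any-london, any-nyc)" is satisfied; lose both NYC acks and it is not, regardless of how many London servers have replied.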
On 2016/04/05 16:35, Simon Riggs wrote:
> 6. Meaning of k (n1, n2, n3) with N servers
>
> It's clearly documented that this means k replies IN SEQUENCE. I believe
> the typical meaning would be “any k out of N”, which would be faster
> than what we have, e.g.
> 3 (n1, n2, n3) would release as soon as (n1, n2) or (n2, n3) or (n1, n3)
> acknowledge.
>
> The “any k” option is not currently possible, but could be fairly easily.
> The syntax should also be easily extensible.
>
> I would call what we have now “first” semantics, and we could have both of
> these...
>
> * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
> * any k (n1, n2, n3) – would release waiters as soon as we have the
> responses from k out of N standbys. “any k” would be faster, so is
> desirable for performance and resilience
>
> So I am suggesting we put an extra keyword in front of the “k”, to
> explain how the k responses should be gathered, as an extension to the
> syntax. I also think implementing “any k” is actually fairly trivial and
> could be done for 9.6 (rather than just "first k").

+1 for 'first/any k (...)', with possibly only 'first' supported for now, if the 'any' case is more involved than we would like to spend time on, given the time considerations. IMHO, the extra keyword adds to clarity of the syntax.

Thanks,
Amit
On Mon, Apr 4, 2016 at 5:59 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote: > At 2016-04-04 17:28:07 +0900, masao.fujii@gmail.com wrote: >> >> Barring any objections, I'll commit this patch. > > No objections, just a minor wording tweak: > > doc/src/sgml/config.sgml: > > "The synchronous standbys will be the standbys that their names appear > early in this list" should be "The synchronous standbys will be those > whose names appear earlier in this list". > > doc/src/sgml/high-availability.sgml: > > "The standbys that their names appear early in this list are given > higher priority and will be considered as synchronous" should be "The > standbys whose names appear earlier in the list are given higher > priority and will be considered as synchronous". > > "The standbys that their names appear early in the list will be used as > the synchronous standby" should be "The standbys whose names appear > earlier in the list will be used as synchronous standbys". > > You may prefer to reword this in some other way, but the current "that > their names appear" wording should be changed. Thanks for the review! Will apply these comments to new patch. Regards, -- Fujii Masao
On Mon, Apr 4, 2016 at 10:00 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Mon, Apr 4, 2016 at 6:03 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> Hello, thank you for testing. >> >> At Sat, 2 Apr 2016 14:20:55 +1300, Thomas Munro <thomas.munro@enterprisedb.com> wrote in <CAEepm=2sdDL2hs3XbWb5FORegNHBObLJ-8C2=aaeG-riZTd2Rw@mail.gmail.com> >>> >>> Attached latest patch incorporate some review comments so far, and is >>> >>> rebased against current HEAD. >>> >>> >>> >> >>> >> Sorry I attached wrong patch. >>> >> Attached patch is correct patch. >>> >> >>> >> [mulit_sync_replication_v21.patch] >>> > >>> > Here are some TPS numbers from some quick tests I ran on a set of >>> > Amazon EC2 m3.large instances ("2 vCPU" virtual machines) configured >>> > as primary + 3 standbys, to try out different combinations of >>> > synchronous_commit levels and synchronous_standby_names numbers. They >>> > were run for a short time only and these are of course systems with >>> > limited and perhaps uneven IO and CPU, but they still give some idea >>> > of the trends. And reassuringly, the trends are travelling in the >>> > expected directions. >>> > >>> > All default settings except shared_buffers = 1GB, and the GUCs >>> > required for replication. 
>>> > >>> > pgbench postgres -j2 -c2 -N bench2 -T 600 >>> > >>> > 1(*) 2(*) 3(*) >>> > ==== ==== ==== >>> > off = 4056 4096 4092 >>> > local = 1323 1299 1312 >>> > remote_write = 1130 1046 958 >>> > on = 860 744 701 >>> > remote_apply = 785 725 604 >>> > >>> > pgbench postgres -j16 -c16 -N bench2 -T 600 >>> > >>> > 1(*) 2(*) 3(*) >>> > ==== ==== ==== >>> > off = 3952 3943 3933 >>> > local = 2964 2984 3026 >>> > remote_write = 2790 2724 2675 >>> > on = 2731 2627 2523 >>> > remote_apply = 2627 2501 2432 >>> > >>> > One thing I noticed is that there are LOG messages telling me when a >>> > standby becomes a synchronous standby, but nothing to tell me if a >>> > standby stops being a standby (ie because a higher priority one has >>> > taken its place in the quorum). Would that be interesting? >> >> A walsender exits by proc_exit() for any operational >> termination so wrapping proc_exit() should work. (Attached file 1) >> >> For the setting "2(Sby1, Sby2, Sby3)", the master says that all >> of the standbys are sync-standbys and no message is emited on >> failure of Sby1, which should cause a promotion of Sby3. >> >>> standby "Sby3" is now the synchronous standby with priority 3 >>> standby "Sby2" is now the synchronous standby with priority 2 >>> standby "Sby1" is now the synchronous standby with priority 1 >> ..<Sby 1 failure> >>> standby "Sby3" is now the synchronous standby with priority 3 >> >> Sby3 becomes sync standby twice:p >> >> This was a behavior taken over from the single-sync-rep era but >> it should be confusing for the new sync-rep selection mechanism. >> The second attached diff makes this as the following. 
>> >> >>> 17:48:21.969 LOG: standby "Sby3" is now a synchronous standby with priority 3 >>> 17:48:23.087 LOG: standby "Sby2" is now a synchronous standby with priority 2 >>> 17:48:25.617 LOG: standby "Sby1" is now a synchronous standby with priority 1 >>> 17:48:31.990 LOG: standby "Sby3" is now a potential synchronous standby with priority 3 >>> 17:48:43.905 LOG: standby "Sby3" is now a synchronous standby with priority 3 >>> 17:49:10.262 LOG: standby "Sby1" is now a synchronous standby with priority 1 >>> 17:49:13.865 LOG: standby "Sby3" is now a potential synchronous standby with priority 3 >> >> Since this status check is taken place for every reply from >> stanbys, the message of downgrading to "potential" may be >> diferred or even fail to occur but it should be no problem. >> >> Applying the both of the above patches, the message would be like >> the following. >> >>> 17:54:08.367 LOG: standby "Sby3" is now a synchronous standby with priority 3 >>> 17:54:08.564 LOG: standby "Sby1" is now a synchronous standby with priority 1 >>> 17:54:08.565 LOG: standby "Sby2" is now a synchronous standby with priority 2 >>> 17:54:18.387 LOG: standby "Sby3" is now a potential synchronous standby with priority 3 >>> 17:54:28.887 LOG: synchronous standby "Sby1" with priority 1 exited >>> 17:54:31.359 LOG: standby "Sby3" is now a synchronous standby with priority 3 >>> 17:54:39.008 LOG: standby "Sby1" is now a synchronous standby with priority 1 >>> 17:54:41.382 LOG: standby "Sby3" is now a potential synchronous standby with priority 3 >> >> Does this make sense? >> >> By the way, Sawada-san, you have changed the parentheses for the >> priority method from '[]' to '()'. And I mistankenly defined >> s_s_names as '2[Sby1, Sby2, Sby3]' and got wrong behavior, that >> is, only Sby2 is registed as mandatory synchronous standby. >> >> For this case, the tree members of SyncRepConfig are '2[Sby1,', >> 'Sby2', "Sby3]'. 
>> This syntax is valid for the current
>> specification but will surely get a different meaning with future
>> changes. We should refuse this known-to-be-wrong-in-future syntax
>> from now.
>
> I have no objection to the current version of the patch.
> But one optimization idea I came up with is to return false, before
> the calculation of the lowest LSN from the sync standbys, if MyWalSnd
> is not listed in sync_standbys.
> For example, in SyncRepGetOldestSyncRecPtr():
>
> ==
> sync_standbys = SyncRepGetSyncStandbys();
>
> if (list_length(sync_standbys) < SyncRepConfig->num_sync)
> {
>     (snip)
> }
>
> /* Here, if MyWalSnd is not listed in sync_standbys, quick exit. */
> if (!list_member_int(sync_standbys, MyWalSnd->slotno))
>     return false;

list_member_int() performs the loop internally, so I'm not sure how much adding an extra list_member_int() here can optimize this processing.

Another idea is to make SyncRepGetSyncStandbys() check whether I'm a sync standby or not. With this idea, we can exit earlier in the case where I'm not a sync standby, without adding an extra loop. Does this make sense?

Regards,

--
Fujii Masao
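Fujii's suggestion, having the selection pass itself report whether the calling walsender made the cut, can be sketched like this (plain arrays stand in for the walsender shmem slots; all names here are hypothetical, not the actual SyncRepGetSyncStandbys() signature):

```c
#include <assert.h>

/* Pick the num_sync best-priority slots (priority 0 = not listed in
 * synchronous_standby_names; ties broken by slot order) and, in the same
 * pass, note whether "my" slot made the cut, so the caller need not
 * rescan the result with list_member_int(). */
static int
get_sync_standbys(const int *priority, int nslots, int num_sync,
                  int my_slot, int *out, int *am_sync)
{
    int n = 0, maxpri = 0;

    *am_sync = 0;
    for (int i = 0; i < nslots; i++)
        if (priority[i] > maxpri)
            maxpri = priority[i];

    for (int want = 1; want <= maxpri && n < num_sync; want++)
        for (int i = 0; i < nslots && n < num_sync; i++)
            if (priority[i] == want)
            {
                out[n++] = i;
                if (i == my_slot)
                    *am_sync = 1;       /* noted during the same pass */
            }
    return n;                           /* number of sync slots found */
}
```

The quadratic scan over priorities is just for brevity; the point is only that membership of the caller falls out of the selection loop for free, which is the "without adding an extra loop" part of the suggestion.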
On Mon, Apr 4, 2016 at 6:45 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-04-04 10:35:34 +0100, Simon Riggs wrote: >> On 4 April 2016 at 09:28, Fujii Masao <masao.fujii@gmail.com> wrote: >> > Barring any objections, I'll commit this patch. > > No objection here either, just one question: Has anybody thought about > the ability to extend this to do per-database syncrep? Nope at least for me... You'd like to extend synchronous_standby_names so that users can specify that per-database? Regards, -- Fujii Masao
On 2016-04-05 10:13:50 +0100, Simon Riggs wrote: > The lack of per-database settings is not a blocker for me. Just to clarify: Neither is it for me.
On Tue, Apr 5, 2016 at 4:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Apr 4, 2016 at 1:58 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>> Thanks for updating the patch!
>>
>> I applied the following changes to the patch.
>> Attached is the revised version of the patch.
>>
>
> 1.
> {
>     {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
>         gettext_noop("List of names of potential synchronous standbys."),
>         NULL,
>         GUC_LIST_INPUT
>     },
>     &SyncRepStandbyNames,
>     "",
>     check_synchronous_standby_names, NULL, NULL
> },
>
> Isn't it better to modify the description of synchronous_standby_names in
> guc.c based on new usage?

What about "Number of synchronous standbys and list of names of potential synchronous ones"? Better idea?

> 2.
> pg_stat_get_wal_senders()
> {
> ..
>     /*
> !     * Allocate and update the config data of synchronous replication,
> !     * and then get the currently active synchronous standbys.
>      */
> +    SyncRepUpdateConfig();
>     LWLockAcquire(SyncRepLock, LW_SHARED);
> !    sync_standbys = SyncRepGetSyncStandbys();
>     LWLockRelease(SyncRepLock);
> ..
> }
>
> Why is it important to update the config with patch? Earlier also any
> update to config between calls wouldn't have been visible.

Because a backend has no chance to call SyncRepUpdateConfig() and parse the latest value of s_s_names if SyncRepUpdateConfig() is not called here. This means that pg_stat_replication may return the information based on the old value of s_s_names.

> 3.
> <title>Planning for High Availability</title>
>
> <para>
> !  <varname>synchronous_standby_names</> specifies the number of
> !  synchronous standbys that transaction commits made when
>
> Is it better to say like: <varname>synchronous_standby_names</> specifies
> the number and names of

Precisely speaking, s_s_names specifies a list of names of potential sync standbys, not sync ones.

> 4.
> + /*
> +  * Return the list of sync standbys, or NIL if no sync standby is connected.
> +  *
> +  * If there are multiple standbys with the same priority,
> +  * the first one found is considered as higher priority.
>
> Here line indentation of second line can be improved.

What about "the first one found is selected first"? Or better idea?

> ! /*
> !  * syncrep_yyparse sets the global syncrep_parse_result as side effect.
> !  * But this function is required to just check, so frees it
> !  * once parsing parameter.
> !  */
> ! SyncRepFreeConfig(syncrep_parse_result);
>
> How about below change in comment?
> /so frees it once parsing parameter/so frees it after parsing the parameter

Will apply this to the patch. Thanks for the review!

Regards,

--
Fujii Masao
At Tue, 5 Apr 2016 18:08:20 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwG+DM=LCctG6q_Uxkgk17CbLKrHBwtPfUN3+Hu9QbvNuQ@mail.gmail.com> > On Mon, Apr 4, 2016 at 10:00 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Apr 4, 2016 at 6:03 PM, Kyotaro HORIGUCHI > > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > >> Hello, thank you for testing. > >> > >> At Sat, 2 Apr 2016 14:20:55 +1300, Thomas Munro <thomas.munro@enterprisedb.com> wrote in <CAEepm=2sdDL2hs3XbWb5FORegNHBObLJ-8C2=aaeG-riZTd2Rw@mail.gmail.com> > >>> > One thing I noticed is that there are LOG messages telling me when a > >>> > standby becomes a synchronous standby, but nothing to tell me if a > >>> > standby stops being a standby (ie because a higher priority one has > >>> > taken its place in the quorum). Would that be interesting? > >> > >> A walsender exits by proc_exit() for any operational > >> termination so wrapping proc_exit() should work. (Attached file 1) > >> > >> For the setting "2(Sby1, Sby2, Sby3)", the master says that all > >> of the standbys are sync-standbys and no message is emited on > >> failure of Sby1, which should cause a promotion of Sby3. > >> > >>> standby "Sby3" is now the synchronous standby with priority 3 > >>> standby "Sby2" is now the synchronous standby with priority 2 > >>> standby "Sby1" is now the synchronous standby with priority 1 > >> ..<Sby 1 failure> > >>> standby "Sby3" is now the synchronous standby with priority 3 > >> > >> Sby3 becomes sync standby twice:p > >> > >> This was a behavior taken over from the single-sync-rep era but > >> it should be confusing for the new sync-rep selection mechanism. > >> The second attached diff makes this as the following. ... > >> Applying the both of the above patches, the message would be like > >> the following. 
> >> > >>> 17:54:08.367 LOG: standby "Sby3" is now a synchronous standby with priority 3 > >>> 17:54:08.564 LOG: standby "Sby1" is now a synchronous standby with priority 1 > >>> 17:54:08.565 LOG: standby "Sby2" is now a synchronous standby with priority 2 > >>> 17:54:18.387 LOG: standby "Sby3" is now a potential synchronous standby with priority 3 > >>> 17:54:28.887 LOG: synchronous standby "Sby1" with priority 1 exited > >>> 17:54:31.359 LOG: standby "Sby3" is now a synchronous standby with priority 3 > >>> 17:54:39.008 LOG: standby "Sby1" is now a synchronous standby with priority 1 > >>> 17:54:41.382 LOG: standby "Sby3" is now a potential synchronous standby with priority 3 > >> > >> Does this make sense? > >> > >> By the way, Sawada-san, you have changed the parentheses for the > >> priority method from '[]' to '()'. And I mistankenly defined > >> s_s_names as '2[Sby1, Sby2, Sby3]' and got wrong behavior, that > >> is, only Sby2 is registed as mandatory synchronous standby. > >> > >> For this case, the tree members of SyncRepConfig are '2[Sby1,', > >> 'Sby2', "Sby3]'. This syntax is valid for the current > >> specification but will surely get different meaning by the future > >> changes. We should refuse this known-to-be-wrong-in-future syntax > >> from now. > >> > > > > I have no objection about current version patch. > > But one optimise idea I came up with is to return false before > > calculation of lowest LSN from sync standby if MyWalSnd is not listed > > in sync_standby. > > For example in SyncRepGetOldestSyncRecPtr(), > > > > == > > sync_standby = SyncRepGetSyncStandbys(); > > > > if (list_length(sync_standbys) <SyncRepConfig->num_sync() > > { > > (snip) > > } > > > > /* Here if MyWalSnd is not listed in sync_standby, quick exit. */ > > if (list_member_int(sync_standbys, MyWalSnd->slotno)) > > return false; > > list_member_int() performs the loop internally. So I'm not sure how much > adding extra list_member_int() here can optimize this processing. 
> Another idea is to make SyncRepGetSyncStandby() check whether I'm sync > standby or not. In this idea, without adding extra loop, we can exit earilier > in the case where I'm not a sync standby. Does this make sense? The list_member_int() is also performed in the "(snip)" part, so having SyncRepGetSyncStandbys() return am_sync seems to make sense.

sync_standbys = SyncRepGetSyncStandbys(am_sync);

/*
 * Quick exit if I am not synchronous or there are not
 * enough synchronous standbys.
 */
if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync)
{
    list_free(sync_standbys);
    return false;
}

regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Apr 5, 2016 at 4:35 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 4 April 2016 at 10:35, Simon Riggs <simon@2ndquadrant.com> wrote: >> >> On 4 April 2016 at 09:28, Fujii Masao <masao.fujii@gmail.com> wrote: >> >>> >>> Barring any objections, I'll commit this patch. >> >> >> That sounds good. >> >> May I have one more day to review this? Actually more like 3-4 hours. > > > What we have here is useful and elegant. I love the simplicity and backwards > compatibility of the design. Very nice, chef. > > I am in favour of committing something for 9.6, though I do have some > objective comments Thanks for the review! > 1. Header comments in syncrep.c need changes, not just additions. Okay, I'll consider this later. And I'd appreciate it if you could elaborate on what changes are specifically necessary. > 2. We need tests to ensure that k >=1 and k<=N The changes to the replication test framework were included in the patch before, but I excluded them because I'd like to commit the core part of the patch first. Will review the test part later. > > 3. There should be a WARNING if k == N to say that we don't yet provide a > level to give Apply consistency. (I mean if we specify 2 (n1, n2) or 3(n1, > n2, n3) etc Sorry, I failed to get your point. Could you tell me what Apply consistency is and why we cannot provide it when k = N? > 4. How does it work? > It's pretty strange, but that isn't documented anywhere. It took me a while > to figure it out even though I know that code. My thought is its a lot > slower than before, which is a concern when we know by definition that k >=2 > for the new feature. I was going to mention the fact that this code only > needs to be executed by standbys mentioned in s_s_n, so we can avoid > overhead and contention for async standbys (But Masahiko just mentioned that > also). Unless I'm missing something, the patch already avoids the overhead of async standbys. Please see the top of SyncRepReleaseWaiters(). 
Since async standbys exit at the beginning of SyncRepReleaseWaiters(), they don't need to perform any operations that the patch adds (e.g., find out which standbys are synchronous). > 5. Timing – k > 1 will be slower by definition and more complex to > configure, yet there is no timing facility to measure the effect of this, > even though we have a new timing facility in 9.6. It would be useful to have > a track_syncrep option to keep track of typical response times from nodes. Maybe it's useful. But it seems like a completely new feature, so I'm not sure whether we have enough time to push it into 9.6. Probably it's for 9.7. > 6. Meaning of k (n1, n2, n3) with N servers > > It's clearly documented that this means k replies IN SEQUENCE. I believe the > typical meaning of would be “any k out of N”, which would be faster than > what we have, e.g. > 3 (n1, n2, n3) would release as soon as (n1, n2) or (n2, n3) or (n1, n3) > acknowledge. > > The “any k” option is not currently possible, but could be fairly easily. > The syntax should also be easily extensible. > > I would call what we have now “first” semantics, and we could have both of > these... > > * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now > * any k (n1, n2, n3) – would release waiters as soon as we have the > responses from k out of N standbys. “any k” would be faster, so is desirable > for performance and resilience We discussed the syntax for a very long time, so restarting the discussion and keeping the patch uncommitted is not good. We might fail to commit anything about N-sync rep in 9.6. So let's commit the current patch first and restart the discussion later. Regards, -- Fujii Masao
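The difference between the two semantics Simon describes can be sketched with a toy model (plain integers stand in for LSNs; this is an illustration of the proposal, not PostgreSQL source): "first k" waits for the k highest-priority standbys, while "any k" releases as soon as any k of the listed standbys have acknowledged.

```python
# Toy model of the two proposed semantics for 'k (n1, n2, n3)'.

def first_k_release_lsn(k, flush_lsns):
    """flush_lsns is ordered by priority (n1 first). Commit waiters can
    be released up to the oldest flush LSN among the first k standbys."""
    return min(flush_lsns[:k])

def any_k_release_lsn(k, flush_lsns):
    """Commit waiters can be released up to the k-th newest flush LSN
    among all listed standbys, whichever standbys those happen to be."""
    return sorted(flush_lsns, reverse=True)[k - 1]

# n1 lags (LSN 100) while n2 and n3 are ahead (300, 250):
lsns = [100, 300, 250]
print(first_k_release_lsn(2, lsns))  # 100: gated on the lagging n1
print(any_k_release_lsn(2, lsns))    # 250: n2 and n3 suffice, waiters freed sooner
```

This is why "any k" is both faster and less sensitive to a single slow standby: a network hiccup on one member is simply absorbed by the others.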
At Mon, 4 Apr 2016 22:00:24 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDoq1ubY4KkKhrA9jzaVXekwAT7gV5pQJbS+wj98b9-3A@mail.gmail.com> > > For this case, the tree members of SyncRepConfig are '2[Sby1,', > > 'Sby2', "Sby3]'. This syntax is valid for the current > > specification but will surely get different meaning by the future > > changes. We should refuse this known-to-be-wrong-in-future syntax > > from now. > > I couldn't get your point but why will the above syntax meaning be > different from current meaning by future change? > I thought that another method uses another kind of parentheses. If the 'another kind of parentheses' is a pair of brackets, an application_name 'tokyo[A]', for example, is currently allowed to occur unquoted in the list but will become disallowed by the syntax change. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
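Horiguchi-san's observation is easy to reproduce with a naive comma split, which is roughly what a flat list splitter (in the spirit of SplitIdentifierString()) sees before any syntax-aware tokenization. A toy illustration, not the server's parser:

```python
# An unquoted name containing list punctuation is split blindly, so
# '2[Sby1, Sby2, Sby3]' yields three bogus "names" rather than a
# count plus a bracketed group -- exactly the '2[Sby1,', 'Sby2',
# 'Sby3]' members described above.
raw = '2[Sby1, Sby2, Sby3]'
members = [s.strip() for s in raw.split(',')]
print(members)  # ['2[Sby1', 'Sby2', 'Sby3]']
```

The same ambiguity would bite any unquoted application_name that contains the list's own syntax characters, such as 'tokyo[A]'.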
On Tue, Apr 5, 2016 at 6:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 5 April 2016 at 08:58, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> > wrote: > >> >> >>>> So I am suggesting we put an extra keyword in front of the “k”, to >> > explain how the k responses should be gathered as an extension to the >> > the >> > syntax. I also think implementing “any k” is actually fairly trivial and >> > could be done for 9.6 (rather than just "first k"). >> >> +1 for 'first/any k (...)', with possibly only 'first' supported for now, >> if the 'any' case is more involved than we would like to spend time on, >> given the time considerations. IMHO, the extra keyword adds to clarity of >> the syntax. > > > Further thoughts: > > I said "any k" was faster, though what I mean is both faster and more > robust. If you have network peaks from any of the k sync standbys then the > user will wait longer. With "any k", if a network peak occurs, then another > standby response will work just as well. So the performance of "any k" will > be both faster, more consistent and less prone to misconfiguration. > > I also didn't explain why I think it is easy to implement "any k". > > All we need to do is change SyncRepGetOldestSyncRecPtr() so that it returns > the k'th oldest pointer of any named standby. s/oldest/newest ? > Then use that to wake up user > backends. So the change requires only slightly modified logic in a very > isolated part of the code, almost all of which would be code inserts to cope > with the new option. Yes. Probably we need to spend some time finding which algorithm is best for searching for the k'th newest pointer. > The syntax and doc changes would take a couple of > hours. Yes, the documentation updates would need more time. Regards, -- Fujii Masao
On Tue, Apr 5, 2016 at 7:17 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Tue, 5 Apr 2016 18:08:20 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwG+DM=LCctG6q_Uxkgk17CbLKrHBwtPfUN3+Hu9QbvNuQ@mail.gmail.com> >> On Mon, Apr 4, 2016 at 10:00 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> > On Mon, Apr 4, 2016 at 6:03 PM, Kyotaro HORIGUCHI >> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> >> Hello, thank you for testing. >> >> >> >> At Sat, 2 Apr 2016 14:20:55 +1300, Thomas Munro <thomas.munro@enterprisedb.com> wrote in <CAEepm=2sdDL2hs3XbWb5FORegNHBObLJ-8C2=aaeG-riZTd2Rw@mail.gmail.com> >> >>> > One thing I noticed is that there are LOG messages telling me when a >> >>> > standby becomes a synchronous standby, but nothing to tell me if a >> >>> > standby stops being a standby (ie because a higher priority one has >> >>> > taken its place in the quorum). Would that be interesting? >> >> >> >> A walsender exits by proc_exit() for any operational >> >> termination so wrapping proc_exit() should work. (Attached file 1) >> >> >> >> For the setting "2(Sby1, Sby2, Sby3)", the master says that all >> >> of the standbys are sync-standbys and no message is emited on >> >> failure of Sby1, which should cause a promotion of Sby3. >> >> >> >>> standby "Sby3" is now the synchronous standby with priority 3 >> >>> standby "Sby2" is now the synchronous standby with priority 2 >> >>> standby "Sby1" is now the synchronous standby with priority 1 >> >> ..<Sby 1 failure> >> >>> standby "Sby3" is now the synchronous standby with priority 3 >> >> >> >> Sby3 becomes sync standby twice:p >> >> >> >> This was a behavior taken over from the single-sync-rep era but >> >> it should be confusing for the new sync-rep selection mechanism. >> >> The second attached diff makes this as the following. > ... >> >> Applying the both of the above patches, the message would be like >> >> the following. 
>> >> >> >>> 17:54:08.367 LOG: standby "Sby3" is now a synchronous standby with priority 3 >> >>> 17:54:08.564 LOG: standby "Sby1" is now a synchronous standby with priority 1 >> >>> 17:54:08.565 LOG: standby "Sby2" is now a synchronous standby with priority 2 >> >>> 17:54:18.387 LOG: standby "Sby3" is now a potential synchronous standby with priority 3 >> >>> 17:54:28.887 LOG: synchronous standby "Sby1" with priority 1 exited >> >>> 17:54:31.359 LOG: standby "Sby3" is now a synchronous standby with priority 3 >> >>> 17:54:39.008 LOG: standby "Sby1" is now a synchronous standby with priority 1 >> >>> 17:54:41.382 LOG: standby "Sby3" is now a potential synchronous standby with priority 3 >> >> >> >> Does this make sense? >> >> >> >> By the way, Sawada-san, you have changed the parentheses for the >> >> priority method from '[]' to '()'. And I mistankenly defined >> >> s_s_names as '2[Sby1, Sby2, Sby3]' and got wrong behavior, that >> >> is, only Sby2 is registed as mandatory synchronous standby. >> >> >> >> For this case, the tree members of SyncRepConfig are '2[Sby1,', >> >> 'Sby2', "Sby3]'. This syntax is valid for the current >> >> specification but will surely get different meaning by the future >> >> changes. We should refuse this known-to-be-wrong-in-future syntax >> >> from now. >> >> >> > >> > I have no objection about current version patch. >> > But one optimise idea I came up with is to return false before >> > calculation of lowest LSN from sync standby if MyWalSnd is not listed >> > in sync_standby. >> > For example in SyncRepGetOldestSyncRecPtr(), >> > >> > == >> > sync_standby = SyncRepGetSyncStandbys(); >> > >> > if (list_length(sync_standbys) <SyncRepConfig->num_sync() >> > { >> > (snip) >> > } >> > >> > /* Here if MyWalSnd is not listed in sync_standby, quick exit. */ >> > if (list_member_int(sync_standbys, MyWalSnd->slotno)) >> > return false; >> >> list_member_int() performs the loop internally. 
So I'm not sure how much >> adding extra list_member_int() here can optimize this processing. >> Another idea is to make SyncRepGetSyncStandby() check whether I'm sync >> standby or not. In this idea, without adding extra loop, we can exit earilier >> in the case where I'm not a sync standby. Does this make sense? > > The list_member_int() is also performed in the "(snip)" part. So > SyncRepGetSyncStandbys() returning am_sync seems making sense. > > sync_standbys = SyncRepGetSyncStandbys(am_sync); > > /* > * Quick exit if I am not synchronous or there's not > * enough synchronous standbys > * / > if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync) > { > list_free(sync_standbys); > return false; > } Thanks for the comment! I changed SyncRepGetSyncStandbys() so that it checks whether we're managing a sync standby or not. Attached is the updated version of the patch. I also applied several review comments to the patch. Regards, -- Fujii Masao
Attachment
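The quick-exit pattern settled on above can be modeled in a few lines. This is a hypothetical sketch, not the patch itself, and it reuses the simplified position-as-priority scheme (duplicate names in the GUC list are not handled here): SyncRepGetSyncStandbys() reports through an out-style return value whether the calling walsender is itself synchronous, so the caller can bail out without an extra list scan.

```python
# Sketch of SyncRepGetSyncStandbys() returning am_sync alongside the
# sync list, and of the caller's quick exit (illustrative model only).

def get_sync_standbys(names, num_sync, connected, my_slot):
    """Return (sync_slots, am_sync); priority = 1-based listed position."""
    ranked = sorted((names.index(a) + 1, s)
                    for s, a in enumerate(connected) if a in names)
    sync_slots = [s for _, s in ranked[:num_sync]]
    return sync_slots, my_slot in sync_slots

def release_waiters(names, num_sync, connected, my_slot):
    sync_slots, am_sync = get_sync_standbys(names, num_sync, connected, my_slot)
    # Quick exit if I am not synchronous or there are not enough
    # synchronous standbys (mirrors the check quoted in the thread).
    if not am_sync or len(sync_slots) < num_sync:
        return False
    return True  # ...would go on to compute the oldest sync rec ptr

names = ['Sby1', 'Sby2', 'Sby3']
print(release_waiters(names, 2, ['Sby2', 'Sby1', 'Sby3'], 2))  # False: Sby3 is potential, not sync
print(release_waiters(names, 2, ['Sby2', 'Sby1', 'Sby3'], 1))  # True: Sby1 is sync
```

The point of threading am_sync through the same scan is that no second pass (and no list_member_int()-style lookup) is needed to answer "am I one of the chosen standbys?".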
On Tue, Apr 5, 2016 at 8:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 5 April 2016 at 11:18, Fujii Masao <masao.fujii@gmail.com> wrote: > >> >> > 1. Header comments in syncrep.c need changes, not just additions. >> >> Okay, will consider this later. And I'd appreciate if you elaborate what >> changes are necessary specifically. > > > Some of the old header comments are now wrong. Okay, will check. >> > 2. We need tests to ensure that k >=1 and k<=N >> >> The changes to replication test framework was included in the patch >> before, >> but I excluded it from the patch because I'd like to commit the core part >> of >> the patch first. Will review the test part later. > > > I meant tests of setting the parameters, not tests of the feature itself. k<=0 causes an error while parsing s_s_names in the current patch. Regarding the test of k<=N, you mean that an error should be emitted when k is larger than or equal to the number of standby names in the list? Multiple standbys with the same name may connect to the master. In this case, users might want to specify k<=N. So k<=N does not seem to be an invalid setting. >> > 3. There should be a WARNING if k == N to say that we don't yet provide >> > a >> > level to give Apply consistency. (I mean if we specify 2 (n1, n2) or >> > 3(n1, >> > n2, n3) etc >> >> Sorry I failed to get your point. Could you tell me what Apply consistency >> and why we cannot provide it when k = N? >> >> > 4. How does it work? >> > It's pretty strange, but that isn't documented anywhere. It took me a >> > while >> > to figure it out even though I know that code. My thought is its a lot >> > slower than before, which is a concern when we know by definition that k >> > >=2 >> > for the new feature. I was going to mention the fact that this code only >> > needs to be executed by standbys mentioned in s_s_n, so we can avoid >> > overhead and contention for async standbys (But Masahiko just mentioned >> > that >> > also). 
>> >> Unless I'm missing something, the patch already avoids the overhead >> of async standbys. Please see the top of SyncRepReleaseWaiters(). >> Since async standbys exit at the beginning of SyncRepReleaseWaiters(), >> they don't need to perform any operations that the patch adds >> (e.g., find out which standbys are synchronous). > > > I was thinking about the overhead of scanning through the full list of > WALSenders for each message, when it is a sync standby. This is true even in the current release and before. Regards, -- Fujii Masao
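The parameter checks being debated here (k >= 1 as a hard error, k > N as merely suspicious) can be sketched as follows. This is a hypothetical model, not the actual check_synchronous_standby_names() hook: since several standbys may connect under the same listed name, k exceeding the number of names only deserves a warning, not a rejection.

```python
# Sketch of the k-vs-N validation discussed in the thread (illustrative).
import warnings

def check_num_sync(num_sync, standby_names):
    if num_sync < 1:
        raise ValueError("number of synchronous standbys must be at least 1")
    if num_sync > len(standby_names):
        warnings.warn("num_sync (%d) exceeds the number of listed standby "
                      "names (%d)" % (num_sync, len(standby_names)))
    return True

check_num_sync(2, ['Sby1', 'Sby2', 'Sby3'])   # fine
check_num_sync(4, ['Sby1', 'Sby2', 'Sby3'])   # warns, but accepted
```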
On Mon, Apr 4, 2016 at 4:28 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> + ereport(LOG, >>> + (errmsg("standby \"%s\" is now the synchronous standby with priority %u", >>> + application_name, MyWalSnd->sync_standby_priority))); >>> >>> s/ the / a / > > I have no objection to this change itself. But we have used this message > in 9.5 or before, so if we apply this change, probably we need > back-patching. "the" implies that there can be only one synchronous standby at that priority, while "a" implies that there could be more than one. So the situation might be different with this patch than previously. (I haven't read the patch so I don't know whether this is actually true, but it might be what Thomas was going for.) Also, I'd like to associate myself with the general happiness about the prospect of having this feature in 9.6 (but without specifically endorsing the code, since I have not read it). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Apr 5, 2016 at 7:23 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Mon, 4 Apr 2016 22:00:24 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDoq1ubY4KkKhrA9jzaVXekwAT7gV5pQJbS+wj98b9-3A@mail.gmail.com> >> > For this case, the tree members of SyncRepConfig are '2[Sby1,', >> > 'Sby2', "Sby3]'. This syntax is valid for the current >> > specification but will surely get different meaning by the future >> > changes. We should refuse this known-to-be-wrong-in-future syntax >> > from now. >> >> I couldn't get your point but why will the above syntax meaning be >> different from current meaning by future change? >> I thought that another method uses another kind of parentheses. > > If the 'another kind of parehtheses' is a pair of brackets, an > application_name 'tokyo[A]', for example, is currently allowed to > occur unquoted in the list but will become disallowed by the > syntax change. > > Thank you for explaining. I understood, but since consensus on the future syntax has not yet been reached, I thought it would be difficult to refuse a particular kind of parentheses for now. > > list_member_int() performs the loop internally. So I'm not sure how much > > adding extra list_member_int() here can optimize this processing. > > Another idea is to make SyncRepGetSyncStandby() check whether I'm sync > > standby or not. In this idea, without adding extra loop, we can exit earilier > > in the case where I'm not a sync standby. Does this make sense? > The list_member_int() is also performed in the "(snip)" part. So > SyncRepGetSyncStandbys() returning am_sync seems making sense. > > sync_standbys = SyncRepGetSyncStandbys(am_sync); > > /* > * Quick exit if I am not synchronous or there's not > * enough synchronous standbys > * / > if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync) > { > list_free(sync_standbys); > return false; I meant that it can at least skip acquiring the spinlock, so it will optimize that logic. 
But anyway, I agree with making SyncRepGetSyncStandbys() return the am_sync variable. -- Regards, -- Masahiko Sawada
At Tue, 5 Apr 2016 20:17:21 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwE8_F79BUpC5TmJ7aazXU=Uju0VznFCCKDK57-wNpHV-g@mail.gmail.com> > >> list_member_int() performs the loop internally. So I'm not sure how much > >> adding extra list_member_int() here can optimize this processing. > >> Another idea is to make SyncRepGetSyncStandby() check whether I'm sync > >> standby or not. In this idea, without adding extra loop, we can exit earilier > >> in the case where I'm not a sync standby. Does this make sense? > > > > The list_member_int() is also performed in the "(snip)" part. So > > SyncRepGetSyncStandbys() returning am_sync seems making sense. > > > > sync_standbys = SyncRepGetSyncStandbys(am_sync); > > > > /* > > * Quick exit if I am not synchronous or there's not > > * enough synchronous standbys > > * / > > if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync) > > { > > list_free(sync_standbys); > > return false; > > } > > Thanks for the comment! I changed SyncRepGetSyncStandbys() so that > it checks whether we're managing a sync standby or not. > Attached is the updated version of the patch. I also applied several > review comments to the patch. It still does list_member_int(), but it can be gotten rid of as in the attached patch. 
regards,

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 9b2137a..6998bb8 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -590,6 +590,10 @@ SyncRepGetSyncStandbys(bool *am_sync)
 		if (XLogRecPtrIsInvalid(walsnd->flush))
 			continue;

+		/* Notify myself as 'synchronized' if I am */
+		if (am_sync != NULL && walsnd == MyWalSnd)
+			*am_sync = true;
+
 		/*
 		 * If the priority is equal to 1, consider this standby as sync
 		 * and append it to the result. Otherwise append this standby
@@ -598,8 +602,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
 		if (this_priority == 1)
 		{
 			result = lappend_int(result, i);
-			if (am_sync != NULL && walsnd == MyWalSnd)
-				*am_sync = true;
 			if (list_length(result) == SyncRepConfig->num_sync)
 			{
 				list_free(pending);
@@ -630,9 +632,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	{
 		bool		needfree = (result != NIL && pending != NIL);

-		if (am_sync != NULL && !(*am_sync))
-			*am_sync = list_member_int(pending, MyWalSnd->slotno);
-
 		result = list_concat(result, pending);
 		if (needfree)
 			pfree(pending);
@@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	}

 	/*
+	 * The pending list contains eventually potentially-synchronized standbys
+	 * and this walsender may be one of them. So once reset am_sync.
+	 */
+	if (am_sync != NULL)
+		*am_sync = false;
+
+	/*
 	 * Find the sync standbys from the pending list.
 	 */
 	priority = next_highest_priority;
On Tue, Apr 5, 2016 at 8:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 5 April 2016 at 12:26, Fujii Masao <masao.fujii@gmail.com> wrote: > >> >> Multiple standbys with the same name may connect to the master. >> In this case, users might want to specifiy k<=N. So k<=N seems not invalid >> setting. > > > Confusing as that is, it is already the case; k > N could make sense. ;-( > > However, in most cases, k > N would not make sense and we should issue a > WARNING. Somebody (maybe Horiguchi-san and Sawada-san) commented on this upthread, and the code for that test was included in the old patch (but I excluded it). Now the majority seems to prefer to add that test, so I just revived and revised that test code. Attached is the updated version of the patch. I also addressed Amit's and Robert's comments. Regards, -- Fujii Masao
Attachment
On Tue, Apr 5, 2016 at 11:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Apr 5, 2016 at 3:15 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> >> On Tue, Apr 5, 2016 at 4:31 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > On Mon, Apr 4, 2016 at 1:58 PM, Fujii Masao <masao.fujii@gmail.com> >> > wrote: >> >> >> >> >> >> Thanks for updating the patch! >> >> >> >> I applied the following changes to the patch. >> >> Attached is the revised version of the patch. >> >> >> > >> > 1. >> > { >> > {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER, >> > gettext_noop("List of names of potential synchronous standbys."), >> > NULL, >> > GUC_LIST_INPUT >> > }, >> > &SyncRepStandbyNames, >> > "", >> > check_synchronous_standby_names, NULL, NULL >> > }, >> > >> > Isn't it better to modify the description of synchronous_standby_names >> > in >> > guc.c based on new usage? >> >> What about "Number of synchronous standbys and list of names of >> potential synchronous ones"? Better idea? >> > > Looks good. > >> >> > 2. >> > pg_stat_get_wal_senders() >> > { >> > .. >> > /* >> > ! * Allocate and update the config data of synchronous replication, >> > ! * and then get the currently active synchronous standbys. >> > */ >> > + SyncRepUpdateConfig(); >> > LWLockAcquire(SyncRepLock, LW_SHARED); >> > ! sync_standbys = SyncRepGetSyncStandbys(); >> > LWLockRelease(SyncRepLock); >> > .. >> > } >> > >> > Why is it important to update the config with patch? Earlier also any >> > update to config between calls wouldn't have been visible. >> >> Because a backend has no chance to call SyncRepUpdateConfig() and >> parse the latest value of s_s_names if SyncRepUpdateConfig() is not >> called here. This means that pg_stat_replication may return the >> information >> based on the old value of s_s_names. >> > > Thats right, but without this patch also won't pg_stat_replication can show > old information? If no, why so? 
Without the patch, when s_s_names is changed and SIGHUP is sent, a backend calls ProcessConfigFile(), parses the configuration file and sets the global variable SyncRepStandbyNames to the latest value of s_s_names. When pg_stat_replication is accessed, a backend calculates which standby is synchronous based on that latest value in SyncRepStandbyNames, and then displays the information of sync replication. With the patch, basically the same steps are executed when s_s_names is changed. But the difference is that, with the patch, SyncRepUpdateConfig() must be called after ProcessConfigFile() and before the calculation of sync standbys. So I just added the call of SyncRepUpdateConfig() to pg_stat_get_wal_senders(). BTW, we could move SyncRepUpdateConfig() out of pg_stat_get_wal_senders() to just after ProcessConfigFile(), so that every backend always parses the value of s_s_names when the setting is changed. >> > 3. >> > <title>Planning for High Availability</title> >> > >> > <para> >> > ! <varname>synchronous_standby_names</> specifies the number of >> > ! synchronous standbys that transaction commits made when >> > >> > Is it better to say like: <varname>synchronous_standby_names</> >> > specifies >> > the number and names of >> >> Precisely s_s_names specifies a list of names of potential sync standbys >> not sync ones. >> > > Okay, but you doesn't seem to have updated this in your latest patch. I applied the change you suggested to the patch. Thanks! Regards, -- Fujii Masao
On Tue, Apr 5, 2016 at 11:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Apr 4, 2016 at 4:28 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> + ereport(LOG, >>>> + (errmsg("standby \"%s\" is now the synchronous standby with priority %u", >>>> + application_name, MyWalSnd->sync_standby_priority))); >>>> >>>> s/ the / a / >> >> I have no objection to this change itself. But we have used this message >> in 9.5 or before, so if we apply this change, probably we need >> back-patching. > > "the" implies that there can be only one synchronous standby at that > priority, while "a" implies that there could be more than one. So the > situation might be different with this patch than previously. (I > haven't read the patch so I don't know whether this is actually true, > but it might be what Thomas was going for.) Thanks for the explanation! I applied that change to the latest patch I posted upthread. Regards, -- Fujii Masao
On Wed, Apr 6, 2016 at 2:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Tue, Apr 5, 2016 at 8:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 5 April 2016 at 12:26, Fujii Masao <masao.fujii@gmail.com> wrote: >> >>> >>> Multiple standbys with the same name may connect to the master. >>> In this case, users might want to specifiy k<=N. So k<=N seems not invalid >>> setting. >> >> >> Confusing as that is, it is already the case; k > N could make sense. ;-( >> >> However, in most cases, k > N would not make sense and we should issue a >> WARNING. > > Somebody (maybe Horiguchi-san and Sawada-san) commented this upthread > and the code for that test was included in the old patch (but I excluded it). > Now the majority seems to prefer to add that test, so I just revived and > revised that test code. The regression test code seems not to be included in the latest patch, no? Regards, -- Masahiko Sawada
On Wed, Apr 6, 2016 at 2:51 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Apr 6, 2016 at 2:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Tue, Apr 5, 2016 at 8:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> On 5 April 2016 at 12:26, Fujii Masao <masao.fujii@gmail.com> wrote: >>> >>>> >>>> Multiple standbys with the same name may connect to the master. >>>> In this case, users might want to specifiy k<=N. So k<=N seems not invalid >>>> setting. >>> >>> >>> Confusing as that is, it is already the case; k > N could make sense. ;-( >>> >>> However, in most cases, k > N would not make sense and we should issue a >>> WARNING. >> >> Somebody (maybe Horiguchi-san and Sawada-san) commented this upthread >> and the code for that test was included in the old patch (but I excluded it). >> Now the majority seems to prefer to add that test, so I just revived and >> revised that test code. > > The regression test codes seems not to be included in latest patch, no? I am looking at the latest patch now, and they are not included. It would be good to get those tests bundled in for a last look, I think. -- Michael
On Wed, Apr 6, 2016 at 2:18 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Tue, 5 Apr 2016 20:17:21 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwE8_F79BUpC5TmJ7aazXU=Uju0VznFCCKDK57-wNpHV-g@mail.gmail.com> >> >> list_member_int() performs the loop internally. So I'm not sure how much >> >> adding extra list_member_int() here can optimize this processing. >> >> Another idea is to make SyncRepGetSyncStandby() check whether I'm sync >> >> standby or not. In this idea, without adding extra loop, we can exit earilier >> >> in the case where I'm not a sync standby. Does this make sense? >> > >> > The list_member_int() is also performed in the "(snip)" part. So >> > SyncRepGetSyncStandbys() returning am_sync seems making sense. >> > >> > sync_standbys = SyncRepGetSyncStandbys(am_sync); >> > >> > /* >> > * Quick exit if I am not synchronous or there's not >> > * enough synchronous standbys >> > * / >> > if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync) >> > { >> > list_free(sync_standbys); >> > return false; >> > } >> >> Thanks for the comment! I changed SyncRepGetSyncStandbys() so that >> it checks whether we're managing a sync standby or not. >> Attached is the updated version of the patch. I also applied several >> review comments to the patch. > > It still does list_member_int but it can be gotten rid of as the > attached patch. Thanks for the review! > > regards, > > diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c > index 9b2137a..6998bb8 100644 > --- a/src/backend/replication/syncrep.c > +++ b/src/backend/replication/syncrep.c > @@ -590,6 +590,10 @@ SyncRepGetSyncStandbys(bool *am_sync) > if (XLogRecPtrIsInvalid(walsnd->flush)) > continue; > > + /* Notify myself as 'synchonized' if I am */ > + if (am_sync != NULL && walsnd == MyWalSnd) > + *am_sync = true; > + > /* > * If the priority is equal to 1, consider this standby as sync > * and append it to the result. 
Otherwise append this standby > @@ -598,8 +602,6 @@ SyncRepGetSyncStandbys(bool *am_sync) > if (this_priority == 1) > { > result = lappend_int(result, i); > - if (am_sync != NULL && walsnd == MyWalSnd) > - *am_sync = true; > if (list_length(result) == SyncRepConfig->num_sync) > { > list_free(pending); > @@ -630,9 +632,6 @@ SyncRepGetSyncStandbys(bool *am_sync) > { > bool needfree = (result != NIL && pending != NIL); > > - if (am_sync != NULL && !(*am_sync)) > - *am_sync = list_member_int(pending, MyWalSnd->slotno); > - > result = list_concat(result, pending); > if (needfree) > pfree(pending); > @@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync) > } > > /* > + * The pending list contains eventually potentially-synchronized standbys > + * and this walsender may be one of them. So once reset am_sync. > + */ > + if (am_sync != NULL) > + *am_sync = false; > + > + /* This code seems wrong in the case where this walsender is in the result list. So I adopted another logic. Attached is the updated version of the patch. Regards, -- Fujii Masao
Attachment
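The two-pass selection being debated above (priority-1 walsenders taken first, the rest of the quota filled in ascending priority order, and the walsender reporting whether it itself was chosen) can be sketched in a few dozen lines. This is a minimal, self-contained illustration, not PostgreSQL's actual SyncRepGetSyncStandbys(); all names here are stand-ins:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct
{
    int  priority;              /* 0 means not a sync candidate */
    bool valid;                 /* has reported a valid flush position */
} walsnd_stub;

/* Fill result[] with up to num_sync slot indexes; set *am_sync if the
 * slot "my_slot" was chosen.  Returns the number of chosen standbys. */
static int
pick_sync_standbys(const walsnd_stub *walsnds, int nslots, int num_sync,
                   int my_slot, int result[], bool *am_sync)
{
    int chosen = 0;
    int lowest_priority = 0;

    *am_sync = false;

    /* First pass: priority-1 standbys are synchronous unconditionally. */
    for (int i = 0; i < nslots && chosen < num_sync; i++)
    {
        if (walsnds[i].valid && walsnds[i].priority == 1)
        {
            result[chosen++] = i;
            if (i == my_slot)
                *am_sync = true;
        }
    }

    /* Find the lowest (largest-numbered) candidate priority. */
    for (int i = 0; i < nslots; i++)
        if (walsnds[i].valid && walsnds[i].priority > lowest_priority)
            lowest_priority = walsnds[i].priority;

    /* Second pass: fill remaining slots in ascending priority order. */
    for (int priority = 2;
         priority <= lowest_priority && chosen < num_sync;
         priority++)
    {
        for (int i = 0; i < nslots && chosen < num_sync; i++)
        {
            if (walsnds[i].valid && walsnds[i].priority == priority)
            {
                result[chosen++] = i;
                if (i == my_slot)
                    *am_sync = true;
            }
        }
    }
    return chosen;
}
```

Setting *am_sync inside both passes, at the moment a slot is appended, sidesteps the bug discussed above where resetting the flag after the loops loses the priority-1 case.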
On Wed, Apr 6, 2016 at 2:51 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Apr 6, 2016 at 2:21 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Tue, Apr 5, 2016 at 8:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> On 5 April 2016 at 12:26, Fujii Masao <masao.fujii@gmail.com> wrote: >>> >>>> >>>> Multiple standbys with the same name may connect to the master. >>>> In this case, users might want to specifiy k<=N. So k<=N seems not invalid >>>> setting. >>> >>> >>> Confusing as that is, it is already the case; k > N could make sense. ;-( >>> >>> However, in most cases, k > N would not make sense and we should issue a >>> WARNING. >> >> Somebody (maybe Horiguchi-san and Sawada-san) commented this upthread >> and the code for that test was included in the old patch (but I excluded it). >> Now the majority seems to prefer to add that test, so I just revived and >> revised that test code. > > The regression test codes seems not to be included in latest patch, no? I intentionally excluded the regression test from the patch because I'd like to review and commit it separately from the main part of the feature. I'd appreciate it if you could read through the regression test that was included in the previous patch and update it if required. Regards, -- Fujii Masao
On Wed, Apr 6, 2016 at 3:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Apr 6, 2016 at 2:18 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> At Tue, 5 Apr 2016 20:17:21 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwE8_F79BUpC5TmJ7aazXU=Uju0VznFCCKDK57-wNpHV-g@mail.gmail.com> >>> >> list_member_int() performs the loop internally. So I'm not sure how much >>> >> adding extra list_member_int() here can optimize this processing. >>> >> Another idea is to make SyncRepGetSyncStandby() check whether I'm sync >>> >> standby or not. In this idea, without adding extra loop, we can exit earilier >>> >> in the case where I'm not a sync standby. Does this make sense? >>> > >>> > The list_member_int() is also performed in the "(snip)" part. So >>> > SyncRepGetSyncStandbys() returning am_sync seems making sense. >>> > >>> > sync_standbys = SyncRepGetSyncStandbys(am_sync); >>> > >>> > /* >>> > * Quick exit if I am not synchronous or there's not >>> > * enough synchronous standbys >>> > * / >>> > if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync) >>> > { >>> > list_free(sync_standbys); >>> > return false; >>> > } >>> >>> Thanks for the comment! I changed SyncRepGetSyncStandbys() so that >>> it checks whether we're managing a sync standby or not. >>> Attached is the updated version of the patch. I also applied several >>> review comments to the patch. >> >> It still does list_member_int but it can be gotten rid of as the >> attached patch. > > Thanks for the review! 
> >> >> regards, >> >> diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c >> index 9b2137a..6998bb8 100644 >> --- a/src/backend/replication/syncrep.c >> +++ b/src/backend/replication/syncrep.c >> @@ -590,6 +590,10 @@ SyncRepGetSyncStandbys(bool *am_sync) >> if (XLogRecPtrIsInvalid(walsnd->flush)) >> continue; >> >> + /* Notify myself as 'synchonized' if I am */ >> + if (am_sync != NULL && walsnd == MyWalSnd) >> + *am_sync = true; >> + >> /* >> * If the priority is equal to 1, consider this standby as sync >> * and append it to the result. Otherwise append this standby >> @@ -598,8 +602,6 @@ SyncRepGetSyncStandbys(bool *am_sync) >> if (this_priority == 1) >> { >> result = lappend_int(result, i); >> - if (am_sync != NULL && walsnd == MyWalSnd) >> - *am_sync = true; >> if (list_length(result) == SyncRepConfig->num_sync) >> { >> list_free(pending); >> @@ -630,9 +632,6 @@ SyncRepGetSyncStandbys(bool *am_sync) >> { >> bool needfree = (result != NIL && pending != NIL); >> >> - if (am_sync != NULL && !(*am_sync)) >> - *am_sync = list_member_int(pending, MyWalSnd->slotno); >> - >> result = list_concat(result, pending); >> if (needfree) >> pfree(pending); >> @@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync) >> } >> >> /* >> + * The pending list contains eventually potentially-synchronized standbys >> + * and this walsender may be one of them. So once reset am_sync. >> + */ >> + if (am_sync != NULL) >> + *am_sync = false; >> + >> + /* > > This code seems wrong in the case where this walsender is in the result list. > So I adopted another logic. Attached is the updated version of the patch. To be honest, this is a nice patch that we have here, and it received a fair amount of work. I have been playing with it a bit but I could not break it. Here are few things I have noticed: + for (i = 0; i < max_wal_senders; i++) + { + walsnd = &WalSndCtl->walsnds[i]; No volatile pointer to prevent code reordering? 
*/ typedef struct WalSnd { + int slotno; /* index of this slot in WalSnd array */ pid_t pid; /* this walsender's process id, or 0 */ slotno is used nowhere. I'll grab the tests and look at them. -- Michael
At Wed, 6 Apr 2016 15:29:12 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHGQEwH2c9buiZ=G7Ko8PQYwiU7=NsDkvCjRKUPSN8j7A@mail.gmail.com> > > @@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync) > > } > > > > /* > > + * The pending list contains eventually potentially-synchronized standbys > > + * and this walsender may be one of them. So once reset am_sync. > > + */ > > + if (am_sync != NULL) > > + *am_sync = false; > > + > > + /* > > This code seems wrong in the case where this walsender is in the result list. > So I adopted another logic. Attached is the updated version of the patch. You must have misread the patch. am_sync is originally set, for that case, in the loop just after that. ! while (priority <= lowest_priority) ! { .. ! for (cell = list_head(pending); cell != NULL; cell = next) ! { ... ! if (this_priority == priority) ! { ! result = lappend_int(result, i); ! if (am_sync != NULL && walsnd == MyWalSnd) ! *am_sync = true; -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > Here are few things I have noticed: > + for (i = 0; i < max_wal_senders; i++) > + { > + walsnd = &WalSndCtl->walsnds[i]; > No volatile pointer to prevent code reordering? > > */ > typedef struct WalSnd > { > + int slotno; /* index of this slot in WalSnd array */ > pid_t pid; /* this walsender's process id, or 0 */ > slotno is used nowhere. > > I'll grab the tests and look at them. So I had a look at those tests and finished with the attached:
- patch 1 adds a reload routine to PostgresNode
- patch 2 adds the list of tests
I took the tests from patch 21 and made many tweaks on them:
- use of qq() instead of quotes
- removal of hardcoded newlines
- typo fixes and sanity fixes
- etc.
Regards, -- Michael
Attachment
On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Apr 6, 2016 at 3:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Wed, Apr 6, 2016 at 2:18 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>> At Tue, 5 Apr 2016 20:17:21 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwE8_F79BUpC5TmJ7aazXU=Uju0VznFCCKDK57-wNpHV-g@mail.gmail.com> >>>> >> list_member_int() performs the loop internally. So I'm not sure how much >>>> >> adding extra list_member_int() here can optimize this processing. >>>> >> Another idea is to make SyncRepGetSyncStandby() check whether I'm sync >>>> >> standby or not. In this idea, without adding extra loop, we can exit earilier >>>> >> in the case where I'm not a sync standby. Does this make sense? >>>> > >>>> > The list_member_int() is also performed in the "(snip)" part. So >>>> > SyncRepGetSyncStandbys() returning am_sync seems making sense. >>>> > >>>> > sync_standbys = SyncRepGetSyncStandbys(am_sync); >>>> > >>>> > /* >>>> > * Quick exit if I am not synchronous or there's not >>>> > * enough synchronous standbys >>>> > * / >>>> > if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync) >>>> > { >>>> > list_free(sync_standbys); >>>> > return false; >>>> > } >>>> >>>> Thanks for the comment! I changed SyncRepGetSyncStandbys() so that >>>> it checks whether we're managing a sync standby or not. >>>> Attached is the updated version of the patch. I also applied several >>>> review comments to the patch. >>> >>> It still does list_member_int but it can be gotten rid of as the >>> attached patch. >> >> Thanks for the review! 
>> >>> >>> regards, >>> >>> diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c >>> index 9b2137a..6998bb8 100644 >>> --- a/src/backend/replication/syncrep.c >>> +++ b/src/backend/replication/syncrep.c >>> @@ -590,6 +590,10 @@ SyncRepGetSyncStandbys(bool *am_sync) >>> if (XLogRecPtrIsInvalid(walsnd->flush)) >>> continue; >>> >>> + /* Notify myself as 'synchonized' if I am */ >>> + if (am_sync != NULL && walsnd == MyWalSnd) >>> + *am_sync = true; >>> + >>> /* >>> * If the priority is equal to 1, consider this standby as sync >>> * and append it to the result. Otherwise append this standby >>> @@ -598,8 +602,6 @@ SyncRepGetSyncStandbys(bool *am_sync) >>> if (this_priority == 1) >>> { >>> result = lappend_int(result, i); >>> - if (am_sync != NULL && walsnd == MyWalSnd) >>> - *am_sync = true; >>> if (list_length(result) == SyncRepConfig->num_sync) >>> { >>> list_free(pending); >>> @@ -630,9 +632,6 @@ SyncRepGetSyncStandbys(bool *am_sync) >>> { >>> bool needfree = (result != NIL && pending != NIL); >>> >>> - if (am_sync != NULL && !(*am_sync)) >>> - *am_sync = list_member_int(pending, MyWalSnd->slotno); >>> - >>> result = list_concat(result, pending); >>> if (needfree) >>> pfree(pending); >>> @@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync) >>> } >>> >>> /* >>> + * The pending list contains eventually potentially-synchronized standbys >>> + * and this walsender may be one of them. So once reset am_sync. >>> + */ >>> + if (am_sync != NULL) >>> + *am_sync = false; >>> + >>> + /* >> >> This code seems wrong in the case where this walsender is in the result list. >> So I adopted another logic. Attached is the updated version of the patch. > > To be honest, this is a nice patch that we have here, and it received > a fair amount of work. I have been playing with it a bit but I could > not break it. > > Here are few things I have noticed: Thanks for the review! 
> + for (i = 0; i < max_wal_senders; i++) > + { > + walsnd = &WalSndCtl->walsnds[i]; > No volatile pointer to prevent code reordering? Yes. Since spin lock is not taken there, volatile is necessary. > */ > typedef struct WalSnd > { > + int slotno; /* index of this slot in WalSnd array */ > pid_t pid; /* this walsender's process id, or 0 */ > slotno is used nowhere. Yep. Attached is the updated version of the patch. > I'll grab the tests and look at them. Many thanks! Regards, -- Fujii Masao
Attachment
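The volatile point agreed on above can be shown in a few lines. This is a simplified illustration with stand-in types, not the real WalSnd struct: when shared-memory fields are read without taking the slot's spinlock, accessing them through a volatile-qualified pointer forces the compiler to issue a fresh load for each field instead of caching or reordering the reads.

```c
#include <assert.h>

typedef struct
{
    int pid;                    /* 0 means the slot is unused */
    int sync_standby_priority;
} walsnd_stub;

static walsnd_stub walsnds[4]; /* stand-in for WalSndCtl->walsnds[] */

static int
read_priority(int slot)
{
    /* volatile qualifier: every field access below is a fresh load
     * from (simulated) shared memory, never a cached register value */
    volatile walsnd_stub *walsnd = &walsnds[slot];

    if (walsnd->pid == 0)
        return -1;              /* unused slot */
    return walsnd->sync_standby_priority;
}
```

Note that volatile only constrains the compiler, not the CPU; it is appropriate here because the reader tolerates slightly stale values, while writers still take the spinlock.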
On Wed, Apr 6, 2016 at 5:01 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Wed, 6 Apr 2016 15:29:12 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHGQEwH2c9buiZ=G7Ko8PQYwiU7=NsDkvCjRKUPSN8j7A@mail.gmail.com> >> > @@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync) >> > } >> > >> > /* >> > + * The pending list contains eventually potentially-synchronized standbys >> > + * and this walsender may be one of them. So once reset am_sync. >> > + */ >> > + if (am_sync != NULL) >> > + *am_sync = false; >> > + >> > + /* >> >> This code seems wrong in the case where this walsender is in the result list. >> So I adopted another logic. Attached is the updated version of the patch. > > You must misread the patch. am_sync is originally set in the loop > just after that for the case. > > ! while (priority <= lowest_priority) > ! { > .. > ! for (cell = list_head(pending); cell != NULL; cell = next) > ! { > ... > ! if (this_priority == priority) > ! { > ! result = lappend_int(result, i); > ! if (am_sync != NULL && walsnd == MyWalSnd) > ! *am_sync = true; But if this walsender has priority 1, *am_sync is set to true in the first loop, not the second one. No? Regards, -- Fujii Masao
Sorry, my code was wrong in the case where the total number of synchronous standbys exceeds the required number and the walsender is at priority 1. Sorry for the noise. At Wed, 06 Apr 2016 17:01:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160406.170151.246853881.horiguchi.kyotaro@lab.ntt.co.jp> > You must misread the patch. am_sync is originally set in the loop > just after that for the case. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Apr 6, 2016 at 5:07 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Wed, Apr 6, 2016 at 3:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> On Wed, Apr 6, 2016 at 2:18 PM, Kyotaro HORIGUCHI >>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>> At Tue, 5 Apr 2016 20:17:21 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwE8_F79BUpC5TmJ7aazXU=Uju0VznFCCKDK57-wNpHV-g@mail.gmail.com> >>>>> >> list_member_int() performs the loop internally. So I'm not sure how much >>>>> >> adding extra list_member_int() here can optimize this processing. >>>>> >> Another idea is to make SyncRepGetSyncStandby() check whether I'm sync >>>>> >> standby or not. In this idea, without adding extra loop, we can exit earilier >>>>> >> in the case where I'm not a sync standby. Does this make sense? >>>>> > >>>>> > The list_member_int() is also performed in the "(snip)" part. So >>>>> > SyncRepGetSyncStandbys() returning am_sync seems making sense. >>>>> > >>>>> > sync_standbys = SyncRepGetSyncStandbys(am_sync); >>>>> > >>>>> > /* >>>>> > * Quick exit if I am not synchronous or there's not >>>>> > * enough synchronous standbys >>>>> > * / >>>>> > if (!*am_sync || list_length(sync_standbys) < SyncRepConfig->num_sync) >>>>> > { >>>>> > list_free(sync_standbys); >>>>> > return false; >>>>> > } >>>>> >>>>> Thanks for the comment! I changed SyncRepGetSyncStandbys() so that >>>>> it checks whether we're managing a sync standby or not. >>>>> Attached is the updated version of the patch. I also applied several >>>>> review comments to the patch. >>>> >>>> It still does list_member_int but it can be gotten rid of as the >>>> attached patch. >>> >>> Thanks for the review! 
>>> >>>> >>>> regards, >>>> >>>> diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c >>>> index 9b2137a..6998bb8 100644 >>>> --- a/src/backend/replication/syncrep.c >>>> +++ b/src/backend/replication/syncrep.c >>>> @@ -590,6 +590,10 @@ SyncRepGetSyncStandbys(bool *am_sync) >>>> if (XLogRecPtrIsInvalid(walsnd->flush)) >>>> continue; >>>> >>>> + /* Notify myself as 'synchonized' if I am */ >>>> + if (am_sync != NULL && walsnd == MyWalSnd) >>>> + *am_sync = true; >>>> + >>>> /* >>>> * If the priority is equal to 1, consider this standby as sync >>>> * and append it to the result. Otherwise append this standby >>>> @@ -598,8 +602,6 @@ SyncRepGetSyncStandbys(bool *am_sync) >>>> if (this_priority == 1) >>>> { >>>> result = lappend_int(result, i); >>>> - if (am_sync != NULL && walsnd == MyWalSnd) >>>> - *am_sync = true; >>>> if (list_length(result) == SyncRepConfig->num_sync) >>>> { >>>> list_free(pending); >>>> @@ -630,9 +632,6 @@ SyncRepGetSyncStandbys(bool *am_sync) >>>> { >>>> bool needfree = (result != NIL && pending != NIL); >>>> >>>> - if (am_sync != NULL && !(*am_sync)) >>>> - *am_sync = list_member_int(pending, MyWalSnd->slotno); >>>> - >>>> result = list_concat(result, pending); >>>> if (needfree) >>>> pfree(pending); >>>> @@ -640,6 +639,13 @@ SyncRepGetSyncStandbys(bool *am_sync) >>>> } >>>> >>>> /* >>>> + * The pending list contains eventually potentially-synchronized standbys >>>> + * and this walsender may be one of them. So once reset am_sync. >>>> + */ >>>> + if (am_sync != NULL) >>>> + *am_sync = false; >>>> + >>>> + /* >>> >>> This code seems wrong in the case where this walsender is in the result list. >>> So I adopted another logic. Attached is the updated version of the patch. >> >> To be honest, this is a nice patch that we have here, and it received >> a fair amount of work. I have been playing with it a bit but I could >> not break it. >> >> Here are few things I have noticed: > > Thanks for the review! 
> >> + for (i = 0; i < max_wal_senders; i++) >> + { >> + walsnd = &WalSndCtl->walsnds[i]; >> No volatile pointer to prevent code reordering? > > Yes. Since spin lock is not taken there, volatile is necessary. > >> */ >> typedef struct WalSnd >> { >> + int slotno; /* index of this slot in WalSnd array */ >> pid_t pid; /* this walsender's process id, or 0 */ >> slotno is used nowhere. > > Yep. Attached is the updated version of the patch. Okay, I pushed the patch! Many thanks to all involved in the development of this feature! Regards, -- Fujii Masao
On Wed, Apr 6, 2016 at 5:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > Okay, I pushed the patch! > Many thanks to all involved in the development of this feature! I think that I am crying... Really cool to see this milestone accomplished. -- Michael
On Wed, Apr 6, 2016 at 8:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Apr 6, 2016 at 11:17 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> >> On Tue, Apr 5, 2016 at 11:40 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> >> >> >> > 2. >> >> > pg_stat_get_wal_senders() >> >> > { >> >> > .. >> >> > /* >> >> > ! * Allocate and update the config data of synchronous replication, >> >> > ! * and then get the currently active synchronous standbys. >> >> > */ >> >> > + SyncRepUpdateConfig(); >> >> > LWLockAcquire(SyncRepLock, LW_SHARED); >> >> > ! sync_standbys = SyncRepGetSyncStandbys(); >> >> > LWLockRelease(SyncRepLock); >> >> > .. >> >> > } >> >> > >> >> > Why is it important to update the config with patch? Earlier also >> >> > any >> >> > update to config between calls wouldn't have been visible. >> >> >> >> Because a backend has no chance to call SyncRepUpdateConfig() and >> >> parse the latest value of s_s_names if SyncRepUpdateConfig() is not >> >> called here. This means that pg_stat_replication may return the >> >> information >> >> based on the old value of s_s_names. >> >> >> > >> > Thats right, but without this patch also won't pg_stat_replication can >> > show >> > old information? If no, why so? >> >> Without the patch, when s_s_names is changed and SIGHUP is sent, >> a backend calls ProcessConfigFile(), parse the configuration file and >> set the global variable SyncRepStandbyNames to the latest value of >> s_s_names. When pg_stat_replication is accessed, a backend calculates >> which standby is synchronous based on that latest value in >> SyncRepStandbyNames, >> and then displays the information of sync replication. >> >> With the patch, basically the same steps are executed when s_s_names is >> changed. But the difference is that, with the patch, SyncRepUpdateConfig() >> must be called after ProcessConfigFile() is called before the calculation >> of >> sync standbys. 
So I just added the call of SyncRepUpdateConfig() to >> pg_stat_get_wal_senders(). >> > > Then why to call it just in pg_stat_get_wal_senders(), isn't it better if we > call it always after ProcessConfigFile() (after setting SyncRepStandbyNames) > >> BTW, we can move SyncRepUpdateConfig() just after ProcessConfigFile() >> from pg_stat_get_wal_senders() and every backends always parse the value >> of s_s_names when the setting is changed. >> > > That sounds appropriate, but not sure what is exact place to call it. Maybe just after the following ProcessConfigFile(). ----------------------------------------- /* * (6) check for any other interesting events that happened while we * slept. */ if (got_SIGHUP) { got_SIGHUP = false; ProcessConfigFile(PGC_SIGHUP); } ----------------------------------------- If we do the move, we also need to either (1) make postmaster call SyncRepUpdateConfig() and pass the parsed result to any forked backends via a file like write_nondefault_variables() does for EXEC_BACKEND environment, or (2) make a backend call SyncRepUpdateConfig() during its initialization phase so that the first call of pg_stat_replication can use the parsed result. (1) seems complicated and overkill. (2) may add very small overhead into the fork of a backend. It would be almost negligible, though. So which logic should we adopt? Regards, -- Fujii Masao
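The lazy re-parse pattern being weighed above (the SIGHUP path only records the raw GUC string; the parsed form is refreshed on demand before use, the way SyncRepUpdateConfig() is invoked from pg_stat_get_wal_senders()) can be sketched as follows. This is an illustrative stand-in, not PostgreSQL's code; the "N (...)" value is reduced here to just its leading count:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static char sync_standby_names[64] = "";   /* raw GUC value */
static bool names_changed = false;

static int  parsed_num_sync = 0;           /* cached parsed form */

/* Stand-in for ProcessConfigFile(): remember the new raw value only. */
static void
process_config_file(const char *new_value)
{
    strncpy(sync_standby_names, new_value, sizeof(sync_standby_names) - 1);
    sync_standby_names[sizeof(sync_standby_names) - 1] = '\0';
    names_changed = true;
}

/* Stand-in for SyncRepUpdateConfig(): re-parse only when needed. */
static void
sync_rep_update_config(void)
{
    if (!names_changed)
        return;
    /* parse the leading "N" of a "N (name, name, ...)" string */
    parsed_num_sync = atoi(sync_standby_names);
    names_changed = false;
}

/* Stand-in for pg_stat_get_wal_senders(): refresh before reading. */
static int
get_num_sync(void)
{
    sync_rep_update_config();
    return parsed_num_sync;
}
```

The dirty-flag makes the refresh call idempotent and cheap, which is what makes option (2) in the message above (re-parsing during backend initialization or on each use) plausible: the parse only actually runs when the setting has changed.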
On Wed, Apr 6, 2016 at 11:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Apr 6, 2016 at 7:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> >> On Wed, Apr 6, 2016 at 8:59 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > >> >> BTW, we can move SyncRepUpdateConfig() just after ProcessConfigFile() >> >> from pg_stat_get_wal_senders() and every backends always parse the >> >> value >> >> of s_s_names when the setting is changed. >> >> >> > >> > That sounds appropriate, but not sure what is exact place to call it. >> >> Maybe just after the following ProcessConfigFile(). >> >> ----------------------------------------- >> /* >> * (6) check for any other interesting events that happened while we >> * slept. >> */ >> if (got_SIGHUP) >> { >> got_SIGHUP = false; >> ProcessConfigFile(PGC_SIGHUP); >> } >> ----------------------------------------- >> >> If we do the move, we also need to either (1) make postmaster call >> SyncRepUpdateConfig() and pass the parsed result to any forked backends >> via a file like write_nondefault_variables() does for EXEC_BACKEND >> environment, or (2) make a backend call SyncRepUpdateConfig() during >> its initialization phase so that the first call of pg_stat_replication >> can use the parsed result. (1) seems complicated and overkill. >> (2) may add very small overhead into the fork of a backend. It would >> be almost negligible, though. So which logic should we adopt? >> > > Won't it be possible to have assign_* function for synchronous_standby_names > as we have for some of the other settings like assign_XactIsoLevel and then > call SyncRepUpdateConfig() in that function? It's possible, but it still seems to need (1), i.e., the variable that the assign_XXX function assigns needs to be passed to a backend via a file in an EXEC_BACKEND environment. Regards, -- Fujii Masao
>
> On Wed, Apr 6, 2016 at 11:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Apr 6, 2016 at 7:03 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> >>
> >> On Wed, Apr 6, 2016 at 8:59 PM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> >
> >> >> BTW, we can move SyncRepUpdateConfig() just after ProcessConfigFile()
> >> >> from pg_stat_get_wal_senders() and every backends always parse the
> >> >> value
> >> >> of s_s_names when the setting is changed.
> >> >>
> >> >
> >> > That sounds appropriate, but not sure what is exact place to call it.
> >>
> >> Maybe just after the following ProcessConfigFile().
> >>
> >> -----------------------------------------
> >> /*
> >> * (6) check for any other interesting events that happened while we
> >> * slept.
> >> */
> >> if (got_SIGHUP)
> >> {
> >> got_SIGHUP = false;
> >> ProcessConfigFile(PGC_SIGHUP);
> >> }
> >> -----------------------------------------
> >>
> >> If we do the move, we also need to either (1) make postmaster call
> >> SyncRepUpdateConfig() and pass the parsed result to any forked backends
> >> via a file like write_nondefault_variables() does for EXEC_BACKEND
> >> environment, or (2) make a backend call SyncRepUpdateConfig() during
> >> its initialization phase so that the first call of pg_stat_replication
> >> can use the parsed result. (1) seems complicated and overkill.
> >> (2) may add very small overhead into the fork of a backend. It would
> >> be almost negligible, though. So which logic should we adopt?
> >>
> >
> > Won't it be possible to have assign_* function for synchronous_standby_names
> > as we have for some of the other settings like assign_XactIsoLevel and then
> > call SyncRepUpdateConfig() in that function?
>
> It's possible, but still seems to need (1), i.e., the variable that assign_XXX
> function assigned needs to be passed to a backend via file for EXEC_BACKEND
> environment.
>
On Thu, Apr 7, 2016 at 1:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 8:11 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> [...]
>> If we do the move, we also need to either (1) make postmaster call
>> SyncRepUpdateConfig() and pass the parsed result to any forked backends
>> via a file like write_nondefault_variables() does for EXEC_BACKEND
>> environment, or (2) make a backend call SyncRepUpdateConfig() during
>> its initialization phase so that the first call of pg_stat_replication
>> can use the parsed result. (1) seems complicated and overkill.
>> (2) may add very small overhead into the fork of a backend. It would
>> be almost negligible, though. So which logic should we adopt?
>
> But for that, I think we don't need to do anything extra. I mean
> write_nondefault_variables() will automatically write the non-default value
> of variable and then during backend initialization, it will call
> read_nondefault_variables which will call set_config_option for non-default
> parameters and that should set the required value if we have assign_*
> function defined for the variable.

Yes if the variable that we'd like to pass to a backend is BOOL, INT,
REAL, STRING or ENUM. But SyncRepConfig variable is a bit more
complicated. So ISTM that write_one_nondefault_variable() needs to be
updated so that SyncRepConfig is written to a file.

Regards,

--
Fujii Masao
>
> On Thu, Apr 7, 2016 at 1:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > But for that, I think we don't need to do anything extra. I mean
> > write_nondefault_variables() will automatically write the non-default value
> > of variable and then during backend initialization, it will call
> > read_nondefault_variables which will call set_config_option for non-default
> > parameters and that should set the required value if we have assign_*
> > function defined for the variable.
>
> Yes if the variable that we'd like to pass to a backend is BOOL, INT,
> REAL, STRING or ENUM. But SyncRepConfig variable is a bit more
> complicated.
>
>
> So ISTM that write_one_nondefault_variable() needs to
> be updated so that SyncRepConfig is written to a file.
>
On Thu, Apr 7, 2016 at 2:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Apr 7, 2016 at 10:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> [...]
>> Yes if the variable that we'd like to pass to a backend is BOOL, INT,
>> REAL, STRING or ENUM. But SyncRepConfig variable is a bit more
>> complicated.
>
> SyncRepConfig is a parsed result of SyncRepStandbyNames, why you want to
> pass that? I assume that current non-default value of SyncRepStandbyNames
> will be passed via write_nondefault_variables(), so we can use that to
> regenerate SyncRepConfig.

Yes, so SyncRepUpdateConfig() needs to be called by a backend after fork,
to regenerate SyncRepConfig from the passed value of SyncRepStandbyNames.
This is the approach of (2) which I explained upthread. assign_XXX function
doesn't seem to be helpful for this case.

Regards,

--
Fujii Masao
>
> On Thu, Apr 7, 2016 at 2:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Apr 7, 2016 at 10:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> >>
> >> On Thu, Apr 7, 2016 at 1:22 PM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> >
> >> > But for that, I think we don't need to do anything extra. I mean
> >> > write_nondefault_variables() will automatically write the non-default
> >> > value
> >> > of variable and then during backend initialization, it will call
> >> > read_nondefault_variables which will call set_config_option for
> >> > non-default
> >> > parameters and that should set the required value if we have assign_*
> >> > function defined for the variable.
> >>
> >> Yes if the variable that we'd like to pass to a backend is BOOL, INT,
> >> REAL, STRING or ENUM. But SyncRepConfig variable is a bit more
> >> complicated.
> >>
> >
> > SyncRepConfig is a parsed result of SyncRepStandbyNames, why you want to
> > pass that? I assume that current non-default value of SyncRepStandbyNames
> > will be passed via write_nondefault_variables(), so we can use that to
> > regenerate SyncRepConfig.
>
> Yes, so SyncRepUpdateConfig() needs to be called by a backend after fork,
> to regenerate SyncRepConfig from the passed value of SyncRepStandbyNames.
> This is the approach of (2) which I explained upthread. assign_XXX function
> doesn't seem to be helpful for this case.
>
On 2016/04/07 15:26, Fujii Masao wrote:
> On Thu, Apr 7, 2016 at 2:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> SyncRepConfig is a parsed result of SyncRepStandbyNames, why you want to
>> pass that? I assume that current non-default value of SyncRepStandbyNames
>> will be passed via write_nondefault_variables(), so we can use that to
>> regenerate SyncRepConfig.
>
> Yes, so SyncRepUpdateConfig() needs to be called by a backend after fork,
> to regenerate SyncRepConfig from the passed value of SyncRepStandbyNames.
> This is the approach of (2) which I explained upthread. assign_XXX function
> doesn't seem to be helpful for this case.

I don't see why there is need to call SyncRepUpdateConfig() after every fork,
or anywhere outside syncrep.c/walsender.c for that matter. AIUI, only a
walsender or a backend that runs pg_stat_get_wal_senders() ever needs to
run SyncRepUpdateConfig() to get the parsed synchronous standbys info from
the string that is SyncRepStandbyNames. For the rest of the world, it's just
a string GUC and is written to and read from any external file as one (e.g.
the file that write_nondefault_variables() writes to in the EXEC_BACKEND
case).

I hope I'm not entirely missing the point of the discussion you and Amit
are having.

Thanks,
Amit
On Thu, Apr 7, 2016 at 1:30 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2016/04/07 15:26, Fujii Masao wrote:
>> Yes, so SyncRepUpdateConfig() needs to be called by a backend after fork,
>> to regenerate SyncRepConfig from the passed value of SyncRepStandbyNames.
>> This is the approach of (2) which I explained upthread. assign_XXX function
>> doesn't seem to be helpful for this case.
>
> I don't see why there is need to SyncRepUpdateConfig() after every fork or
> anywhere outside syncrep.c/walsender.c for that matter. AIUI, only
> walsender or a backend that runs pg_stat_get_wal_senders() ever needs to
> run SyncRepUpdateConfig() to get parsed synchronous standbys info from the
> string that is SyncRepStandbyNames.

So if we go by this, each time a backend calls pg_stat_get_wal_senders, it
needs to do the parsing to form SyncRepConfig whether it's changed or not
from the previous time. I understand that this is not a performance critical
path, but still if we can do it in some other optimal way which doesn't hurt
any other path, then it will be better.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 7, 2016 at 7:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Apr 7, 2016 at 1:30 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> [...]
>> I don't see why there is need to SyncRepUpdateConfig() after every fork or
>> anywhere outside syncrep.c/walsender.c for that matter. AIUI, only
>> walsender or a backend that runs pg_stat_get_wal_senders() ever needs to
>> run SyncRepUpdateConfig() to get parsed synchronous standbys info from the
>> string that is SyncRepStandbyNames.
>
> So if we go by this each time backend calls pg_stat_get_wal_senders, it
> needs to do parsing to form SyncRepConfig whether it's changed or not from
> previous time. I understand that this is not a performance critical path,
> but still if we can do it in some other optimal way which doesn't hurt any
> other path, then it will be better.

So, will you write the patch? Either the current implementation or the
approach you're suggesting works for me. If you really want to change the
current one, I'm happy to review that.

Regards,

--
Fujii Masao
On Wed, Apr 6, 2016 at 5:04 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> Here are few things I have noticed:
>> +    for (i = 0; i < max_wal_senders; i++)
>> +    {
>> +        walsnd = &WalSndCtl->walsnds[i];
>> No volatile pointer to prevent code reordering?
>>
>>  */
>>  typedef struct WalSnd
>>  {
>> +    int         slotno;     /* index of this slot in WalSnd array */
>>      pid_t       pid;        /* this walsender's process id, or 0 */
>> slotno is used nowhere.
>>
>> I'll grab the tests and look at them.
>
> So I had a look at those tests and finished with the attached:
> - patch 1 adds a reload routine to PostgresNode
> - patch 2 the list of tests.

Thanks for updating the patches!

Attached is the refactored version of the patch.

Regards,

--
Fujii Masao
Attachment
Okay, I pushed the patch!
Many thanks to all involved in the development of this feature!
Hi,
I spotted a couple of places in the documentation that still implied there was only one synchronous standby. Please see suggested fixes attached.
http://www.enterprisedb.com
Attachment
On Fri, Apr 8, 2016 at 12:55 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Apr 6, 2016 at 8:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Okay, I pushed the patch!
>> Many thanks to all involved in the development of this feature!
>
> I spotted a couple of places in the documentation that still implied there
> was only one synchronous standby. Please see suggested fixes attached.

Thanks! Applied.

Regards,

--
Fujii Masao
On Thu, Apr 7, 2016 at 11:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 5:04 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> [...]
>> So I had a look at those tests and finished with the attached:
>> - patch 1 adds a reload routine to PostgresNode
>> - patch 2 the list of tests.
>
> Thanks for updating the patches!
>
> Attached is the refactored version of the patch.

Thanks. This looks good to me.

.gitattributes complains a bit:
$ git diff n_sync --check
src/test/recovery/t/007_sync_rep.pl:22: trailing whitespace.
+    $self->reload;

--
Michael
On Fri, Apr 8, 2016 at 2:26 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Apr 7, 2016 at 11:43 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> [...]
>> Attached is the refactored version of the patch.
>
> Thanks. This looks good to me.
>
> .gitattributes complains a bit:
> $ git diff n_sync --check
> src/test/recovery/t/007_sync_rep.pl:22: trailing whitespace.
> +    $self->reload;

Thanks for the review! I've finally pushed the patch.

Regards,

--
Fujii Masao
On Fri, Apr 8, 2016 at 4:50 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> [...]
> Thanks for the review! I've finally pushed the patch.

Thank you!

Regards,

--
Masahiko Sawada
>
> On Thu, Apr 7, 2016 at 7:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > So if we go by this each time backend calls pg_stat_get_wal_senders, it
> > needs to do parsing to form SyncRepConfig whether it's changed or not from
> > previous time. I understand that this is not a performance critical path,
> > but still if we can do it in some other optimal way which doesn't hurt any
> > other path, then it will be better.
>
> So, will you write the patch? Either current implementation or
> the approach you're suggesting works to me. If you really want
> to change the current one, I'm happy to review that.
>
Attachment
On Wed, Apr 6, 2016 at 1:23 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Okay, I pushed the patch!
> Many thanks to all involved in the development of this feature!

Thanks, a nice feature.

When I compile now without cassert, I get the compiler warning:

syncrep.c: In function 'SyncRepUpdateConfig':
syncrep.c:878:6: warning: variable 'parse_rc' set but not used
[-Wunused-but-set-variable]

Cheers,

Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> When I compile now without cassert, I get the compiler warning:
> syncrep.c: In function 'SyncRepUpdateConfig':
> syncrep.c:878:6: warning: variable 'parse_rc' set but not used
> [-Wunused-but-set-variable]

If there's a good reason for that to be an Assert, I don't see it.
There are no callers of SyncRepUpdateConfig that look like they
need to, or should expect not to have to, tolerate errors.
I think the way to fix this is to turn the Assert into a plain
old test-and-ereport-ERROR.

			regards, tom lane
On Sat, Apr 9, 2016 at 12:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> When I compile now without cassert, I get the compiler warning:
>> syncrep.c: In function 'SyncRepUpdateConfig':
>> syncrep.c:878:6: warning: variable 'parse_rc' set but not used
>> [-Wunused-but-set-variable]
>
> If there's a good reason for that to be an Assert, I don't see it.
> There are no callers of SyncRepUpdateConfig that look like they
> need to, or should expect not to have to, tolerate errors.
> I think the way to fix this is to turn the Assert into a plain
> old test-and-ereport-ERROR.

I've changed the draft patch Amit implemented so that it doesn't parse
twice (check_hook and assign_hook), so the assertion that was in the
assign_hook is no longer necessary.

Please find attached.

Regards,

--
Masahiko Sawada
Attachment
On Mon, Apr 11, 2016 at 10:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've changed the draft patch Amit implemented so that it doesn't parse
> twice (check_hook and assign_hook), so the assertion that was in the
> assign_hook is no longer necessary.
>
> Please find attached.

Thanks for the patch!

When I emptied s_s_names, reloaded the configuration file, set it to
'standby1' and reloaded the configuration file again, the master crashed
with the following error.

*** glibc detected *** postgres: wal sender process postgres [local] streaming 0/3015F18: munmap_chunk(): invalid pointer: 0x00000000024d9a40 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3be8e75f4e]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x97dae2]
postgres: wal sender process postgres [local] streaming 0/3015F18(set_config_option+0x12cb)[0x982242]
postgres: wal sender process postgres [local] streaming 0/3015F18(SetConfigOption+0x4b)[0x9827ff]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x988b4e]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x98af40]
postgres: wal sender process postgres [local] streaming 0/3015F18(ProcessConfigFile+0x9f)[0x98a98b]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x7b50fd]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x7b359c]
postgres: wal sender process postgres [local] streaming 0/3015F18(exec_replication_command+0x1a7)[0x7b47b6]
postgres: wal sender process postgres [local] streaming 0/3015F18(PostgresMain+0x772)[0x8141b6]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x7896f7]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x788e62]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x785544]
postgres: wal sender process postgres [local] streaming 0/3015F18(PostmasterMain+0x1134)[0x784c08]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x6ce12e]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3be8e1ed5d]
postgres: wal sender process postgres [local] streaming 0/3015F18[0x467e99]

Regards,

--
Fujii Masao
On Sat, Apr 9, 2016 at 12:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> When I compile now without cassert, I get the compiler warning:
>> syncrep.c: In function 'SyncRepUpdateConfig':
>> syncrep.c:878:6: warning: variable 'parse_rc' set but not used
>> [-Wunused-but-set-variable]

Thanks for the report!

> If there's a good reason for that to be an Assert, I don't see it.
> There are no callers of SyncRepUpdateConfig that look like they
> need to, or should expect not to have to, tolerate errors.
> I think the way to fix this is to turn the Assert into a plain
> old test-and-ereport-ERROR.

Okay, I pushed that change. Thanks for the suggestion!

Regards,

--
Fujii Masao
On Mon, Apr 11, 2016 at 1:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Apr 11, 2016 at 10:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I've changed the draft patch Amit implemented so that it doesn't parse
>> twice (check_hook and assign_hook), so the assertion that was in the
>> assign_hook is no longer necessary.
>>
>> Please find attached.
>
> Thanks for the patch!
>
> When I emptied s_s_names, reloaded the configuration file, set it to 'standby1'
> and reloaded the configuration file again, the master crashed with
> the following error.
>
> [glibc backtrace snipped, same as above]

Thank you for reviewing.

SyncRepUpdateConfig() seems to be no longer necessary. Attached updated version.

Regards,

--
Masahiko Sawada
Attachment
On Mon, Apr 11, 2016 at 5:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Apr 11, 2016 at 1:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Thanks for the patch!
>>
>> When I emptied s_s_names, reloaded the configuration file, set it to 'standby1'
>> and reloaded the configuration file again, the master crashed with
>> the following error.
>>
>> [glibc backtrace snipped, same as above]
>
> Thank you for reviewing.
>
> SyncRepUpdateConfig() seems to be no longer necessary.

Really? I was thinking that something like that function needs to be called
at the beginning of a backend and walsender in EXEC_BACKEND case. No?

Regards,

--
Fujii Masao
On Mon, Apr 11, 2016 at 8:47 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Apr 11, 2016 at 5:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Mon, Apr 11, 2016 at 1:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> On Mon, Apr 11, 2016 at 10:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Sat, Apr 9, 2016 at 12:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>>> Jeff Janes <jeff.janes@gmail.com> writes: >>>>>> When I compile now without cassert, I get the compiler warning: >>>>> >>>>>> syncrep.c: In function 'SyncRepUpdateConfig': >>>>>> syncrep.c:878:6: warning: variable 'parse_rc' set but not used >>>>>> [-Wunused-but-set-variable] >>>>> >>>>> If there's a good reason for that to be an Assert, I don't see it. >>>>> There are no callers of SyncRepUpdateConfig that look like they >>>>> need to, or should expect not to have to, tolerate errors. >>>>> I think the way to fix this is to turn the Assert into a plain >>>>> old test-and-ereport-ERROR. >>>>> >>>> >>>> I've changed the draft patch Amit implemented so that it doesn't parse >>>> twice(check_hook and assign_hook). >>>> So assertion that was in assign_hook is no longer necessary. >>>> >>>> Please find attached. >>> >>> Thanks for the patch! >>> >>> When I emptied s_s_names, reloaded the configration file, set it to 'standby1' >>> and reloaded the configuration file again, the master crashed with >>> the following error. 
>>> [glibc backtrace identical to the one quoted upthread, trimmed]
>>
>> Thank you for reviewing.
>>
>> SyncRepUpdateConfig() seems to be no longer necessary.
>
> Really? I was thinking that something like that function needs to
> be called at the beginning of a backend and walsender in
> EXEC_BACKEND case. No?

Hmm, in the EXEC_BACKEND case, I guess that each child process calls
read_nondefault_variables, which parses and validates these
configuration parameters in SubPostmasterMain.

The previous patch didn't apply cleanly to HEAD; attached is an
updated version.

Regards,

--
Masahiko Sawada
Attachment
On Tue, Apr 12, 2016 at 9:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Mon, Apr 11, 2016 at 8:47 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Mon, Apr 11, 2016 at 5:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Mon, Apr 11, 2016 at 1:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> On Mon, Apr 11, 2016 at 10:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> On Sat, Apr 9, 2016 at 12:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>>>> Jeff Janes <jeff.janes@gmail.com> writes: >>>>>>> When I compile now without cassert, I get the compiler warning: >>>>>> >>>>>>> syncrep.c: In function 'SyncRepUpdateConfig': >>>>>>> syncrep.c:878:6: warning: variable 'parse_rc' set but not used >>>>>>> [-Wunused-but-set-variable] >>>>>> >>>>>> If there's a good reason for that to be an Assert, I don't see it. >>>>>> There are no callers of SyncRepUpdateConfig that look like they >>>>>> need to, or should expect not to have to, tolerate errors. >>>>>> I think the way to fix this is to turn the Assert into a plain >>>>>> old test-and-ereport-ERROR. >>>>>> >>>>> >>>>> I've changed the draft patch Amit implemented so that it doesn't parse >>>>> twice(check_hook and assign_hook). >>>>> So assertion that was in assign_hook is no longer necessary. >>>>> >>>>> Please find attached. >>>> >>>> Thanks for the patch! >>>> >>>> When I emptied s_s_names, reloaded the configration file, set it to 'standby1' >>>> and reloaded the configuration file again, the master crashed with >>>> the following error. 
>>>> [glibc backtrace identical to the one quoted upthread, trimmed]
>>>
>>> Thank you for reviewing.
>>>
>>> SyncRepUpdateConfig() seems to be no longer necessary.
>>
>> Really? I was thinking that something like that function needs to
>> be called at the beginning of a backend and walsender in
>> EXEC_BACKEND case. No?
>
> Hmm, in EXEC_BACKEND case, I guess that each child process calls
> read_nondefault_variables that parses and validates these
> configuration parameters in SubPostmasterMain.

SyncRepStandbyNames is passed but SyncRepConfig is not, I think.

Regards,

--
Fujii Masao
At Wed, 13 Apr 2016 04:43:35 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwEmZhBdjb1x3+KtUU9VV5xnhgCBO4TejibOXF_VHaeVXg@mail.gmail.com>
> >>> SyncRepUpdateConfig() seems to be no longer necessary.
> >>
> >> Really? I was thinking that something like that function needs to
> >> be called at the beginning of a backend and walsender in
> >> EXEC_BACKEND case. No?
> >
> > Hmm, in EXEC_BACKEND case, I guess that each child process calls
> > read_nondefault_variables that parses and validates these
> > configuration parameters in SubPostmasterMain.
>
> SyncRepStandbyNames is passed but SyncRepConfig is not, I think.

SyncRepStandbyNames is passed to exec'ed backends by
read_nondefault_variables, which calls set_config_option, which calls
check/assign_s_s_names and then syncrep_yyparse, which sets
SyncRepConfig.

Since a guessing battle is a waste of time, I actually built and ran this
on Windows 7 and observed that SyncRepConfig has been set before
WalSndLoop starts.

> LOG: check_s_s_names(pid=20596, newval=)
> LOG: assign_s_s_names(pid=20596, newval=, SyncRepConfig=00000000)
> LOG: read_nondefault_variables(pid=20596)
> LOG: set_config_option(synchronous_standby_names)(pid=20596)
> LOG: check_s_s_names(pid=20596, newval=2[standby,sby2,sby3])
> LOG: assign_s_s_names(pid=20596, newval=2[standby,sby2,sby3], SyncRepConfig=01383598)
> LOG: WalSndLoop(pid=20596)

By the way, the patch assumes that one call to check_s_s_names is followed
by exactly one call to assign_s_s_names. I suppose that myextra should be
handled without such an assumption. Also, myextra should be given a saner
name.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
>
> At Wed, 13 Apr 2016 04:43:35 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwEmZhBdjb1x3+KtUU9VV5xnhgCBO4TejibOXF_VHaeVXg@mail.gmail.com>
> > >>> Thank you for reviewing.
> > >>>
> > >>> SyncRepUpdateConfig() seems to be no longer necessary.
> > >>
> > >> Really? I was thinking that something like that function needs to
> > >> be called at the beginning of a backend and walsender in
> > >> EXEC_BACKEND case. No?
> > >>
> > >
> > > Hmm, in EXEC_BACKEND case, I guess that each child process calls
> > > read_nondefault_variables that parses and validates these
> > > configuration parameters in SubPostmasterMain.
> >
> > SyncRepStandbyNames is passed but SyncRepConfig is not, I think.
>
> SyncRepStandbyNames is passed to exec'ed backends by
> read_nondefault_variables, which calls set_config_option, which
> calls check/assign_s_s_names then syncrep_yyparse, which sets
> SyncRepConfig.
>
> Since guess battle is a waste of time, I actually built and ran
> on Windows7 and observed that SyncRepConfig has been set before
> WalSndLoop starts.
>
On Thu, Apr 14, 2016 at 11:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 13, 2016 at 1:44 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> [earlier quotes trimmed]
>>
>> SyncRepStandbyNames is passed to exec'ed backends by
>> read_nondefault_variables, which calls set_config_option, which
>> calls check/assign_s_s_names then syncrep_yyparse, which sets
>> SyncRepConfig.
>>
>> Since guess battle is a waste of time, I actually built and ran
>> on Windows7 and observed that SyncRepConfig has been set before
>> WalSndLoop starts.
>
> Yes, this is what I was trying to explain to Fujii-san upthread and I have
> also verified that the same works on Windows.

Oh, okay, understood. Thanks for explaining that!

> I think one point which we should try to ensure in this patch is whether
> it is good to use TopMemoryContext to allocate the memory in the check or
> assign function or should we allocate some temporary context (like we do
> in load_tzoffsets()) to perform parsing and then delete the same at end.

Seems yes, if some memory is allocated by palloc and is not freed while
parsing s_s_names.

Here is another comment on the patch.

-SyncRepFreeConfig(SyncRepConfigData *config)
+SyncRepFreeConfig(SyncRepConfigData *config, bool itself)

SyncRepFreeConfig() was extended so that it accepts a second boolean
argument, but it's always called with the second argument = false. So I
just wonder why that second argument is required.

 SyncRepConfigData *config =
-    (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
+    (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));

Why should we use malloc instead of palloc here?

*If* we use malloc, its return value must be checked.

Regards,

--
Fujii Masao
At Thu, 14 Apr 2016 12:42:06 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwH7F5gWfdCT71Ucix_w+8ipR1Owzv9k4VnA1fcMYyfr6w@mail.gmail.com>
> [earlier quotes trimmed]
>
> Why should we use malloc instead of palloc here?
>
> *If* we use malloc, its return value must be checked.

Because it should live independently of any memory context, as GUC values
do. guc.c provides guc_malloc for this purpose, which is a malloc with
some simple error handling, so having a walsender_malloc would be
reasonable.

I don't think it's good to use TopMemoryContext for the syncrep parser.
syncrep_scanner.l uses palloc, which basically causes a memory leak in
all postgres processes.

It might be better if the parser worked in the current memory context and
the caller copied the result into malloc'ed memory. But some list-creation
functions use palloc, and changing SyncRepConfigData.members to char **
would be messier. Any ideas?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Apr 14, 2016 at 11:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Yes, this is what I was trying to explain to Fujii-san upthread and I have > also verified that the same works on Windows. If you could, it would be nice as well to check that nothing breaks with VS when using vcregress recoverycheck. -- Michael
On Thu, Apr 14, 2016 at 1:11 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Thu, 14 Apr 2016 12:42:06 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwH7F5gWfdCT71Ucix_w+8ipR1Owzv9k4VnA1fcMYyfr6w@mail.gmail.com> >> > Yes, this is what I was trying to explain to Fujii-san upthread and I have >> > also verified that the same works on Windows. >> >> Oh, okay, understood. Thanks for explaining that! >> >> > I think one point which we >> > should try to ensure in this patch is whether it is good to use >> > TopMemoryContext to allocate the memory in the check or assign function or >> > should we allocate some temporary context (like we do in load_tzoffsets()) >> > to perform parsing and then delete the same at end. >> >> Seems yes if some memories are allocated by palloc and they are not >> free'd while parsing s_s_names. >> >> Here are another comment for the patch. >> >> -SyncRepFreeConfig(SyncRepConfigData *config) >> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself) >> >> SyncRepFreeConfig() was extended so that it accepts the second boolean >> argument. But it's always called with the second argument = false. So, >> I just wonder why that second argument is required. >> >> SyncRepConfigData *config = >> - (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData)); >> + (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData)); >> >> Why should we use malloc instead of palloc here? >> >> *If* we use malloc, its return value must be checked. > > Because it should live irrelevant to any memory context, as guc > values are so. guc.c provides guc_malloc for this purpose, which > is a malloc having some simple error handling, so having > walsender_malloc would be reasonable. > > I don't think it's good to use TopMemoryContext for syncrep > parser. syncrep_scanner.l uses palloc. This basically causes a > memory leak on all postgres processes. 
> It might be better if the parser works on the current memory
> context and the caller copies the result on the malloc'ed
> memory. But some list-creation functions using palloc.. Changing
> SyncRepConfigData.members to be char** would be messier..

The SyncRepGetSyncStandby logic deeply assumes that the sync standby
names are constructed as a list, so I think that this would entail a
radical change in SyncRepGetStandby.

Another idea is to prepare functions that allocate and free list
elements using malloc/free.

Regards,

--
Masahiko Sawada
Hello,

At Thu, 14 Apr 2016 13:24:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqThcdv+CrWyWbFQGYL0GJFZeWVGXs5K9x65WWgbqkJ7YQ@mail.gmail.com>
> On Thu, Apr 14, 2016 at 11:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yes, this is what I was trying to explain to Fujii-san upthread and I have
> > also verified that the same works on Windows.
>
> If you could, it would be nice as well to check that nothing breaks
> with VS when using vcregress recoverycheck.

The test failed for me because my environment is not prepared for TAP
tests. But in the process I noticed that vcregress.pl shows a slightly
wrong help message.

> >vcregress
> Usage: vcregress.pl <check|installcheck|plcheck|contribcheck|isolationcheck|ecpgcheck|upgradecheck> [schedule]

The new message in the following diff matches the regexp that vcregress
uses to check its parameter.

======
diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl
index 3d14544..08e2acc 100644
--- a/src/tools/msvc/vcregress.pl
+++ b/src/tools/msvc/vcregress.pl
@@ -548,6 +548,6 @@ sub usage
 {
 	print STDERR
 	  "Usage: vcregress.pl ",
-"<check|installcheck|plcheck|contribcheck|isolationcheck|ecpgcheck|upgradecheck> [schedule]\n";
+"<check|installcheck|plcheck|contribcheck|modulescheck|ecpgcheck|isolationcheck|upgradecheck|bincheck|recoverycheck> [schedule]\n";
 	exit(1);
 }
=====

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 14 Apr 2016 17:25:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160414.172539.34325458.horiguchi.kyotaro@lab.ntt.co.jp>
> At Thu, 14 Apr 2016 13:24:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqThcdv+CrWyWbFQGYL0GJFZeWVGXs5K9x65WWgbqkJ7YQ@mail.gmail.com>
> > If you could, it would be nice as well to check that nothing breaks
> > with VS when using vcregress recoverycheck.

IPC::Run is not installed in the Active Perl on my environment, and
ActiveState seems to be saying that IPC-Run cannot be compiled on
Windows; ppm doesn't show IPC-Run. Is there any way to run the TAP tests
other than this one?

https://code.activestate.com/ppm/IPC-Run/

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Apr 14, 2016 at 5:25 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl
> index 3d14544..08e2acc 100644
> --- a/src/tools/msvc/vcregress.pl
> +++ b/src/tools/msvc/vcregress.pl
> @@ -548,6 +548,6 @@ sub usage
>  {
>  	print STDERR
>  	  "Usage: vcregress.pl ",
> -"<check|installcheck|plcheck|contribcheck|isolationcheck|ecpgcheck|upgradecheck> [schedule]\n";
> +"<check|installcheck|plcheck|contribcheck|modulescheck|ecpgcheck|isolationcheck|upgradecheck|bincheck|recoverycheck> [schedule]\n";
>  	exit(1);
>  }

Right, this is missing modulescheck, bincheck and recoverycheck. All 3
are actually mainly my fault, or perhaps Andrew scored once on bincheck.
Honestly, this is unreadable and it's always tiring to decrypt, so why
not change it to something more explicit like the attached? See for
yourself:

$ perl vcregress.pl
Usage: vcregress.pl <mode> [ <schedule> ]

Options for <mode>:
  bincheck       run tests of utilities in src/bin/
  check          deploy instance and run regression tests on it
  contribcheck   run tests of modules in contrib/
  ecpgcheck      run regression tests of ECPG driver
  installcheck   run regression tests on existing instance
  isolationcheck run isolation tests
  modulescheck   run tests of modules in src/test/modules
  plcheck        run tests of PL languages
  recoverycheck  run recovery test suite
  upgradecheck   run tests of pg_upgrade

Options for <schedule>:
  serial         serial mode
  parallel       parallel mode

--
Michael
Attachment
On Thu, Apr 14, 2016 at 5:48 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Thu, 14 Apr 2016 17:25:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20160414.172539.34325458.horiguchi.kyotaro@lab.ntt.co.jp> >> Hello, >> >> At Thu, 14 Apr 2016 13:24:34 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqThcdv+CrWyWbFQGYL0GJFZeWVGXs5K9x65WWgbqkJ7YQ@mail.gmail.com> >> > On Thu, Apr 14, 2016 at 11:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> > > Yes, this is what I was trying to explain to Fujii-san upthread and I have >> > > also verified that the same works on Windows. >> > >> > If you could, it would be nice as well to check that nothing breaks >> > with VS when using vcregress recoverycheck. > > IPC::Run is not installed on Active Perl on my environment and > Active state seems to be saying that IPC-Run cannot be compiled > on Windows. ppm doesn't show IPC-Run. Is there any means to do > TAP test other than this way? > > https://code.activestate.com/ppm/IPC-Run/ IPC::Run is a mandatory dependency I am afraid. You could just download it from cpan and install it manually in your PERL5LIB path. That's what I did, and it proves to work just fine. -- Michael
>
> On Thu, Apr 14, 2016 at 1:11 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Thu, 14 Apr 2016 12:42:06 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwH7F5gWfdCT71Ucix_w+8ipR1Owzv9k4VnA1fcMYyfr6w@mail.gmail.com>
> >> > Yes, this is what I was trying to explain to Fujii-san upthread and I have
> >> > also verified that the same works on Windows.
> >>
> >> Oh, okay, understood. Thanks for explaining that!
> >>
> >> > I think one point which we
> >> > should try to ensure in this patch is whether it is good to use
> >> > TopMemoryContext to allocate the memory in the check or assign function or
> >> > should we allocate some temporary context (like we do in load_tzoffsets())
> >> > to perform parsing and then delete the same at end.
> >>
> >> Seems yes if some memories are allocated by palloc and they are not
> >> free'd while parsing s_s_names.
> >>
> >> Here are another comment for the patch.
> >>
> >> -SyncRepFreeConfig(SyncRepConfigData *config)
> >> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself)
> >>
> >> SyncRepFreeConfig() was extended so that it accepts the second boolean
> >> argument. But it's always called with the second argument = false. So,
> >> I just wonder why that second argument is required.
> >>
> >> SyncRepConfigData *config =
> >> - (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
> >> + (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));
> >>
> >> Why should we use malloc instead of palloc here?
> >>
> >> *If* we use malloc, its return value must be checked.
> >
> > Because it should live irrelevant to any memory context, as guc
> > values are so. guc.c provides guc_malloc for this purpose, which
> > is a malloc having some simple error handling, so having
> > walsender_malloc would be reasonable.
> >
> > I don't think it's good to use TopMemoryContext for syncrep
> > parser. syncrep_scanner.l uses palloc. This basically causes a
> > memory leak on all postgres processes.
> >
> > It might be better if the parser works on the current memory
> > context and the caller copies the result on the malloc'ed
> > memory. But some list-creation functions using palloc..
How about doing all the parsing in a temporary context and then copying the results into TopMemoryContext? I don't think it will leak in TopMemoryContext, because the next time we try to check/assign s_s_names, it will free the previous result.
>
> Changing
> > SyncRepConfigData.members to be char** would be messier..
>
> SyncRepGetSyncStandby logic assumes deeply that the sync standby names
> are constructed as a list.
> I think that it would entail a radical change in SyncRepGetStandby
> Another idea is to prepare the some functions that allocate/free
> element of list using by malloc, free.
>
At Thu, 14 Apr 2016 21:05:40 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSWLyP5ObQz_9Y=kezi0oGeZHaCPn6FT9BYK9tB3HbiVg@mail.gmail.com>
> > IPC::Run is not installed on ActivePerl in my environment, and
> > ActiveState seems to say that IPC-Run cannot be compiled
> > on Windows. ppm doesn't show IPC-Run. Is there any means to do
> > the TAP tests other than this way?
> >
> > https://code.activestate.com/ppm/IPC-Run/
>
> IPC::Run is a mandatory dependency I am afraid. You could just
> download it from cpan and install it manually in your PERL5LIB path.
> That's what I did, and it proves to work just fine.

Hmm. I got an error that dmake was not found the first time, but I could successfully install it this time. Thank you for letting me retry.

I confirmed that fix_sync_rep_update_conf_v4.patch doesn't break anything in vcregress recoverycheck, and I will be able to recheck revised versions.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 15 Apr 2016 08:52:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1+Qsw2hLEhrEBvveKC91uZQhDce9i-4dB8VPz87Ciz+OQ@mail.gmail.com>
> On Thu, Apr 14, 2016 at 1:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Apr 14, 2016 at 1:11 PM, Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > At Thu, 14 Apr 2016 12:42:06 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwH7F5gWfdCT71Ucix_w+8ipR1Owzv9k4VnA1fcMYyfr6w@mail.gmail.com>
> > >> > Yes, this is what I was trying to explain to Fujii-san upthread and I have
> > >> > also verified that the same works on Windows.
> > >>
> > >> Oh, okay, understood. Thanks for explaining that!
> > >>
> > >> > I think one point which we
> > >> > should try to ensure in this patch is whether it is good to use
> > >> > TopMemoryContext to allocate the memory in the check or assign function or
> > >> > should we allocate some temporary context (like we do in load_tzoffsets())
> > >> > to perform parsing and then delete the same at end.
> > >>
> > >> Seems yes if some memories are allocated by palloc and they are not
> > >> free'd while parsing s_s_names.
> > >>
> > >> Here are another comment for the patch.
> > >>
> > >> -SyncRepFreeConfig(SyncRepConfigData *config)
> > >> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself)
> > >>
> > >> SyncRepFreeConfig() was extended so that it accepts the second boolean
> > >> argument. But it's always called with the second argument = false. So,
> > >> I just wonder why that second argument is required.
> > >>
> > >> SyncRepConfigData *config =
> > >> - (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
> > >> + (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));
> > >>
> > >> Why should we use malloc instead of palloc here?
> > >>
> > >> *If* we use malloc, its return value must be checked.
> > >
> > > Because it should live irrelevant to any memory context, as guc
> > > values are so. guc.c provides guc_malloc for this purpose, which
> > > is a malloc having some simple error handling, so having
> > > walsender_malloc would be reasonable.
> > >
> > > I don't think it's good to use TopMemoryContext for the syncrep
> > > parser. syncrep_scanner.l uses palloc. This basically causes a
> > > memory leak on all postgres processes.
> > >
> > > It might be better if the parser works on the current memory
> > > context and the caller copies the result into the malloc'ed
> > > memory. But some list-creation functions use palloc..
>
> How about if we do all the parsing stuff in temporary context and then copy
> the results using TopMemoryContext? I don't think it will be a leak in
> TopMemoryContext, because next time we try to check/assign s_s_names, it
> will free the previous result.

I agree with you. A temporary context for the parser seems
reasonable. TopMemoryContext is created very early in main(), so
palloc on it is effectively the same as malloc.

One problem is that only the top memory block is assumed to be
free()'d, not pfree()'d, by guc_set_extra. It makes this quite
ugly..

Maybe we shouldn't use the extra for this purpose.

Thoughts?

> > Changing
> > > SyncRepConfigData.members to be char** would be messier..
> >
> > The SyncRepGetSyncStandbys logic deeply assumes that the sync standby names
> > are constructed as a list.
> > I think that it would entail a radical change in SyncRepGetStandby.
> > Another idea is to prepare some functions that allocate/free
> > list elements using malloc and free.
>
> Yeah, that could be another way of doing it, but seems like much more work.
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c index 3c9142e..3778c94 100644 --- a/src/backend/replication/syncrep.c +++ b/src/backend/replication/syncrep.c @@ -68,6 +68,7 @@#include "storage/proc.h"#include "tcop/tcopprot.h"#include "utils/builtins.h" +#include "utils/memutils.h"#include "utils/ps_status.h"/* User-settable parameters for sync rep */ @@ -361,11 +362,6 @@ SyncRepInitConfig(void){ int priority; - /* Update the config data of synchronous replication */ - SyncRepFreeConfig(SyncRepConfig); - SyncRepConfig = NULL; - SyncRepUpdateConfig(); - /* * Determine if we are a potential sync standby and remember the result * for handling replies from standby. @@ -868,47 +864,61 @@ SyncRepUpdateSyncStandbysDefined(void)}/* - * Parse synchronous_standby_names and update the config data - * of synchronous standbys. + * Free a previously-allocated config data of synchronous replication. */void -SyncRepUpdateConfig(void) +SyncRepFreeConfig(SyncRepConfigData *config, bool itself, MemoryContext cxt){ - int parse_rc; + MemoryContext oldcxt = NULL; - if (!SyncStandbysDefined()) + if (!config) return; - /* - * check_synchronous_standby_names() verifies the setting value of - * synchronous_standby_names before this function is called. So - * syncrep_yyparse() must not cause an error here. - */ - syncrep_scanner_init(SyncRepStandbyNames); - parse_rc = syncrep_yyparse(); - syncrep_scanner_finish(); - - if (parse_rc != 0) - ereport(ERROR, - (errcode(ERRCODE_SYNTAX_ERROR), - errmsg_internal("synchronous_standby_names parser returned %d", - parse_rc))); - - SyncRepConfig = syncrep_parse_result; - syncrep_parse_result = NULL; + if (cxt) + oldcxt = MemoryContextSwitchTo(cxt); + list_free_deep(config->members); + + if(oldcxt) + MemoryContextSwitchTo(oldcxt); + + if (itself) + free(config);}/* - * Free a previously-allocated config data of synchronous replication. 
+ * Returns a copy of a replication config data in the specified memory + * context. Note that only the top block should be malloc'ed, because it is + * assumed to be freed by set_ */ -void -SyncRepFreeConfig(SyncRepConfigData *config) +SyncRepConfigData * +SyncRepCopyConfig(SyncRepConfigData *oldconfig, MemoryContext targetcxt){ - if (!config) - return; + MemoryContext oldcxt; + SyncRepConfigData *newconfig; + ListCell *lc; - list_free_deep(config->members); - pfree(config); + if (!oldconfig) + return NULL; + + oldcxt = MemoryContextSwitchTo(targetcxt); + + newconfig = (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData)); + newconfig->num_sync = oldconfig->num_sync; + newconfig->members = list_copy(oldconfig->members); + + /* + * The new members list is a combination of list cells on new context and + * data pointed from each cell on the old context. So we explicitly copy + * the data. + */ + foreach (lc, newconfig->members) + { + lfirst(lc) = pstrdup((char *) lfirst(lc)); + } + + MemoryContextSwitchTo(oldcxt); + + return newconfig;}#ifdef USE_ASSERT_CHECKING @@ -959,12 +969,32 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source) if (*newval != NULL&& (*newval)[0] != '\0') { + MemoryContext oldcxt; + MemoryContext repparse_cxt; + + /* + * The result of syncrep_yyparse should live for the lifetime of the + * process and syncrep_yyparse may abandon a certain amount of + * palloc'ed memory * blocks. So we provide a temporary memory context + * for the playground of syncrep_yyparse and copy the result to + * TopMmeoryContext. 
+ */ + repparse_cxt = AllocSetContextCreate(CurrentMemoryContext, + "syncrep parser", + ALLOCSET_DEFAULT_MINSIZE, + ALLOCSET_DEFAULT_INITSIZE, + ALLOCSET_DEFAULT_MAXSIZE); + oldcxt = MemoryContextSwitchTo(repparse_cxt); + syncrep_scanner_init(*newval); parse_rc = syncrep_yyparse(); syncrep_scanner_finish(); + MemoryContextSwitchTo(oldcxt); + if (parse_rc != 0) { + MemoryContextDelete(repparse_cxt); GUC_check_errcode(ERRCODE_SYNTAX_ERROR); GUC_check_errdetail("synchronous_standby_namesparser returned %d", parse_rc); @@ -1017,17 +1047,38 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source) } /* - * syncrep_yyparse sets the global syncrep_parse_result as side effect. - * But this function is required to just check, so frees it - * after parsing the parameter. + * syncrep_yyparse sets the global syncrep_parse_result. + * Save syncrep_parse_result to extra in order to use it in + * assign_synchronous_standby_names hook as well. */ - SyncRepFreeConfig(syncrep_parse_result); + *extra = (void *)SyncRepCopyConfig(syncrep_parse_result, + TopMemoryContext); + MemoryContextDelete(repparse_cxt); } return true;}void +assign_synchronous_standby_names(const char *newval, void *extra) +{ + SyncRepConfigData *myextra = extra; + + /* + * Free members of SyncRepConfig if it already refers somewhere, but + * SyncRepConfig itself is freed by set_extra_field. The content of + * SyncRepConfit is on TopMemoryContext. See + * check_synchronous_standby_names. 
+ */ + if (SyncRepConfig) + SyncRepFreeConfig(SyncRepConfig, false, TopMemoryContext); + + SyncRepConfig = myextra; + + return; +} + +voidassign_synchronous_commit(int newval, void *extra){ switch (newval) diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 81d3d28..20d23d5 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -2780,23 +2780,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS) MemoryContextSwitchTo(oldcontext); /* - * Allocate and update the config data of synchronous replication, - * and then get the currently active synchronous standbys. + * Get the currently active synchronous standbys. */ - SyncRepUpdateConfig(); LWLockAcquire(SyncRepLock, LW_SHARED); sync_standbys = SyncRepGetSyncStandbys(NULL); LWLockRelease(SyncRepLock); - /* - * Free the previously-allocated config data because a backend - * no longer needs it. The next call of this function needs to - * allocate and update the config data newly because the setting - * of sync replication might be changed between the calls. 
- */ - SyncRepFreeConfig(SyncRepConfig); - SyncRepConfig = NULL; - for (i = 0; i < max_wal_senders; i++) { WalSnd *walsnd = &WalSndCtl->walsnds[i]; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index fb091bc..3ce83bf 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3484,7 +3484,7 @@ static struct config_string ConfigureNamesString[] = }, &SyncRepStandbyNames, "", - check_synchronous_standby_names, NULL, NULL + check_synchronous_standby_names, assign_synchronous_standby_names, NULL }, { diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h index 14b5664..97368f8 100644 --- a/src/include/replication/syncrep.h +++ b/src/include/replication/syncrep.h @@ -59,13 +59,16 @@ extern void SyncRepReleaseWaiters(void);/* called by wal sender and user backend */extern List *SyncRepGetSyncStandbys(bool*am_sync); -extern void SyncRepUpdateConfig(void); -extern void SyncRepFreeConfig(SyncRepConfigData *config); +extern void SyncRepFreeConfig(SyncRepConfigData *config, bool itself, + MemoryContext targetcxt); +extern SyncRepConfigData *SyncRepCopyConfig(SyncRepConfigData *oldconfig, + MemoryContext targetcxt);/* called by checkpointer */extern void SyncRepUpdateSyncStandbysDefined(void);externbool check_synchronous_standby_names(char **newval, void **extra, GucSourcesource); +extern void assign_synchronous_standby_names(const char *newval, void *extra);extern void assign_synchronous_commit(intnewval, void *extra);/*
On Fri, Apr 15, 2016 at 3:00 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Fri, 15 Apr 2016 08:52:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1+Qsw2hLEhrEBvveKC91uZQhDce9i-4dB8VPz87Ciz+OQ@mail.gmail.com>
>> On Thu, Apr 14, 2016 at 1:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >
>> > On Thu, Apr 14, 2016 at 1:11 PM, Kyotaro HORIGUCHI
>> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > > At Thu, 14 Apr 2016 12:42:06 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwH7F5gWfdCT71Ucix_w+8ipR1Owzv9k4VnA1fcMYyfr6w@mail.gmail.com>
>> > >> > Yes, this is what I was trying to explain to Fujii-san upthread and I have
>> > >> > also verified that the same works on Windows.
>> > >>
>> > >> Oh, okay, understood. Thanks for explaining that!
>> > >>
>> > >> > I think one point which we
>> > >> > should try to ensure in this patch is whether it is good to use
>> > >> > TopMemoryContext to allocate the memory in the check or assign function or
>> > >> > should we allocate some temporary context (like we do in load_tzoffsets())
>> > >> > to perform parsing and then delete the same at end.
>> > >>
>> > >> Seems yes if some memories are allocated by palloc and they are not
>> > >> free'd while parsing s_s_names.
>> > >>
>> > >> Here are another comment for the patch.
>> > >>
>> > >> -SyncRepFreeConfig(SyncRepConfigData *config)
>> > >> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself)
>> > >>
>> > >> SyncRepFreeConfig() was extended so that it accepts the second boolean
>> > >> argument. But it's always called with the second argument = false. So,
>> > >> I just wonder why that second argument is required.
>> > >>
>> > >> SyncRepConfigData *config =
>> > >> - (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
>> > >> + (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));
>> > >>
>> > >> Why should we use malloc instead of palloc here?
>> > >>
>> > >> *If* we use malloc, its return value must be checked.
>> > >
>> > > Because it should live irrelevant to any memory context, as guc
>> > > values are so. guc.c provides guc_malloc for this purpose, which
>> > > is a malloc having some simple error handling, so having
>> > > walsender_malloc would be reasonable.
>> > >
>> > > I don't think it's good to use TopMemoryContext for syncrep
>> > > parser. syncrep_scanner.l uses palloc. This basically causes a
>> > > memory leak on all postgres processes.
>> > >
>> > > It might be better if the parser works on the current memory
>> > > context and the caller copies the result on the malloc'ed
>> > > memory. But some list-creation functions using palloc..
>>
>> How about if we do all the parsing stuff in temporary context and then copy
>> the results using TopMemoryContext? I don't think it will be a leak in
>> TopMemoryContext, because next time we try to check/assign s_s_names, it
>> will free the previous result.
>
> I agree with you. A temporary context for the parser seems
> reasonable. TopMemoryContext is created very early in main() so
> palloc on it is effectively the same with malloc.
>
> One problem is that only the top memory block is assumed to be
> free()'d, not pfree()'d by guc_set_extra. It makes this quite
> ugly..
>
> Maybe we shouldn't use the extra for this purpose.
>
> Thoughts?

How about if the check_hook just parses the parameter in CurrentMemoryContext (i.e., a T_AllocSetContext), and the assign_hook then copies syncrep_parse_result to TopMemoryContext? Because syncrep_parse_result is a global variable, both hooks can see it.

Here are some comments.

-SyncRepUpdateConfig(void)
+SyncRepFreeConfig(SyncRepConfigData *config, bool itself, MemoryContext cxt)

Sorry, it's my bad. The itself variable is no longer needed because SyncRepFreeConfig is now called from only one function.

-void
-SyncRepFreeConfig(SyncRepConfigData *config)
+SyncRepConfigData *
+SyncRepCopyConfig(SyncRepConfigData *oldconfig, MemoryContext targetcxt)

I'm not sure the targetcxt argument is necessary.

Regards,

--
Masahiko Sawada
>
> At Fri, 15 Apr 2016 08:52:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote :
> >
> > How about if we do all the parsing stuff in temporary context and then copy
> > the results using TopMemoryContext? I don't think it will be a leak in
> > TopMemoryContext, because next time we try to check/assign s_s_names, it
> > will free the previous result.
>
> I agree with you. A temporary context for the parser seems
> reasonable. TopMemoryContext is created very early in main() so
> palloc on it is effectively the same with malloc.
>
> One problem is that only the top memory block is assumed to be
> free()'d, not pfree()'d by guc_set_extra. It makes this quite
> ugly..
>
+ newconfig = (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));
+SyncRepFreeConfig(SyncRepConfigData *config, bool itself, MemoryContext cxt)
Do we really need 'bool itself' parameter in above function?
+ if (cxt)
+ oldcxt = MemoryContextSwitchTo(cxt);
+ list_free_deep(config->members);
+
+ if(oldcxt)
+ MemoryContextSwitchTo(oldcxt);
At Sat, 16 Apr 2016 12:50:30 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1LzC=6-EEVuCZhoYnKDHSqKUptV6F+5SavSR5P6jHdfXw@mail.gmail.com>
> On Fri, Apr 15, 2016 at 11:30 AM, Kyotaro HORIGUCHI <
> horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >
> > At Fri, 15 Apr 2016 08:52:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > How about if we do all the parsing stuff in temporary context and then copy
> > > the results using TopMemoryContext? I don't think it will be a leak in
> > > TopMemoryContext, because next time we try to check/assign s_s_names, it
> > > will free the previous result.
> >
> > I agree with you. A temporary context for the parser seems
> > reasonable. TopMemoryContext is created very early in main() so
> > palloc on it is effectively the same with malloc.
> >
> > One problem is that only the top memory block is assumed to be
> > free()'d, not pfree()'d by guc_set_extra. It makes this quite
> > ugly..
> >
> > + newconfig = (SyncRepConfigData *) malloc(sizeof(SyncRepConfigData));
>
> Is there a reason to use malloc here, can't we use palloc directly?

The reason is that the memory block is to be released using free() in guc_extra_field (not guc_set_extra). Even if we allocate and deallocate it using palloc/pfree, the 'extra' pointer to the block in gconf cannot be NULLed there, so guc_extra_field would try to free it again using free(), and then bang.

> Also
> for both the functions SyncRepCopyConfig() and SyncRepFreeConfig(), if we
> directly use TopMemoryContext inside the function (if required) rather than
> taking it as argument, then it will simplify the code a lot.

Either is fine. I placed the parameter in order to emphasize where the memory block lives (somewhere other than the current memory context or the bare heap), rather than for any practical reason.

> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself, MemoryContext cxt)
>
> Do we really need 'bool itself' parameter in above function?
>
> > + if (cxt)
> > + oldcxt = MemoryContextSwitchTo(cxt);
> > + list_free_deep(config->members);
> > +
> > + if(oldcxt)
> > + MemoryContextSwitchTo(oldcxt);
>
> Why do you need MemoryContextSwitchTo for freeing members?

Ah, sorry. It's just a slip of my fingers.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 15 Apr 2016 17:36:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCOL6BCC+FWNCZH_XPgtWc_otnvShMx6_uAcU7Bwb16Rw@mail.gmail.com>
> >> How about if we do all the parsing stuff in temporary context and then copy
> >> the results using TopMemoryContext? I don't think it will be a leak in
> >> TopMemoryContext, because next time we try to check/assign s_s_names, it
> >> will free the previous result.
> >
> > I agree with you. A temporary context for the parser seems
> > reasonable. TopMemoryContext is created very early in main() so
> > palloc on it is effectively the same with malloc.
> > One problem is that only the top memory block is assumed to be
> > free()'d, not pfree()'d by guc_set_extra. It makes this quite
> > ugly..
> >
> > Maybe we shouldn't use the extra for this purpose.
> >
> > Thoughts?
>
> How about if check_hook just parses parameter in
> CurrentMemoryContext(i.g., T_AllocSetContext), and then the
> assign_hook copies syncrep_parse_result to TopMemoryContext.
> Because syncrep_parse_result is a global variable, these hooks can see it.

Hmm. Somewhat uneasy, but it should work. The attached patch does it.

> Here are some comments.
>
> -SyncRepUpdateConfig(void)
> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself, MemoryContext cxt)
>
> Sorry, it's my bad. itself variables is no longer needed because
> SyncRepFreeConfig is called by only one function.
>
> -void
> -SyncRepFreeConfig(SyncRepConfigData *config)
> +SyncRepConfigData *
> +SyncRepCopyConfig(SyncRepConfigData *oldconfig, MemoryContext targetcxt)
>
> I'm not sure targetcxt argument is necessary.

Yes, these are just for signalling, so removing them does no harm.
regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c index 3c9142e..3d68fb5 100644 --- a/src/backend/replication/syncrep.c +++ b/src/backend/replication/syncrep.c @@ -68,6 +68,7 @@#include "storage/proc.h"#include "tcop/tcopprot.h"#include "utils/builtins.h" +#include "utils/memutils.h"#include "utils/ps_status.h"/* User-settable parameters for sync rep */ @@ -361,11 +362,6 @@ SyncRepInitConfig(void){ int priority; - /* Update the config data of synchronous replication */ - SyncRepFreeConfig(SyncRepConfig); - SyncRepConfig = NULL; - SyncRepUpdateConfig(); - /* * Determine if we are a potential sync standby and remember the result * for handling replies from standby. @@ -868,47 +864,50 @@ SyncRepUpdateSyncStandbysDefined(void)}/* - * Parse synchronous_standby_names and update the config data - * of synchronous standbys. + * Free a previously-allocated config data of synchronous replication. */void -SyncRepUpdateConfig(void) +SyncRepFreeConfig(SyncRepConfigData *config){ - int parse_rc; - - if (!SyncStandbysDefined()) + if (!config) return; - /* - * check_synchronous_standby_names() verifies the setting value of - * synchronous_standby_names before this function is called. So - * syncrep_yyparse() must not cause an error here. - */ - syncrep_scanner_init(SyncRepStandbyNames); - parse_rc = syncrep_yyparse(); - syncrep_scanner_finish(); - - if (parse_rc != 0) - ereport(ERROR, - (errcode(ERRCODE_SYNTAX_ERROR), - errmsg_internal("synchronous_standby_names parser returned %d", - parse_rc))); - - SyncRepConfig = syncrep_parse_result; - syncrep_parse_result = NULL; + list_free_deep(config->members); + pfree(config);}/* - * Free a previously-allocated config data of synchronous replication. + * Returns a copy of a replication config data into the TopMemoryContext. 
*/ -void -SyncRepFreeConfig(SyncRepConfigData *config) +SyncRepConfigData * +SyncRepCopyConfig(SyncRepConfigData *oldconfig){ - if (!config) - return; + MemoryContext oldcxt; + SyncRepConfigData *newconfig; + ListCell *lc; - list_free_deep(config->members); - pfree(config); + if (!oldconfig) + return NULL; + + oldcxt = MemoryContextSwitchTo(TopMemoryContext); + + newconfig = (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData)); + newconfig->num_sync = oldconfig->num_sync; + newconfig->members = list_copy(oldconfig->members); + + /* + * The new members list is a combination of list cells on the new context + * and data pointed from each cell on the old context. So we explicitly + * copy the data. + */ + foreach (lc, newconfig->members) + { + lfirst(lc) = pstrdup((char *) lfirst(lc)); + } + + MemoryContextSwitchTo(oldcxt); + + return newconfig;}#ifdef USE_ASSERT_CHECKING @@ -957,6 +956,8 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source){ int parse_rc; + Assert(syncrep_parse_result == NULL); + if (*newval != NULL && (*newval)[0] != '\0') { syncrep_scanner_init(*newval); @@ -965,6 +966,7 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source) if (parse_rc !=0) { + syncrep_parse_result = NULL; GUC_check_errcode(ERRCODE_SYNTAX_ERROR); GUC_check_errdetail("synchronous_standby_namesparser returned %d", parse_rc); @@ -1017,17 +1019,39 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source) } /* - * syncrep_yyparse sets the global syncrep_parse_result as side effect. - * But this function is required to just check, so frees it - * after parsing the parameter. + * We leave syncrep_parse_result for the use in + * assign_synchronous_standby_names. 
*/ - SyncRepFreeConfig(syncrep_parse_result); } return true;}void +assign_synchronous_standby_names(const char *newval, void *extra) +{ + /* Free the old SyncRepConfig if exists */ + if (SyncRepConfig) + SyncRepFreeConfig(SyncRepConfig); + + SyncRepConfig = NULL; + + /* Copy the parsed config into TopMemoryContext if exists */ + if (syncrep_parse_result) + { + SyncRepConfig = SyncRepCopyConfig(syncrep_parse_result); + + /* + * this memory block will be freed as a part of the memory contxt for + * config file processing. + */ + syncrep_parse_result = NULL; + } + + return; +} + +voidassign_synchronous_commit(int newval, void *extra){ switch (newval) diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 81d3d28..20d23d5 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -2780,23 +2780,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS) MemoryContextSwitchTo(oldcontext); /* - * Allocate and update the config data of synchronous replication, - * and then get the currently active synchronous standbys. + * Get the currently active synchronous standbys. */ - SyncRepUpdateConfig(); LWLockAcquire(SyncRepLock, LW_SHARED); sync_standbys = SyncRepGetSyncStandbys(NULL); LWLockRelease(SyncRepLock); - /* - * Free the previously-allocated config data because a backend - * no longer needs it. The next call of this function needs to - * allocate and update the config data newly because the setting - * of sync replication might be changed between the calls. 
- */ - SyncRepFreeConfig(SyncRepConfig); - SyncRepConfig = NULL; - for (i = 0; i < max_wal_senders; i++) { WalSnd *walsnd = &WalSndCtl->walsnds[i]; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index fb091bc..3ce83bf 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3484,7 +3484,7 @@ static struct config_string ConfigureNamesString[] = }, &SyncRepStandbyNames, "", - check_synchronous_standby_names, NULL, NULL + check_synchronous_standby_names, assign_synchronous_standby_names, NULL }, { diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h index 14b5664..9a1eb2f 100644 --- a/src/include/replication/syncrep.h +++ b/src/include/replication/syncrep.h @@ -59,13 +59,14 @@ extern void SyncRepReleaseWaiters(void);/* called by wal sender and user backend */extern List *SyncRepGetSyncStandbys(bool*am_sync); -extern void SyncRepUpdateConfig(void);extern void SyncRepFreeConfig(SyncRepConfigData *config); +extern SyncRepConfigData *SyncRepCopyConfig(SyncRepConfigData *oldconfig);/* called by checkpointer */extern void SyncRepUpdateSyncStandbysDefined(void);externbool check_synchronous_standby_names(char **newval, void **extra, GucSourcesource); +extern void assign_synchronous_standby_names(const char *newval, void *extra);extern void assign_synchronous_commit(intnewval, void *extra);/*
On Mon, Apr 18, 2016 at 2:15 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Fri, 15 Apr 2016 17:36:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCOL6BCC+FWNCZH_XPgtWc_otnvShMx6_uAcU7Bwb16Rw@mail.gmail.com>
>> >> How about if we do all the parsing stuff in temporary context and then copy
>> >> the results using TopMemoryContext? I don't think it will be a leak in
>> >> TopMemoryContext, because next time we try to check/assign s_s_names, it
>> >> will free the previous result.
>> >
>> > I agree with you. A temporary context for the parser seems
>> > reasonable. TopMemoryContext is created very early in main() so
>> > palloc on it is effectively the same with malloc.
>> > One problem is that only the top memory block is assumed to be
>> > free()'d, not pfree()'d by guc_set_extra. It makes this quite
>> > ugly..
>> >
>> > Maybe we shouldn't use the extra for this purpose.
>> >
>> > Thoughts?
>>
>> How about if check_hook just parses parameter in
>> CurrentMemoryContext(i.g., T_AllocSetContext), and then the
>> assign_hook copies syncrep_parse_result to TopMemoryContext.
>> Because syncrep_parse_result is a global variable, these hooks can see it.
>
> Hmm. Somewhat uneasy but should work. The attached patch does it.
>
>> Here are some comments.
>>
>> -SyncRepUpdateConfig(void)
>> +SyncRepFreeConfig(SyncRepConfigData *config, bool itself, MemoryContext cxt)
>>
>> Sorry, it's my bad. itself variables is no longer needed because
>> SyncRepFreeConfig is called by only one function.
>>
>> -void
>> -SyncRepFreeConfig(SyncRepConfigData *config)
>> +SyncRepConfigData *
>> +SyncRepCopyConfig(SyncRepConfigData *oldconfig, MemoryContext targetcxt)
>>
>> I'm not sure targetcxt argument is necessary.
>
> Yes, these are just for signalling so removal doesn't harm.

Thank you for updating the patch. Here are some comments.

+ Assert(syncrep_parse_result == NULL);
+

Why do we need the Assert at this point? It's possible that syncrep_parse_result is not NULL after setting s_s_names by ALTER SYSTEM.

+ /*
+ * this memory block will be freed as a part of the memory contxt for
+ * config file processing.
+ */

s/contxt/context/

Regards,

--
Masahiko Sawada
At Wed, 20 Apr 2016 11:51:09 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoC5rrWSk-V79xjVfYr2UqQYrrCKsXkSxZrN9p5YAaeKJA@mail.gmail.com>
> On Mon, Apr 18, 2016 at 2:15 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Fri, 15 Apr 2016 17:36:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCOL6BCC+FWNCZH_XPgtWc_otnvShMx6_uAcU7Bwb16Rw@mail.gmail.com>
> >> How about if check_hook just parses parameter in
> >> CurrentMemoryContext(i.g., T_AllocSetContext), and then the
> >> assign_hook copies syncrep_parse_result to TopMemoryContext.
> >> Because syncrep_parse_result is a global variable, these hooks can see it.
> >
> > Hmm. Somewhat uneasy but should work. The attached patch does it.
..
> Thank you for updating the patch.
>
> Here are some comments.
>
> + Assert(syncrep_parse_result == NULL);
> +
>
> Why do we need Assert at this point?
> It's possible that syncrep_parse_result is not NULL after setting
> s_s_names by ALTER SYSTEM.

Thank you for pointing it out. It is just a trace of an assumption that is no longer useful.

> + /*
> + * this memory block will be freed as a part of the memory contxt for
> + * config file processing.
> + */
>
> s/contxt/context/

Thanks. I removed the whole comment and the corresponding code, since they were meaningless.

assign_s_s_names would cause a SEGV if it were called without check_s_s_names having been called first. I think that cannot happen for this variable, because it cannot be reset in the middle of a session. It still makes me uneasy, but I don't see a proper means to reset syncrep_parse_result. A MemoryContext deletion hook would work, but it seems to be overkill for this single use.
regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 3c9142e..bdd6de0 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -68,6 +68,7 @@
 #include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
+#include "utils/memutils.h"
 #include "utils/ps_status.h"

 /* User-settable parameters for sync rep */
@@ -361,11 +362,6 @@ SyncRepInitConfig(void)
 {
     int         priority;

-    /* Update the config data of synchronous replication */
-    SyncRepFreeConfig(SyncRepConfig);
-    SyncRepConfig = NULL;
-    SyncRepUpdateConfig();
-
     /*
      * Determine if we are a potential sync standby and remember the result
      * for handling replies from standby.
@@ -868,47 +864,50 @@ SyncRepUpdateSyncStandbysDefined(void)
 }

 /*
- * Parse synchronous_standby_names and update the config data
- * of synchronous standbys.
+ * Free a previously-allocated config data of synchronous replication.
  */
 void
-SyncRepUpdateConfig(void)
+SyncRepFreeConfig(SyncRepConfigData *config)
 {
-    int         parse_rc;
-
-    if (!SyncStandbysDefined())
+    if (!config)
         return;

-    /*
-     * check_synchronous_standby_names() verifies the setting value of
-     * synchronous_standby_names before this function is called. So
-     * syncrep_yyparse() must not cause an error here.
-     */
-    syncrep_scanner_init(SyncRepStandbyNames);
-    parse_rc = syncrep_yyparse();
-    syncrep_scanner_finish();
-
-    if (parse_rc != 0)
-        ereport(ERROR,
-                (errcode(ERRCODE_SYNTAX_ERROR),
-                 errmsg_internal("synchronous_standby_names parser returned %d",
-                                 parse_rc)));
-
-    SyncRepConfig = syncrep_parse_result;
-    syncrep_parse_result = NULL;
+    list_free_deep(config->members);
+    pfree(config);
 }

 /*
- * Free a previously-allocated config data of synchronous replication.
+ * Returns a copy of a replication config data into the TopMemoryContext.
  */
-void
-SyncRepFreeConfig(SyncRepConfigData *config)
+SyncRepConfigData *
+SyncRepCopyConfig(SyncRepConfigData *oldconfig)
 {
-    if (!config)
-        return;
+    MemoryContext oldcxt;
+    SyncRepConfigData *newconfig;
+    ListCell   *lc;

-    list_free_deep(config->members);
-    pfree(config);
+    if (!oldconfig)
+        return NULL;
+
+    oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+
+    newconfig = (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
+    newconfig->num_sync = oldconfig->num_sync;
+    newconfig->members = list_copy(oldconfig->members);
+
+    /*
+     * The new members list is a combination of list cells on the new context
+     * and data pointed from each cell on the old context. So we explicitly
+     * copy the data.
+     */
+    foreach (lc, newconfig->members)
+    {
+        lfirst(lc) = pstrdup((char *) lfirst(lc));
+    }
+
+    MemoryContextSwitchTo(oldcxt);
+
+    return newconfig;
 }

 #ifdef USE_ASSERT_CHECKING
@@ -952,13 +951,30 @@ SyncRepQueueIsOrderedByLSN(int mode)
  * ===========================================================
  */

+/*
+ * check_synchronous_standby_names and assign_synchronous_standby_names are to
+ * be used from guc.c. The former generates a result pointed by
+ * syncrep_parse_result in the current memory context as the side effect and
+ * the latter reads it. This won't be a problem as long as the guc variable
+ * synchronous_standby_names cannot be set during a session.
+ */
+
 bool
 check_synchronous_standby_names(char **newval, void **extra, GucSource source)
 {
     int         parse_rc;

+    syncrep_parse_result = NULL;
+
     if (*newval != NULL && (*newval)[0] != '\0')
     {
+        /*
+         * syncrep_yyparse generates a result on the current memory context as
+         * the side effect and points it using the global
+         * syncrep_prase_result. We don't clear the pointer even after the
+         * result is invalidated by discarding the context so make sure not to
+         * use it after invalidation.
+         */
         syncrep_scanner_init(*newval);
         parse_rc = syncrep_yyparse();
         syncrep_scanner_finish();
@@ -1015,19 +1031,28 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source)
                             syncrep_parse_result->num_sync,
                             list_length(syncrep_parse_result->members)),
                      errhint("Specify more names of potential synchronous standbys in synchronous_standby_names.")));
         }
-
-        /*
-         * syncrep_yyparse sets the global syncrep_parse_result as side effect.
-         * But this function is required to just check, so frees it
-         * after parsing the parameter.
-         */
-        SyncRepFreeConfig(syncrep_parse_result);
     }

     return true;
 }

 void
+assign_synchronous_standby_names(const char *newval, void *extra)
+{
+    /* Free the old SyncRepConfig if exists */
+    if (SyncRepConfig)
+        SyncRepFreeConfig(SyncRepConfig);
+
+    /* Copy the parsed config into TopMemoryContext if exists */
+    if (syncrep_parse_result)
+        SyncRepConfig = SyncRepCopyConfig(syncrep_parse_result);
+    else
+        SyncRepConfig = NULL;
+
+    return;
+}
+
+void
 assign_synchronous_commit(int newval, void *extra)
 {
     switch (newval)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 81d3d28..20d23d5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2780,23 +2780,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
         MemoryContextSwitchTo(oldcontext);

         /*
-         * Allocate and update the config data of synchronous replication,
-         * and then get the currently active synchronous standbys.
+         * Get the currently active synchronous standbys.
          */
-        SyncRepUpdateConfig();
         LWLockAcquire(SyncRepLock, LW_SHARED);
         sync_standbys = SyncRepGetSyncStandbys(NULL);
         LWLockRelease(SyncRepLock);

-        /*
-         * Free the previously-allocated config data because a backend
-         * no longer needs it. The next call of this function needs to
-         * allocate and update the config data newly because the setting
-         * of sync replication might be changed between the calls.
-         */
-        SyncRepFreeConfig(SyncRepConfig);
-        SyncRepConfig = NULL;
-
         for (i = 0; i < max_wal_senders; i++)
         {
             WalSnd     *walsnd = &WalSndCtl->walsnds[i];
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fb091bc..3ce83bf 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3484,7 +3484,7 @@ static struct config_string ConfigureNamesString[] =
         },
         &SyncRepStandbyNames,
         "",
-        check_synchronous_standby_names, NULL, NULL
+        check_synchronous_standby_names, assign_synchronous_standby_names, NULL
     },

     {
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 14b5664..9a1eb2f 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -59,13 +59,14 @@ extern void SyncRepReleaseWaiters(void);

 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
-extern void SyncRepUpdateConfig(void);
 extern void SyncRepFreeConfig(SyncRepConfigData *config);
+extern SyncRepConfigData *SyncRepCopyConfig(SyncRepConfigData *oldconfig);

 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);

 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);

 /*
>
>
> assign_s_s_names causes SEGV when it is called without calling
> check_s_s_names. I think that's not the case for this variable
> because it is unresettable amid a session. It is very uneasy for
> me but I don't see a proper means to reset
> syncrep_parse_result.
+ /* Copy the parsed config into TopMemoryContext if exists */
+ if (syncrep_parse_result)
+ SyncRepConfig = SyncRepCopyConfig(syncrep_parse_result);
Could you please explain how to trigger the scenario where you have seen SEGV?
On Sat, Apr 23, 2016 at 7:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Apr 20, 2016 at 12:46 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> >> >> assign_s_s_names causes SEGV when it is called without calling >> check_s_s_names. I think that's not the case for this varialbe >> because it is unresettable amid a session. It is very uneasy for >> me but I don't see a proper means to reset >> syncrep_parse_result. >> > > Is it because syncrep_parse_result is not freed after creating a copy of it > in assign_synchronous_standby_names()? If it so, then I think we need to > call SyncRepFreeConfig(syncrep_parse_result); in > assign_synchronous_standby_names at below place: > > + /* Copy the parsed config into TopMemoryContext if exists */ > > + if (syncrep_parse_result) > > + SyncRepConfig = SyncRepCopyConfig(syncrep_parse_result); > > Could you please explain how to trigger the scenario where you have seen > SEGV? Seeing this discussion moving on, I am wondering if we should not discuss those improvements for 9.7. We are getting close to beta 1, and this is clearly not a bug, and it's not like HEAD is broken. So I think that we should not take the risk to make the code unstable at this stage. -- Michael
>
> On Sat, Apr 23, 2016 at 7:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Apr 20, 2016 at 12:46 PM, Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >>
> >>
> >> assign_s_s_names causes SEGV when it is called without calling
> >> check_s_s_names. I think that's not the case for this variable
> >> because it is unresettable amid a session. It is very uneasy for
> >> me but I don't see a proper means to reset
> >> syncrep_parse_result.
> >>
> >
> > Is it because syncrep_parse_result is not freed after creating a copy of it
> > in assign_synchronous_standby_names()? If so, then I think we need to
> > call SyncRepFreeConfig(syncrep_parse_result); in
> > assign_synchronous_standby_names at the place below:
> >
> > + /* Copy the parsed config into TopMemoryContext if exists */
> >
> > + if (syncrep_parse_result)
> >
> > + SyncRepConfig = SyncRepCopyConfig(syncrep_parse_result);
> >
> > Could you please explain how to trigger the scenario where you have seen
> > SEGV?
>
> Seeing this discussion moving on, I am wondering if we should not
> postpone those improvements to 9.7.
>
Amit Kapila <amit.kapila16@gmail.com> writes:
> The main point for this improvement is that the handling for guc s_s_names
> is not similar to what we do for other somewhat similar guc's and which
> causes in-efficiency in non-hot code path (less used code).

This is not about efficiency, this is about correctness. The proposed v7
patch is flat out not acceptable, not now and not for 9.7 either, because
it introduces a GUC assign hook that can easily fail (eg, through
out-of-memory for the copy step). Assign hook functions need to be
incapable of failure. I do not see any good reason why this one cannot
satisfy that requirement, either. It just needs to make use of the
"extra" mechanism to pass back an already-suitably-long-lived result from
check_synchronous_standby_names. See check_timezone_abbreviations/
assign_timezone_abbreviations for a model to follow. You are going to
need to find a way to package the parse result into a single malloc'd
blob, though, because that's as much as guc.c can keep track of for an
"extra" value.

regards, tom lane
At Sat, 23 Apr 2016 10:12:03 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <476.1461420723@sss.pgh.pa.us>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > The main point for this improvement is that the handling for guc s_s_names
> > is not similar to what we do for other somewhat similar guc's and which
> > causes in-efficiency in non-hot code path (less used code).
>
> This is not about efficiency, this is about correctness. The proposed
> v7 patch is flat out not acceptable, not now and not for 9.7 either,
> because it introduces a GUC assign hook that can easily fail (eg, through
> out-of-memory for the copy step). Assign hook functions need to be
> incapable of failure. I do not see any good reason why this one cannot
> satisfy that requirement, either. It just needs to make use of the
> "extra" mechanism to pass back an already-suitably-long-lived result from
> check_synchronous_standby_names. See check_timezone_abbreviations/
> assign_timezone_abbreviations for a model to follow.

I had already looked there before v7 and had the same feeling in mind, but packing the result into a blob requires something other than a List to hold the name list (it should just be an array), which in turn requires many changes wherever the list is accessed. Still, the current state is hopeless, as you mentioned :(

> You are going to
> need to find a way to package the parse result into a single malloc'd
> blob, though, because that's as much as guc.c can keep track of for an
> "extra" value.

Ok, I'll post the v8 with the blob solution soon.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello, attached is the new version v8.

At Tue, 26 Apr 2016 11:02:25 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160426.110225.35506931.horiguchi.kyotaro@lab.ntt.co.jp>
> At Sat, 23 Apr 2016 10:12:03 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <476.1461420723@sss.pgh.pa.us>
> > Amit Kapila <amit.kapila16@gmail.com> writes:
> > > The main point for this improvement is that the handling for guc s_s_names
> > > is not similar to what we do for other somewhat similar guc's and which
> > > causes in-efficiency in non-hot code path (less used code).
> >
> > This is not about efficiency, this is about correctness. The proposed
> > v7 patch is flat out not acceptable, not now and not for 9.7 either,
> > because it introduces a GUC assign hook that can easily fail (eg, through
> > out-of-memory for the copy step). Assign hook functions need to be
> > incapable of failure. I do not see any good reason why this one cannot
> > satisfy that requirement, either. It just needs to make use of the
> > "extra" mechanism to pass back an already-suitably-long-lived result from
> > check_synchronous_standby_names. See check_timezone_abbreviations/
> > assign_timezone_abbreviations for a model to follow.
>
> I had already looked there before v7 and had the same feeling in mind,
> but packing the result into a blob requires something other than a List
> to hold the name list (it should just be an array), which in turn
> requires many changes wherever the list is accessed. Still, the current
> state is hopeless, as you mentioned :(
>
> > You are going to
> > need to find a way to package the parse result into a single malloc'd
> > blob, though, because that's as much as guc.c can keep track of for an
> > "extra" value.
>
> Ok, I'll post the v8 with the blob solution soon.

Hmm. It was way easier than I thought. The attached v8 patch does,

- Changed SyncRepConfigData from a struct using a linked list to a
  blob. Since the former struct is useful in parsing, it is still
  used and converted into the latter form in check_s_s_names.
- Made assign_s_s_names do nothing other than just assigning
  SyncRepConfig.
- Changed SyncRepGetSyncStandbys to read the latter form of
  configuration.
- Removed SyncRepFreeConfig since it is no longer needed.

It passes both make check and recovery/make check.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 3c9142e..376fe51 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -361,11 +361,6 @@ SyncRepInitConfig(void)
 {
     int         priority;

-    /* Update the config data of synchronous replication */
-    SyncRepFreeConfig(SyncRepConfig);
-    SyncRepConfig = NULL;
-    SyncRepUpdateConfig();
-
     /*
      * Determine if we are a potential sync standby and remember the result
      * for handling replies from standby.
@@ -575,7 +570,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
     if (am_sync != NULL)
         *am_sync = false;

-    lowest_priority = list_length(SyncRepConfig->members);
+    lowest_priority = SyncRepConfig->nmembers;
     next_highest_priority = lowest_priority + 1;

     /*
@@ -730,9 +725,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
 static int
 SyncRepGetStandbyPriority(void)
 {
-    List       *members;
-    ListCell   *l;
-    int         priority = 0;
+    int         priority;
     bool        found = false;

     /*
@@ -745,12 +738,9 @@ SyncRepGetStandbyPriority(void)
     if (!SyncStandbysDefined())
         return 0;

-    members = SyncRepConfig->members;
-    foreach(l, members)
+    for (priority = 1 ; priority <= SyncRepConfig->nmembers ; priority++)
     {
-        char       *standby_name = (char *) lfirst(l);
-
-        priority++;
+        char       *standby_name = SyncRepConfig->members[priority - 1];

         if (pg_strcasecmp(standby_name, application_name) == 0 ||
             pg_strcasecmp(standby_name, "*") == 0)
@@ -867,50 +857,6 @@ SyncRepUpdateSyncStandbysDefined(void)
     }
 }

-/*
- * Parse synchronous_standby_names and update the config data
- * of synchronous standbys.
- */
-void
-SyncRepUpdateConfig(void)
-{
-    int         parse_rc;
-
-    if (!SyncStandbysDefined())
-        return;
-
-    /*
-     * check_synchronous_standby_names() verifies the setting value of
-     * synchronous_standby_names before this function is called. So
-     * syncrep_yyparse() must not cause an error here.
-     */
-    syncrep_scanner_init(SyncRepStandbyNames);
-    parse_rc = syncrep_yyparse();
-    syncrep_scanner_finish();
-
-    if (parse_rc != 0)
-        ereport(ERROR,
-                (errcode(ERRCODE_SYNTAX_ERROR),
-                 errmsg_internal("synchronous_standby_names parser returned %d",
-                                 parse_rc)));
-
-    SyncRepConfig = syncrep_parse_result;
-    syncrep_parse_result = NULL;
-}
-
-/*
- * Free a previously-allocated config data of synchronous replication.
- */
-void
-SyncRepFreeConfig(SyncRepConfigData *config)
-{
-    if (!config)
-        return;
-
-    list_free_deep(config->members);
-    pfree(config);
-}
-
 #ifdef USE_ASSERT_CHECKING
 static bool
 SyncRepQueueIsOrderedByLSN(int mode)
@@ -956,9 +902,16 @@ bool
 check_synchronous_standby_names(char **newval, void **extra, GucSource source)
 {
     int         parse_rc;
+    SyncRepConfigData *pconf;
+    int         i;
+    ListCell   *lc;

     if (*newval != NULL && (*newval)[0] != '\0')
     {
+        /*
+         * syncrep_yyparse generates a result on the current memory context as
+         * the side effect and points it using syncrep_prase_result.
+         */
         syncrep_scanner_init(*newval);
         parse_rc = syncrep_yyparse();
         syncrep_scanner_finish();
@@ -1016,18 +969,35 @@ check_synchronous_standby_names(char **newval, void **extra, GucSource source)
                      errhint("Specify more names of potential synchronous standbys in synchronous_standby_names.")));
         }

-        /*
-         * syncrep_yyparse sets the global syncrep_parse_result as side effect.
-         * But this function is required to just check, so frees it
-         * after parsing the parameter.
-         */
-        SyncRepFreeConfig(syncrep_parse_result);
-    }
+        /* Convert SyncRepConfig into the packed struct fit to guc extra */
+        pconf = (SyncRepConfigData *)
+            malloc(SizeOfSyncRepConfig(
+                       list_length(syncrep_parse_result->members)));
+        pconf->num_sync = syncrep_parse_result->num_sync;
+        pconf->nmembers = list_length(syncrep_parse_result->members);
+        i = 0;
+        foreach (lc, syncrep_parse_result->members)
+        {
+            strncpy(pconf->members[i], (char *) lfirst(lc), NAMEDATALEN - 1);
+            pconf->members[i][NAMEDATALEN - 1] = 0;
+            i++;
+        }
+        *extra = (void *) pconf;
+
+        /* No further need for syncrep_parse_result */
+        syncrep_parse_result = NULL;
+    }

     return true;
 }

 void
+assign_synchronous_standby_names(const char *newval, void *extra)
+{
+    SyncRepConfig = (SyncRepConfigData *) extra;
+}
+
+void
 assign_synchronous_commit(int newval, void *extra)
 {
     switch (newval)
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 380fedc..932fa9d 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -19,9 +19,9 @@
 #include "utils/formatting.h"

 /* Result of the parsing is returned here */
-SyncRepConfigData *syncrep_parse_result;
+SyncRepParseData *syncrep_parse_result;

-static SyncRepConfigData *create_syncrep_config(char *num_sync, List *members);
+static SyncRepParseData *create_syncrep_config(char *num_sync, List *members);

 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -43,7 +43,7 @@ static SyncRepParseData *create_syncrep_config(char *num_sync, List *members);
 {
     char       *str;
     List       *list;
-    SyncRepConfigData *config;
+    SyncRepParseData *config;
 }

 %token <str> NAME NUM
@@ -72,11 +72,11 @@ standby_name:
 ;

 %%

-static SyncRepConfigData *
+static SyncRepParseData *
 create_syncrep_config(char *num_sync, List *members)
 {
-    SyncRepConfigData *config =
-        (SyncRepConfigData *) palloc(sizeof(SyncRepConfigData));
+    SyncRepParseData *config =
+        (SyncRepParseData *) palloc(sizeof(SyncRepParseData));

     config->num_sync = atoi(num_sync);
     config->members = members;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 81d3d28..20d23d5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2780,23 +2780,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
         MemoryContextSwitchTo(oldcontext);

         /*
-         * Allocate and update the config data of synchronous replication,
-         * and then get the currently active synchronous standbys.
+         * Get the currently active synchronous standbys.
          */
-        SyncRepUpdateConfig();
         LWLockAcquire(SyncRepLock, LW_SHARED);
         sync_standbys = SyncRepGetSyncStandbys(NULL);
         LWLockRelease(SyncRepLock);

-        /*
-         * Free the previously-allocated config data because a backend
-         * no longer needs it. The next call of this function needs to
-         * allocate and update the config data newly because the setting
-         * of sync replication might be changed between the calls.
-         */
-        SyncRepFreeConfig(SyncRepConfig);
-        SyncRepConfig = NULL;
-
         for (i = 0; i < max_wal_senders; i++)
         {
             WalSnd     *walsnd = &WalSndCtl->walsnds[i];
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 60856dd..cccc8eb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3484,7 +3484,7 @@ static struct config_string ConfigureNamesString[] =
         },
         &SyncRepStandbyNames,
         "",
-        check_synchronous_standby_names, NULL, NULL
+        check_synchronous_standby_names, assign_synchronous_standby_names, NULL
     },

     {
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 14b5664..6197308 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -33,16 +33,30 @@
 #define SYNC_REP_WAIT_COMPLETE  2

 /*
+ * Struct for parsing synchronous_standby_names
+ */
+typedef struct SyncRepParseData
+{
+    int         num_sync;   /* number of sync standbys that we need to wait for */
+    List       *members;    /* list of names of potential sync standbys */
+} SyncRepParseData;
+
+/*
  * Struct for the configuration of synchronous replication.
  */
 typedef struct SyncRepConfigData
 {
     int         num_sync;   /* number of sync standbys that we need to wait for */
-    List       *members;    /* list of names of potential sync standbys */
+    int         nmembers;   /* number of members in the following list */
+    char        members[FLEXIBLE_ARRAY_MEMBER][NAMEDATALEN];    /* list of names of
+                                                                 * potential sync
+                                                                 * standbys */
 } SyncRepConfigData;

-extern SyncRepConfigData *syncrep_parse_result;
-extern SyncRepConfigData *SyncRepConfig;
+#define SizeOfSyncRepConfig(n) \
+    (offsetof(SyncRepConfigData, members) + (n) * NAMEDATALEN)
+
+extern SyncRepParseData *syncrep_parse_result;

 /* user-settable parameters for synchronous replication */
 extern char *SyncRepStandbyNames;
@@ -59,13 +73,12 @@ extern void SyncRepReleaseWaiters(void);

 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
-extern void SyncRepUpdateConfig(void);
-extern void SyncRepFreeConfig(SyncRepConfigData *config);

 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);

 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);

 /*
> Hello, attached is the new version v8.
>
> At Tue, 26 Apr 2016 11:02:25 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160426.110225.35506931.horiguchi.kyotaro@lab.ntt.co.jp>
> At Sat, 23 Apr 2016 10:12:03 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <476.1461420723@sss.pgh.pa.us>
> > Amit Kapila <amit.kapila16@gmail.com> writes:
> > > The main point for this improvement is that the handling for guc s_s_names
> > > is not similar to what we do for other somewhat similar guc's and which
> > > causes in-efficiency in non-hot code path (less used code).
> >
> > This is not about efficiency, this is about correctness. The proposed
> > v7 patch is flat out not acceptable, not now and not for 9.7 either,
> > because it introduces a GUC assign hook that can easily fail (eg, through
> > out-of-memory for the copy step). Assign hook functions need to be
> > incapable of failure.
> > incapable of failure. I do not see any good reason why this one cannot
> > satisfy that requirement, either. It just needs to make use of the
> > "extra" mechanism to pass back an already-suitably-long-lived result from
> > check_synchronous_standby_names. See check_timezone_abbreviations/
> > assign_timezone_abbreviations for a model to follow.
>
> I had already seen there before the v7 and had the same feeling
> below in mind but packing in a blob needs to use other than List
> to hold the name list (just should be an array) and it is
> followed by the necessity of many changes in where the list is
> accessed. But the result is hopeless as you mentioned :(
>
> > You are going to
> > need to find a way to package the parse result into a single malloc'd
> > blob, though, because that's as much as guc.c can keep track of for an
> > "extra" value.
>
> Ok, I'll post the v8 with the blob solution sooner.
> Hmm. It was way easier than I thought. The attached v8 patch does,
>
> - Changed SyncRepConfigData from a struct using a linked list to a
>   blob. Since the former struct is useful in parsing, it is still
>   used and converted into the latter form in check_s_s_names.
> - Made assign_s_s_names do nothing other than just assigning
>   SyncRepConfig.
> - Changed SyncRepGetSyncStandbys to read the latter form of
>   configuration.
> - Removed SyncRepFreeConfig since it is no longer needed.
+ /* Convert SyncRepConfig into the packed struct fit to guc extra */
+ pconf = (SyncRepConfigData *)
+ malloc(SizeOfSyncRepConfig(
+ list_length(syncrep_parse_result->members)));
I think there should be a check for malloc failure in the above code.
+ /* No further need for syncrep_parse_result */
+ syncrep_parse_result = NULL;
Isn't this a memory leak? Shouldn't we free the corresponding memory as well?
Hello,

At Tue, 26 Apr 2016 09:57:50 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1KGVrQTueP2Rijjg_FNQ_TU3n5rt8-X5a0LaEzUQ-+i-Q@mail.gmail.com>
> > > > Amit Kapila <amit.kapila16@gmail.com> writes:
> > > > > The main point for this improvement is that the handling for guc s_s_names
> > > > > is not similar to what we do for other somewhat similar guc's and which
> > > > > causes in-efficiency in non-hot code path (less used code).
> > > >
> > > > This is not about efficiency, this is about correctness. The proposed
> > > > v7 patch is flat out not acceptable, not now and not for 9.7 either,
> > > > because it introduces a GUC assign hook that can easily fail (eg, through
> > > > out-of-memory for the copy step). Assign hook functions need to be
> > > > incapable of failure.
>
> It seems to me that a similar problem can exist for
> assign_pgstat_temp_directory(), as it can also lead to an "out of memory"
> error. However, in general I understand your concern and I think we should
> avoid any such failure in assign functions.

I noticed the missing error handling for malloc, then searched for the callers of guc_malloc just now and found the same thing. This should be addressed as a separate issue.

> > > > You are going to
> > > > need to find a way to package the parse result into a single malloc'd
> > > > blob, though, because that's as much as guc.c can keep track of for an
> > > > "extra" value.
> > >
> > > Ok, I'll post the v8 with the blob solution soon.
> >
> > Hmm. It was way easier than I thought. The attached v8 patch does,
...
> > + /* Convert SyncRepConfig into the packed struct fit to guc extra */
> > + pconf = (SyncRepConfigData *)
> > +     malloc(SizeOfSyncRepConfig(
> > +                list_length(syncrep_parse_result->members)));
>
> I think there should be a check for malloc failure in the above code.

Yes, I'm ashamed to have forgotten what I had mentioned just before. I added the same handling as guc_malloc. The error level is ERROR since parsing of GUC files should continue on parse errors (following check_log_destination).

> > + /* No further need for syncrep_parse_result */
> > + syncrep_parse_result = NULL;
>
> Isn't this a memory leak? Shouldn't we free the corresponding memory as
> well?

It is palloc'ed in the current context, which AFAICS would be 'config file processing', or 'PortalHeapMemory' for the ALTER SYSTEM case. Both of them are rather short-lived. I don't think leaving it allocated is a problem in either case, and there's no point in freeing only this object among everything (if anything) allocated by the bison- and flex-generated code... I suppose.

I just added a comment in the v9.

| * No further need for syncrep_parse_result. The memory blocks are
| * released along with the deletion of the current context.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Apr 27, 2016 at 10:14 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I just added a comment in the v9.

Sorry, I attached an empty patch. Here is another one, this time with
content.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> Sorry, I attached an empty patch. Here is another one, this time with
> content.

I started to review this, and in passing came across this gem in
syncrep_scanner.l:

/*
 * flex emits a yy_fatal_error() function that it calls in response to
 * critical errors like malloc failure, file I/O errors, and detection of
 * internal inconsistency. That function prints a message and calls exit().
 * Mutate it to instead call ereport(FATAL), which terminates this process.
 *
 * The process that causes this fatal error should be terminated.
 * Otherwise it has to abandon the new setting value of
 * synchronous_standby_names and keep running with the previous one
 * while the other processes switch to the new one.
 * This inconsistency of the setting that each process is based on
 * can cause a serious problem. Though it's basically not a good idea to
 * use FATAL here because it can take down the postmaster,
 * we should do that in order to avoid such an inconsistency.
 */
#undef fprintf
#define fprintf(file, fmt, msg) syncrep_flex_fatal(fmt, msg)

static void
syncrep_flex_fatal(const char *fmt, const char *msg)
{
    ereport(FATAL,
            (errmsg_internal("%s", msg)));
}

This is the faultiest reasoning possible. There are a hundred reasons why
a process might fail to absorb a GUC setting, and causing just one such
code path to FATAL out is not going to improve system stability one bit.

If you think it is absolutely imperative that all processes in the system
have identical synchronous_standby_names settings, then we need to make it
be PGC_POSTMASTER, not indulge in half-baked non-solutions like this. But
I'd like to know why that is so essential. It looks to me like what
matters is only whether each individual walsender thinks its client is a
sync standby, and so inconsistent settings between different walsenders
don't really matter. Which is a good thing, because if it's to remain
SIGHUP, you can't promise that they'll all absorb a new value at the same
instant anyway.

In short, I don't see any good reason not to make this be a plain ERROR
like it is in every other scanner in the backend.

regards, tom lane
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> Sorry, I attached an empty patch. Here is another one, this time with
> content.

I pushed this after whacking it around some, and cleaning up some
sort-of-related problems in the syncrep parser/lexer.

There remains a point that I'm not very happy about, which is the code in
check_synchronous_standby_names to emit a WARNING if the num_sync setting
is too large. That's a pretty bad compromise: we should either decide
that the case is legal or that it is not. If it's legal, people who are
correctly using the case will not thank us for logging a WARNING every
single time the postmaster gets a SIGHUP (and those who aren't using it
correctly will have their systems freezing up, warning or no warning).
If it's not legal, we should make it an error not a warning.

My inclination is to just rip out the warning. But I wonder whether the
desire to have one doesn't imply that the semantics are poorly chosen and
should be revisited.

regards, tom lane
Hello,

At Wed, 27 Apr 2016 18:05:26 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <3167.1461794726@sss.pgh.pa.us>
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> > Sorry, I attached an empty patch. Here is another one, this time with
> > content.
>
> I pushed this after whacking it around some, and cleaning up some
> sort-of-related problems in the syncrep parser/lexer.

Thank you for pushing this (with improvements) and for the improvements to synchronous_standby_names. I agree with the discussion that standby names should be restricted so as not to break possible extensions in the near future.

> There remains a point that I'm not very happy about, which is the code
> in check_synchronous_standby_names to emit a WARNING if the num_sync
> setting is too large. That's a pretty bad compromise: we should either
> decide that the case is legal or that it is not. If it's legal, people
> who are correctly using the case will not thank us for logging a WARNING
> every single time the postmaster gets a SIGHUP (and those who aren't using
> it correctly will have their systems freezing up, warning or no warning).
> If it's not legal, we should make it an error not a warning.

This specification makes the code a bit complex and the documentation a bit less understandable. It seems somewhat doubtful to me that allowing duplicate (potentially synchronous) walreceivers is useful enough to justify such disadvantages. In spite of this, my inclination is the same as yours below :p, rather than making the behavior consistent and clear.

> My inclination is to just rip out the warning.

Does anyone object to removing the warning?

> But I wonder whether the
> desire to have one doesn't imply that the semantics are poorly chosen
> and should be revisited.

We have already abandoned a bit of backward compatibility in this feature.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> At Wed, 27 Apr 2016 18:05:26 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <3167.1461794726@sss.pgh.pa.us>
> > My inclination is to just rip out the warning.
>
> Does anyone object to removing the warning?

Hearing no objections, done.

regards, tom lane