Thread: Synchronous replication patch v1
Hi,

Attached is a patch for synchronous log-shipping replication, which was discussed about a month ago. I would like you to review this patch in the November commit fest.

The outline of this patch is as follows:

1) Walsender

This is a new process dedicated to sending xlog up to the position that a backend requests at commit. Walsender calculates the area of xlog to be replicated with logic similar to XLogWrite.

At first, walsender is forked as a normal backend by the postmaster (i.e. the standby connects to the postmaster just like a normal frontend). A backend starts working as walsender after receiving a "mimic-walsender" message; from then on, it is handled differently from a regular backend.

For now, the number of walsenders is restricted to one.

2) Communication between backends and walsender

On commit, a backend tells walsender the position (LSN) to be replicated via shared memory, and wakes it up by signaling if needed. The backend then sleeps until the requested replication is completed; at that point, walsender may signal the backend to wake it up.

Both synchronous and asynchronous replication modes are supported. In the async case, a backend basically doesn't need to sleep for replication.

The user can tune a backend's maximum sleep time as a replication timeout. Currently, hitting the timeout closes the connection to the standby and terminates walsender, but the other postgres processes continue to work.

3) Management of the xlog positions for replication

XLog positions are managed consistently. This requires care especially in AdvanceXLInsertBuffer and in the xlog-switch case.

4) Walreceiver

This is a new contrib program dedicated to receiving xlog and writing it out. The user can specify the xlog location (where walreceiver writes xlog just after receiving it) and the archive location (where walreceiver archives a filled xlog file).
These options are used to cooperate with pg_standby (they prevent pg_standby from reading an xlog file that walreceiver is still writing).

The above is the necessary minimum functionality; some requests that came up in the discussion have not been implemented yet. If there is any other indispensable function, please let me know.

Also, there are some problems with this patch:

* This patch is somewhat big and should be subdivided for review.

* Source code comments and documentation are insufficient.

Is it against the rules of the commit fest to add a patch in this state to the review queue? If so, I would aim for 8.5. Otherwise, I will deal with these problems during the commit fest as well. What is your opinion?

For compile
----------------
* apply sync_replication_v1.patch to HEAD
* put walsender.c in src/backend/postmaster
* put walsender.h in src/include/postmaster
* put walreceiver in contrib

How to use
---------------
1) Start postgres normally (no parameters need to be configured)
2) Start walreceiver and connect to the postmaster just like psql.
   WAL streaming starts automatically.

There are now three configurable parameters in postgresql.conf:

> synchronous_replication = on   # immediate replication at commit
> replication_timeout = 0ms      # 0 is disabled
> wal_sender_delay = 200ms       # 1-10000 milliseconds

The usage of walreceiver is as follows:

> Usage:
>   walreceiver [OPTION]... [XLOGLOCATION] [ARCHIVELOCATION]
>
> Options:
>   -h HOSTNAME  database server host or socket directory (default: local socket)
>   -p PORT      database server port (default: "5432")
>   -U NAME      database user name (default: postgres)
>   -?           show usage

If you want to do replication using walreceiver and pg_standby, their archive locations must be the same.

Regards;

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
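The commit-path handshake described in point 2 of the outline can be sketched as follows. This is a rough illustration only, using Python threads in place of backend/walsender processes and a condition variable in place of shared memory plus signals; all names here (`ReplicationState`, `backend_commit`, etc.) are invented for the sketch and do not appear in the patch.

```python
import threading

class ReplicationState:
    """Toy stand-in for the shared-memory area the patch describes."""
    def __init__(self):
        self.cond = threading.Condition()
        self.requested_lsn = 0   # highest LSN any backend asked to replicate
        self.replicated_lsn = 0  # highest LSN the (fake) standby has confirmed

def backend_commit(state, commit_lsn, timeout=5.0):
    """A backend publishes its commit LSN, wakes walsender, and sleeps
    until replication reaches that LSN (synchronous mode)."""
    with state.cond:
        state.requested_lsn = max(state.requested_lsn, commit_lsn)
        state.cond.notify_all()                    # "signal" walsender
        ok = state.cond.wait_for(
            lambda: state.replicated_lsn >= commit_lsn, timeout)
    return ok  # False would correspond to replication_timeout expiring

def walsender(state, stop_at):
    """Walsender ships WAL up to the requested position and reports back."""
    while True:
        with state.cond:
            state.cond.wait_for(
                lambda: state.requested_lsn > state.replicated_lsn)
            # "send" everything up to the requested position, then confirm
            state.replicated_lsn = state.requested_lsn
            state.cond.notify_all()                # wake sleeping backends
            if state.replicated_lsn >= stop_at:
                return

state = ReplicationState()
sender = threading.Thread(target=walsender, args=(state, 200), daemon=True)
sender.start()
print(backend_commit(state, 100))  # True: LSN 100 replicated
print(backend_commit(state, 200))  # True: LSN 200 replicated
```

In async mode, per the description above, the backend would simply skip the `wait_for` call and return immediately after publishing its LSN.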
Fujii Masao wrote:
> And, there are some problems in this patch;
>
> * This patch is somewhat big, though it should be subdivided for
> review.
>
> * Source code comments and documents are insufficient.
>
> Is it against the rule of commit fest to add such a status patch
> into review-queue? If so, I would aim for 8.5. Otherwise,
> I will deal with the problems also during commit fest.
> What is your opinion?

You can add work-in-progress patches, and even just design docs, to the commitfest queue. That's perfectly OK. They will be reviewed like any other work, but naturally, if a patch isn't ready to be committed without major work, it won't be committed.

I haven't looked at the patch yet, but if you think there's a chance to get it into shape for inclusion in 8.4 before the commit fest is over, you can and should keep working on it and submit updated patches during the commit fest. However, help with reviewing other patches would also be very much appreciated. The idea of commitfests is that everyone stops working on their own stuff, except for cleaning up and responding to review comments on their own queued patches, and helps to review other people's patches.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Fujii Masao wrote:
> Attached is a patch for a synchronous log-shipping replication which
> was discussed just a month ago. I would like you to review this patch
> in Nov commit fest.

Here are some first quick comments:

AFAICS, there's no security at all. Anyone who can log in can become a WAL sender and receive all WAL for the whole cluster.

If the connection is jammed for a while, or is just slow, is there anything that prevents the slave from falling so far behind that the master checkpoints, archives, and deletes WAL segments that are still needed for the replication?

> The outline of this patch is as follow;
>
> 1) Walsender
>
> This is new process to focus on sending xlog through the position
> which a backend requests on commit. Walsender calculates the area
> of xlog to be replicated by a logic similar to XLogWrite.
>
> At first, walsender is forked as a normal backend by postmater (i.e.
> the standby connects to postmaster just like normal frontend).
> A backend works as walsender after receiving "mimic-walsender"
> message. Then, walsender is handled differently from a backend.
>
> Now, the number of walsenders is restricted to one.

That feels kinda weird. I think it would be better if the client indicated in the startup message that it wants to become a WAL sender. That will be needed for authentication anyway.

> And, there are some problems in this patch;
>
> * This patch is somewhat big, though it should be subdivided for
> review.

I've seen bigger :-). The signal handling changes might be a candidate for splitting into a separate patch.

> * Source code comments and documents are insufficient.

Sure. (Though I've seen worse :-).)

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, Oct 31, 2008 at 10:15 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Fujii Masao wrote:
>>
>> And, there are some problems in this patch;
>>
>> * This patch is somewhat big, though it should be subdivided for
>> review.
>>
>> * Source code comments and documents are insufficient.
>>
>> Is it against the rule of commit fest to add such a status patch
>> into review-queue? If so, I would aim for 8.5. Otherwise,
>> I will deal with the problems also during commit fest.
>> What is your opinion?
>
> You can add work-in-progress patches and even just design docs to the
> commitfest queue. That's perfectly OK. They will be reviewed as any other
> work, but naturally if it's not a patch that's ready to be committed without
> major work, it won't be committed.
>
> I haven't looked at the patch yet, but if you think there's chances to get
> it into shape for inclusion to 8.4, before the commit fest is over, you can
> and should keep working on it and submit updated patches during the commit
> fest. However, help with reviewing other patches would also be very much
> appreciated. The idea of commitfests is that everyone stops working on their
> own stuff, except for cleaning up and responding to review comments on one's
> own patches that are in the queue, and helps to review other people's
> patches.

OK, thanks Heikki. I will keep working on Synch Rep during the commit fest. First, as you suggest, I'll split the signal handling changes into a separate patch ASAP.

Regards;

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi, thank you for taking the time to review the patch.

On Fri, Oct 31, 2008 at 11:12 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Fujii Masao wrote:
>>
>> Attached is a patch for a synchronous log-shipping replication which
>> was discussed just a month ago. I would like you to review this patch
>> in Nov commit fest.
>
> Here's some first quick comments:
>
> AFAICS, there's no security, at all. Anyone that can log in, can become a
> WAL sender, and receive all WAL for the whole cluster.

One simple solution is to define a database that is used only for replication. With this solution, we can handle authentication for replication like ordinary database access. That is, pg_hba.conf, cooperation with a database role, etc. are supported for replication as well, so a user can set up the authentication rules easily. ISTM that there is no advantage in separating authentication for replication from the existing mechanism. How about this solution?

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Fri, Oct 31, 2008 at 11:12 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> AFAICS, there's no security, at all. Anyone that can log in, can become a
>> WAL sender, and receive all WAL for the whole cluster.
>
> One simple solution is to define the database only for replication. In
> this solution, we can handle the authentication for replication like the
> usual database access. That is, pg_hba.conf, the cooperation with a
> database role, etc are supported also in replication. So, a user can set
> up the authentication rules easily.

You mean like a pseudo database name in pg_hba.conf, and in the startup message, that actually means "connect for replication"? Yeah, something like that sounds reasonable to me.

> ISTM that there is no advantage which separates authentication for
> replication from the existing mechanism.

Agreed.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
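With a pseudo database name, an administrator could then write an ordinary pg_hba.conf line for replication connections. A hypothetical example (the name "replication" and the rest of the line are illustrative; the patch does not define any such name yet — the column layout is the existing pg_hba.conf format):

```
# TYPE  DATABASE      USER      CIDR-ADDRESS       METHOD
host    replication   standby   192.168.0.10/32    md5
```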
On Wed, Nov 5, 2008 at 12:51 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Fujii Masao wrote:
>>
>> One simple solution is to define the database only for replication. In
>> this solution, we can handle the authentication for replication like the
>> usual database access. That is, pg_hba.conf, the cooperation with a
>> database role, etc are supported also in replication. So, a user can set
>> up the authentication rules easily.
>
> You mean like a pseudo database name in pg_hba.conf, and in the startup
> message, that actually means "connect for replication"? Yeah, something like
> that sounds reasonable to me.

Yes, I would define a pseudo database name for replication. A backend works as walsender only if it receives a startup packet including the database name for replication. But authentication and initialization continue until ReadyForQuery is sent, so I assume that walsender starts replication only after sending ReadyForQuery and then receiving a message requesting replication. In this design, some features (e.g. post_auth_delay) are supported as they are. Another advantage is that a client can use libpq functions, such as PQconnectdb, for the replication connection as they are.

Between ReadyForQuery and the message requesting replication, a client can issue some queries. At least, my walreceiver would query the timeline ID and request an xlog switch. (In my previous patch, these are exchanged after walsender starts, but that has little flexibility.) Of course, I have to create a new function which returns the current timeline ID.
Initial sequence of walsender
----------------
1) Process the startup packet.
1-1) If the database name for replication is specified, the backend
     declares to the postmaster that it is a walsender (removing the
     backend from BackendList, etc).
2) Authentication and initialization (BackendRun, PostgresMain).
3) Walsender sends ReadyForQuery.
4) The client queries the timeline ID and requests an xlog switch.
5) The client requests the start of WAL streaming.
5-1) If the backend is not a walsender, it refuses the request.

I'll correct the code and post it ASAP.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
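The sequence above can be sketched as a tiny decision routine. This is an editor-added illustration; the database name "replication" and the message name "START_REPLICATION" are placeholders, not identifiers from the patch:

```python
REPLICATION_DBNAME = "replication"  # hypothetical pseudo database name

class Backend:
    def __init__(self, startup_dbname):
        # Startup-packet step: the database name in the startup packet
        # decides whether this backend becomes a walsender (and is
        # detached from BackendList in the real design).
        self.is_walsender = (startup_dbname == REPLICATION_DBNAME)
        # Authentication/initialization happen next, then ReadyForQuery.

    def handle(self, message):
        # After ReadyForQuery, ordinary queries (timeline ID, xlog
        # switch) are allowed either way; only a walsender may start
        # WAL streaming -- the last step refuses it otherwise.
        if message == "START_REPLICATION":
            if not self.is_walsender:
                return "ERROR: not connected for replication"
            return "streaming"
        return "query result"

assert Backend("replication").handle("START_REPLICATION") == "streaming"
assert Backend("postgres").handle("START_REPLICATION").startswith("ERROR")
```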
On Tue, 2008-11-04 at 22:59 +0900, Fujii Masao wrote:
> Hi, thank you for taking time to review the patch.
>
> On Fri, Oct 31, 2008 at 11:12 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
> > Here's some first quick comments:
> >
> > AFAICS, there's no security, at all. Anyone that can log in, can become a
> > WAL sender, and receive all WAL for the whole cluster.
>
> One simple solution is to define the database only for replication. In
> this solution, we can handle the authentication for replication like the
> usual database access. That is, pg_hba.conf, the cooperation with a
> database role, etc are supported also in replication. So, a user can set
> up the authentication rules easily. ISTM that there is no advantage which
> separates authentication for replication from the existing mechanism.

Would it be easier to use libpq directly? That would make things easier because whatever connection method you have configured will work for replication as well.

We already have a protocol message for streaming data: COPY.

If you implemented the send as a new command, similar to COPY, it would all work very easily. SENDFILE?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi Simon,

On Wed, Nov 5, 2008 at 7:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Would it be easier to use libpq directly? That would make it easier because
> whatever connection method you have configured will work for replication
> also.
>
> We already have a protocol message for streaming data: COPY.
>
> If you implemented the send as a new command, similar to COPY, it would
> all work very easily. SENDFILE?

Thank you for the suggestion. I will reconsider the protocol of WAL streaming based on it.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
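Simon's suggestion amounts to reusing the frontend/backend COPY sub-protocol as a transport for WAL. In the v3 wire protocol, a CopyData message is a 'd' type byte, a 4-byte big-endian length that includes itself, and then the payload. A rough, editor-added sketch of that framing carrying a WAL chunk (the helper names are invented):

```python
import struct

def copy_data_message(payload: bytes) -> bytes:
    # v3 protocol CopyData: type byte 'd', int32 length (self-inclusive), data
    return b"d" + struct.pack("!I", 4 + len(payload)) + payload

def parse_copy_data(buf: bytes) -> bytes:
    assert buf[0:1] == b"d"
    (length,) = struct.unpack("!I", buf[1:5])
    return buf[5:1 + length]

wal_chunk = b"\x00" * 32          # stand-in for 32 bytes of WAL
msg = copy_data_message(wal_chunk)
assert parse_copy_data(msg) == wal_chunk
assert len(msg) == 1 + 4 + 32     # type byte + length word + payload
```

Any libpq client that can speak COPY could then receive the stream, which is the point of the suggestion.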
Hi Fujii,

Here are some initial thoughts on the structure of this. I've deliberately not yet read the other comments, so we have some independent viewpoints. Sorry if that means we end up saying the same thing twice.

On Fri, 2008-10-31 at 20:36 +0900, Fujii Masao wrote:
> 1) Walsender
>
> This is new process to focus on sending xlog through the position
> which a backend requests on commit. Walsender calculates the area
> of xlog to be replicated by a logic similar to XLogWrite.
>
> At first, walsender is forked as a normal backend by postmater (i.e.
> the standby connects to postmaster just like normal frontend).
> A backend works as walsender after receiving "mimic-walsender"
> message. Then, walsender is handled differently from a backend.
>
> Now, the number of walsenders is restricted to one.

I would think we would want this integrated into the server as an additional special backend, similar to WALWriter. If it works for now, that's fine for other testing. This is not an especially difficult change; I can help with it.

> 2) Communication between backends and walsender
>
> On commit, a backend tells walsender the position (LSN) to be
> replicated via shmem, and wakes it up by signaling if needed.
> Then, a backend sleeps until requested replication is completed.
> At this time, walsender might signal a backend to wake up.
>
> Synchronous and asynchronous replication mode are supported.
> In async case, a backend basically don't need to sleep for replication.
>
> User can tune a backend's max sleep time as a replication timeout.
> Now, the timeout closes the connection to the standby, terminates
> walsender, but the other postgres process continue to work.

No comments until I've read the code.

> 3) Management of the xlog positions for replication
>
> XLog positions are managed consistent. It's necessary to be careful
> especially in AdvanceXLInsertBuffer and xlog_switch case.

Sounds good.
> 4) Walreceiver
>
> This is new contrib program to focus on receiving xlog and writing it.
> User can specify the xlog location (where walreceiver writes xlog in
> just after receiving), and the archive location (where walreceiver
> archives a filled xlog file). This options are used to cooperate with
> pg_standby (prevents pg_standby from reading the xlog file under
> walreceiver writing)

Again, I would expect this to be integrated with the server. I would expect the code to live in src/postmaster/walreceiver.c, with the main logic in a file alongside xlog.c, perhaps xreceive.c. We would start WALReceiver when we enter archive recovery mode - I already have logic for this state change. After that you would be able to use the archive location specified via recovery.conf. The logic need not be any further integrated than you have here.

> The above is a necessary minimum function, and some requests
> which came out in the discussion have not been implemented yet.
> If there is other indispensable function, please let me know.
>
> And, there are some problems in this patch;
>
> * This patch is somewhat big, though it should be subdivided for
> review.
>
> * Source code comments and documents are insufficient.

Source code comments are essential. I try to put in enough comments so that each chunk of the patch has a comment explaining why that change is a necessary part of the whole patch. Doing that is a good way to find chunks that you can remove.

> Now, there are three configurable parameter in postgresql.conf.
>
> > synchronous_replication = on # immediate replication at commit
> > replication_timeout = 0ms # 0 is disabled
> > wal_sender_delay = 200ms # 1-10000 milliseconds

Could you write some docs for this? I just want to check how you think it will work. Does synchronous_replication = off mean

a) asynchronous replication, or
b) no replication at all?

I want to be able to specify sync, async, or no replication.
We need an explanation and an example of how to set this up when performing a large initial base backup. Earlier we discussed using the archiver to transfer initial files and then switching to streaming mode later. How does all that work now?
http://archives.postgresql.org/pgsql-hackers/2008-09/msg01208.php

I'll be looking at this a lot more over the next few weeks/months, so this is just a few short initial comments. Well done for getting this together so quickly, especially with your visit to hospital taking away time.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hi Simon,

On Wed, Nov 5, 2008 at 11:01 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I would think we would want this integrated into the server as an
> additional special backend, similar to WALWriter. If it works for now,
> that's fine for other testing. This is not an especially difficult
> change, I can help with this.

I integrated walsender into the server as a special backend. Please check "walsender process patch v1":
http://archives.postgresql.org/pgsql-hackers/2008-11/msg00294.php

> Again, I would expect this to be integrated with server. I would expect
> code to live in src/postmaster/walreceiver.c, with main logic in a file
> alongside xlog.c, perhaps xreceive.c. We would start WALReceiver when we
> enter archive recovery mode - I already have logic for this state
> change. After that you would be able to use the archive location
> specified via recovery.conf.

OK. I will try to integrate walreceiver into the server. But I'm not familiar with the Hot Standby patch, which includes the logic for such a state change. Which patch do I need to check?

Also, we have to decide where a user specifies the host name and port number. I think that recovery.conf is suitable for specifying them; if they are not specified in recovery.conf, walreceiver would not be invoked. Is there any parameter required for walreceiver in addition to these? (Additional info for authentication?)

>> > synchronous_replication = on # immediate replication at commit
>> > replication_timeout = 0ms # 0 is disabled
>> > wal_sender_delay = 200ms # 1-10000 milliseconds
>
> Could you write some docs for this? I just want to check how you think
> it will work.
>
> Does synchronous_replication = off mean
> a) asynchronous replication or
> b) no replication at all
>
> I want to be able to specify synch, asynch or no replication.

synchronous_replication is very similar to synchronous_commit. The docs are as follows.
8<------------------------------
Specifies whether transaction commit will wait for WAL records to be replicated to the standby before the command returns a "success" indication to the client. The default, and safe, setting is on. When off, there can be a delay between when success is reported to the client and when the transaction is really guaranteed to be safe in the standby against a server crash. (The maximum delay is the same as wal_sender_delay.) Unlike synchronous_commit, setting this parameter to off might cause inconsistency between the database in the primary and the transaction logs in the standby.

This parameter can be changed at any time; the behavior for any one transaction is determined by the setting in effect when it writes its transaction logs. It is therefore possible, and useful, to have some transactions replicate synchronously and others asynchronously. For example, to make a single multi-statement transaction replicate asynchronously when the default is the opposite, issue SET LOCAL synchronous_replication TO OFF within the transaction.
8<------------------------------

I will also write docs for the other parameters.

> We need an explanation and example of how to set this up when performing
> a large initial base backup. Earlier we discussed using archiver to
> transfer initial files and then switching to streaming mode later. How
> does all that work now?
> http://archives.postgresql.org/pgsql-hackers/2008-09/msg01208.php

I assume the following procedure:

1) Start postgres in the primary.
2) Take an online backup on the primary.
3) Put the online backup in place on the standby.
4) Start postgres (with walreceiver) in the standby.
   # Configure restore_command, the primary's host and port in recovery.conf.
5) Manual operation
   # If files needed for PITR are missing in the standby, copy them from
   somewhere (the primary's archive location, a tape backup, etc).
   The missing files might be xlog files or history files.
Since the xlog file segment is switched when replication starts, the missing xlog files should basically exist in the primary's archive location. I will detail this procedure and write it up in the docs.

In the previous discussion, there was a difference of opinion about who copies the missing files: postgres (walsender and walreceiver) or something outside of postgres. Since we cannot reliably predict where the missing files are, I think it's unsuitable for postgres to copy them.

> I'll be looking at this a lot more over next few weeks/months, so this
> is just a few short initial comments.

Thank you for taking the time to review the design!!

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
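The recovery.conf mentioned in step 4 might look like the sketch below. restore_command is the existing recovery parameter; the connection settings are hypothetical placeholders, since the patch has not yet defined how walreceiver's host and port are named:

```
restore_command = 'pg_standby /mnt/primary_archive %f %p %r'
# hypothetical walreceiver connection settings (names not final):
# primary_host = 'primary.example.com'
# primary_port = '5432'
```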
On Thu, Nov 6, 2008 at 3:59 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> 1) Start postgres in the primary
> 2) Get an online-backup in the primary
> 3) Locate the online-backup in the standby
> 4) Start postgres (with walreceiver) in the standby
> # Configure restore_command, host of the primary and port in recovery.conf
> 5) Manual operation
> # If there are missing files for PITR in the standby, copy them
> from somewhere (archive location of the primary, tape backup..etc).
> The missing files might be xlog or history file. Since xlog file
> segment is switched when replication starts, the missing xlog files
> would basically exist in the archive location of the primary.

More precisely, since the startup process and walreceiver determine the timeline ID from the history files, all of those files need to exist in the standby (copied if missing) before starting postgres in step 4.

If a database whose timeline is the same as the primary's already exists in the standby, steps 2) and 3), getting a new online backup, are not necessary. For example, after the standby falls down, its database at that time can be used to restart it.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
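Which segment and history files are "missing" can be checked by name: WAL segment files are named from the timeline ID and the byte position, and timeline history files as <timeline>.history. An editor-added sketch of the standard naming scheme, assuming the default 16MB segment size:

```python
WAL_SEG_SIZE = 16 * 1024 * 1024   # default XLOG_SEG_SIZE

def wal_file_name(timeline: int, lsn: int) -> str:
    """WAL segment file containing byte position `lsn` on `timeline`.
    File names are TTTTTTTTXXXXXXXXSSSSSSSS: 8 hex digits each for the
    timeline ID, the xlog "log" id, and the segment within that log."""
    log_id = lsn // 0x100000000
    seg = (lsn % 0x100000000) // WAL_SEG_SIZE
    return "%08X%08X%08X" % (timeline, log_id, seg)

def history_file_name(timeline: int) -> str:
    """Timeline history files are named from the timeline ID alone."""
    return "%08X.history" % timeline

assert wal_file_name(1, 0) == "000000010000000000000000"
assert wal_file_name(1, WAL_SEG_SIZE) == "000000010000000000000001"
assert wal_file_name(2, 0x100000000) == "000000020000000100000000"
assert history_file_name(2) == "00000002.history"
```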
On Thu, Nov 6, 2008 at 2:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> If the database whose timeline is the same as the primary's
> exists in the standby, 2)3) getting new online-backup is not
> necessary. For example, after the standby falls down, the
> database at that time is applicable to restart it.

If I remember correctly, when postgres finishes its recovery, it increments the timeline. If this is true, whenever ACT fails and SBY becomes primary, SBY would increment its timeline. So when the former ACT comes back and joins the replication as SBY, would it need to get a fresh backup before it can join as SBY?

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Hi Pavan,

On Thu, Nov 6, 2008 at 9:35 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> If I remember correctly, when postgres finishes its recovery, it
> increments the timeline. If this is true, whenever ACT fails and SBY
> becomes primary, SBY would increment its timeline. So when the former
> ACT comes back and joins the replication as SBY, would it need to get
> a fresh backup before it can join as SBY ?

PITR from anything other than an online backup is tricky in the first place. We might not be able to officially support catching up without a fresh online backup.

Furthermore, there is another problem. Please see the following mail:
http://archives.postgresql.org/pgsql-hackers/2008-09/msg00964.php

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hi,

On Thu, Nov 6, 2008 at 3:59 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Again, I would expect this to be integrated with server. I would expect
>> code to live in src/postmaster/walreceiver.c, with main logic in a file
>> alongside xlog.c, perhaps xreceive.c. We would start WALReceiver when we
>> enter archive recovery mode - I already have logic for this state
>> change. After that you would be able to use the archive location
>> specified via recovery.conf.
>
> OK. I will try to integrate walreceiver into the server.

A quick report on the current status of the coding: I'm going to post the next version of the patch tomorrow. Please wait a little longer ;)

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center