Re: Streaming Replication patch for CommitFest 2009-09 - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Streaming Replication patch for CommitFest 2009-09 |
Date | |
Msg-id | 4AB1EE66.6040804@enterprisedb.com |
In response to | Re: Streaming Replication patch for CommitFest 2009-09 (Fujii Masao <masao.fujii@gmail.com>) |
Responses | Re: Streaming Replication patch for CommitFest 2009-09 |
List | pgsql-hackers |
Fujii Masao wrote:
> On Tue, Sep 15, 2009 at 7:53 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> After playing with this a little bit, I think we need logic in the slave
>> to reconnect to the master if the connection is broken for some reason,
>> or can't be established in the first place. At the moment, that is
>> considered as the end of recovery, and the slave starts up. You have the
>> trigger file mechanism to stop that, but it only gives you a chance to
>> manually kill and restart the slave before it chooses a new timeline and
>> starts up, it doesn't reconnect automatically.
>
> I was thinking that the automatic reconnection capability is the TODO item
> for the later CF. The infrastructure for it has already been introduced in the
> current patch. Please see the macro MAX_WALRCV_RETRIES (backend/
> postmaster/walreceiver.c). This is the maximum number of times to retry
> walreceiver. In the current version, this is the fixed value, but we can make
> this user-configurable (parameter of recovery.conf is suitable, I think).

Ah, I see.

Robert Haas suggested a while ago that walreceiver could be a
stand-alone utility, not requiring postmaster at all. That would allow
you to set up streaming replication as another way to implement WAL
archiving. Looking at how the processes interact, there really isn't
much communication between walreceiver and the rest of the system, so
that sounds pretty attractive.

Walreceiver only needs access to shared memory so that it can tell the
startup process how far it has replicated already. Even when we add the
synchronous capability, I don't think we need any more inter-process
communication. Only if we wanted to acknowledge to the master when a
piece of WAL log has been successfully replayed, the startup process
would need to tell walreceiver about it, but I think we're going to
settle for acknowledging when a piece of log has been fsync'd to disk.

Walreceiver is really a slave to the startup process. The startup
process decides when it's launched, and it's the startup process that
then waits for it to advance. But the way it's set up at the moment, the
startup process needs to ask the postmaster to start it up, and it
doesn't look very robust to me. For example, if launching walreceiver
fails for some reason, startup process will just hang waiting for it.

I'm thinking that walreceiver should be a stand-alone program that the
startup process launches, similar to how it invokes restore_command in
PITR recovery. Instead of using system(), though, it would use
fork+exec, and a pipe to communicate.

Also, when we get around to implement the "fetch base backup
automatically via the TCP connection" feature, we can't use walreceiver
as it is now for that, because there's no hope of starting up the system
that far without a base backup. I'm not sure if it can or should be
merged with the walreceiver program, but it can't be a postmaster child
process, that's for sure.

Thoughts?

> Also a parameter like retries_interval might be necessary. This parameter
> indicates the interval between each reconnection attempt.

Yeah, maybe, although a hard-coded interval of a few seconds should be
enough to get us started.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
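For readers who want to see the shape of the fork+exec-plus-pipe idea from the message above, here is a minimal sketch in Python. It is not code from the patch and not how PostgreSQL implements anything. The function names, the text-over-pipe "protocol" (one progress line per report), the exit-code convention, and the constant values are all assumptions made for illustration; only the name MAX_WALRCV_RETRIES comes from the thread. The point it tries to show is that when the parent owns the pipe, a failed launch appears as end-of-file plus a nonzero exit status instead of an indefinite wait, and that a bounded retry loop with a hard-coded interval fits on top of that.

```python
import os
import sys
import time

MAX_WALRCV_RETRIES = 3     # name taken from the thread; the value is made up
RETRY_INTERVAL_SECS = 3.0  # hypothetical stand-in for "a few seconds"


def launch_receiver(argv):
    """fork+exec argv; return (pid, read_fd) with the child's stdout on a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: point stdout at the pipe, then replace the process image.
        try:
            os.close(read_fd)
            if write_fd != 1:
                os.dup2(write_fd, 1)
                os.close(write_fd)
            os.execvp(argv[0], argv)
        finally:
            # Reached only if exec failed; 127 mirrors the shell convention.
            os._exit(127)
    # Parent: keep only the read end so EOF is seen once the child is gone.
    os.close(write_fd)
    return pid, read_fd


def run_once(argv):
    """Run the child to completion; return (exit_code, last_line_reported)."""
    pid, read_fd = launch_receiver(argv)
    last = None
    with os.fdopen(read_fd) as pipe:
        for line in pipe:  # loop ends at EOF: child exited or exec failed
            last = line.strip()
    _, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return code, last


def receive_with_retries(argv, retries=MAX_WALRCV_RETRIES,
                         interval=RETRY_INTERVAL_SECS):
    """Relaunch the child up to `retries` times, pausing between attempts."""
    for attempt in range(1, retries + 1):
        code, last = run_once(argv)
        if code == 0:
            return last
        if attempt < retries:
            time.sleep(interval)
    return None


if __name__ == "__main__":
    # The "receiver" here is a stand-in that reports one made-up position.
    demo = [sys.executable, "-c", "print('0/3000000')"]
    print(run_once(demo))
    print(receive_with_retries(["/nonexistent/walreceiver-demo"], interval=0.1))
```

A real receiver would stream progress continuously rather than exit after one line, and the parent would likely poll the pipe with a timeout instead of blocking on it. The sketch only covers the launch, failure-detection, and retry parts of the discussion.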