Re: Replication server timeout patch - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Replication server timeout patch |
Date | |
Msg-id | AANLkTi=rnYRRq2rucXhGBVHzWhQ=_Fj5bDPiNxrAe+ks@mail.gmail.com |
In response to | Re: Replication server timeout patch (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: Replication server timeout patch |
List | pgsql-hackers |
On Fri, Feb 11, 2011 at 4:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 11, 2011 at 4:30 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> On 11.02.2011 22:11, Robert Haas wrote:
>>>
>>> On Fri, Feb 11, 2011 at 2:02 PM, Daniel Farina<drfarina@acm.org> wrote:
>>>>
>>>> I split this out of the synchronous replication patch for independent
>>>> review. I'm dashing out the door, so I haven't put it on the CF yet or
>>>> anything, but I just wanted to get it out there...I'll be around in
>>>> Not Too Long to finish any other details.
>>>
>>> This looks like a useful and separately committable change.
>>
>> Hmm, so this patch implements a watchdog, where the master disconnects the
>> standby if the heartbeat from the standby stops for more than
>> 'replication_[server]_timeout' seconds. The standby sends the heartbeat
>> every wal_receiver_status_interval seconds.
>>
>> It would be nice if the master and standby could negotiate those settings.
>> As the patch stands, it's easy to have a pathological configuration where
>> replication_server_timeout < wal_receiver_status_interval, so that the
>> master repeatedly disconnects the standby because it doesn't reply in time.
>> Maybe the standby should report how often it's going to send a heartbeat,
>> and master should wait for that long + some safety margin. Or maybe the
>> master should tell the standby how often it should send the heartbeat?
>
> I guess the biggest use case for that behavior would be in a case
> where you have two standbys, one of which doesn't send a heartbeat and
> the other of which does. Then you really can't rely on a single
> timeout.
>
> Maybe we could change the server parameter to indicate what multiple
> of wal_receiver_status_interval causes a hangup, and then change the
> client to notify the server what value it's using. But that gets
> complicated, because the value could be changed while the standby is
> running.

On reflection I'm deeply uncertain this is a good idea. It's pretty hopeless to suppose that we can keep the user from choosing parameter settings which will cause them problems, and there are certainly far stupider things they could do than set replication_timeout < wal_receiver_status_interval. They could, for example, set fsync=off or work_mem=4GB or checkpoint_segments=3 (never mind that we ship that last one out of the box). Any of those settings has the potential to thoroughly destroy their system in one way or another, and there's not a darn thing we can do about it.

Setting up some kind of handshake system based on a multiple of the wal_receiver_status_interval is going to be complex, and it's not necessarily going to deliver the behavior someone wants anyway. If someone has wal_receiver_status_interval=10 on one system and =30 on another system, does it therefore follow that the timeouts should also be different by 3X? Perhaps, but it's non-obvious.

There are two things that I think are pretty clear. If the receiver has wal_receiver_status_interval=0, then we should ignore replication_timeout for that connection. And also we need to make sure that the replication_timeout can't kill off a connection that is in the middle of streaming a large base backup. Maybe we should try to get those two cases right and not worry about the rest.

Dan, can you check whether the base backup thing is a problem with this as implemented?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
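
To make the two clear cases above concrete, here is a rough sketch of the watchdog decision in Python, not the actual walsender C code from the patch. Everything in it is hypothetical: the StandbyConnection record, its field names, and should_disconnect() are made up for illustration. It also assumes the master somehow knows the standby's wal_receiver_status_interval (the patch as posted does not report it) and that a timeout of 0 disables the check.

```python
from dataclasses import dataclass


@dataclass
class StandbyConnection:
    """Hypothetical per-connection state on the master (illustration only)."""
    status_interval: int         # standby's wal_receiver_status_interval in seconds; 0 = no heartbeats
    streaming_base_backup: bool  # True while a base backup is being sent on this connection
    seconds_since_reply: float   # time elapsed since the standby's last heartbeat


def should_disconnect(conn: StandbyConnection, replication_timeout: int) -> bool:
    """Return True if the watchdog should drop this connection."""
    if replication_timeout <= 0:
        # Assumption: 0 turns the timeout off entirely.
        return False
    if conn.status_interval == 0:
        # The standby never sends heartbeats, so silence proves nothing.
        return False
    if conn.streaming_base_backup:
        # A long base backup must not be killed by the watchdog.
        return False
    return conn.seconds_since_reply > replication_timeout


# A standby that heartbeats every 10s but has been silent for 61s is dropped
# with a 60s timeout; the two exempt cases are left alone however long they
# stay quiet.
print(should_disconnect(StandbyConnection(10, False, 61.0), 60))
print(should_disconnect(StandbyConnection(0, False, 1000.0), 60))
print(should_disconnect(StandbyConnection(10, True, 1000.0), 60))
```

Note that nothing here tries to stop replication_timeout < wal_receiver_status_interval; with those settings the last check simply fires on every cycle, which is the user's problem per the argument above.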