Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases. - Mailing list pgsql-bugs
From | mark |
---|---|
Subject | Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases. |
Date | |
Msg-id | 058b01cbc7ef$8cfaa9e0$a6effda0$@com |
In response to | Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases. (Fujii Masao <masao.fujii@gmail.com>) |
Responses | Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases. |
List | pgsql-bugs |
> -----Original Message-----
> From: Fujii Masao [mailto:masao.fujii@gmail.com]
> Sent: Tuesday, February 08, 2011 4:00 PM
> To: mark
> Cc: Robert Haas; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.
>
> On Wed, Feb 9, 2011 at 6:36 AM, mark <dvlhntr@gmail.com> wrote:
> > this is the recovery.conf file, see any problems with it? maybe I
> > didn't do some syntax right?
> >
> > [postgres@<redacted> data9.0]$ cat recovery.conf
> > standby_mode = 'on'
> > primary_conninfo = 'host=<redacted> port=5432 user=postgres keepalives_idle=30 keepalives_interval=30 keepalives_count=30'
>
> This setting would lead TCP keepalive to take about 930 seconds
> (= 30 + 30 * 30) to detect the network outage. If you want to stop
> replication as soon as the outage happens, you need to decrease
> the keepalive setting values.

What numbers would you suggest? I have been guessing and probably doing a
very poor job of it. I am turning knobs and not getting any meaningful
change in my problem, so either I am not turning them correctly, or they
are not the right knobs for my problem. Trying to fix my own ignorance
here. (Should I move this off the bugs list, since maybe it's not a bug?)

I have tried leaving the settings unspecified in the recovery file, and I
have also tried the following values in it:

(~two weeks and it died)
keepalives_idle=0 keepalives_interval=0 keepalives_count=0

(~two weeks and it died)
keepalives_idle=30 keepalives_interval=30 keepalives_count=30

(this didn't work either; I don't recall how long it lasted, maybe a month)
keepalives_idle=2100 keepalives_interval=0 keepalives_count=0

Background is basically this: I am trying to do streaming replication over
a WAN, shipping roughly 5GB of changes per day, and the hardware on both
ends can easily keep up with that. It runs over a shared metro line with
about 3-5 MBytes per second that I can count on, depending on the time of
day. I have wal_keep_segments at 250 (I don't care about the disk overhead
for this, since I wanted to avoid WAL archiving). The link is being severed
more often than usual lately while some network changes are being made, so
while I would expect that to improve in the future, this isn't exactly the
most reliable connection. Getting the configuration as right as I can is
therefore of value to me.

Typically I see the streaming replication break down completely a few hours
after something causes an interruption in networking. Nagios notifications
lag somewhat, but not by hours, and they have to go through a few people
before I find out about it. When checking the Nagios logs I don't see
alerts about the distance between the master and the standby growing during
this time; once I see the first unexpected EOF, the distance between the
master and the standby grows further and further until it gets fixed or we
have to re-sync the whole base over.

Again, I can't seem to duplicate this problem on demand with virtual
machines: I start up a master and a standby, set up streaming replication,
kick off a multi-hour or multi-day pgbench run, and start messing with the
networking. Every time I try to duplicate this synthetically, the standby
picks right back up where it left off and catches back up.

I am at a loss, so I do appreciate everyone's help.

Thanks in advance,
-Mark

> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
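
For reference, a minimal sketch of a standby recovery.conf with tighter
keepalive settings, following the layout quoted above. The specific values
are illustrative assumptions, not recommendations made in this thread; the
arithmetic simply mirrors the ~930-second figure cited above (idle +
interval * count).

    # Standby recovery.conf -- illustrative values only (assumed, not from this thread).
    # Approximate time to detect a dead link:
    #   keepalives_idle + keepalives_interval * keepalives_count
    # Quoted settings:  30 + 30 * 30  = ~930 seconds
    # Values below:     30 +  5 *  3  = ~45 seconds
    standby_mode = 'on'
    primary_conninfo = 'host=<redacted> port=5432 user=postgres keepalives_idle=30 keepalives_interval=5 keepalives_count=3'

    # Master-side postgresql.conf setting mentioned in the message above,
    # which retains extra WAL segments for the standby to catch up from:
    wal_keep_segments = 250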