Thread: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.
BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.
From
"Mark"
Date:
The following bug has been logged online: Bug reference: 5851 Logged by: Mark Email address: dvlhntr@gmail.com PostgreSQL version: 9.0.2 x86_64 Operating system: CentOS release 5.5 (Final) | 2.6.18-194.17.1.el5 #1 SMP X86_64 Description: ROHS (read only hot standby) needs to be restarted manually in somecases. Details: getting a break down in streaming rep. my current work around is to restart the PG instance on the ROHS. doesn't seem to affect the master any. doesn't require a re-rsync of the base to get replication going again. has happened with 9.0.2 twice now in a month. 2011-01-26 08:35:42 MST :: (postgres@10.80.2.89) LOG: could not receive data from client: Connection reset by peer 2011-01-26 08:35:42 MST :: (postgres@10.80.2.89) LOG: unexpected EOF on standby connection this was all I have in the master's log with the level set to debug 1, I have reset it to debug 5 and will just wait till it dies again and hopefully get a better idea of what is going on. nothing is being logged to the standby. I can't find anything else to grab that shows this break down in streaming rep that won't start back up. This is a somewhat *long* distance replication over a 100mbit metro line. we have had routing issues in the past and see replication fall behind but once connectivity is restored we see it catch up, without a restart of the standby. probably only ships a few gig of changes a day. these are production machines so I can't do too much playing around to try and induce "issues" PostgreSQL 9.0.2 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit (1 row) is this a known issue ? I didn't see anything when I have the mailing list archive a quick glance search that looked like this. is there somewhere else I should be looking for more details into why this is happening ? I can post the configs if you all want them but nothing special is happening w/ regards to them. thank you, Mark
Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.
From
Robert Haas
Date:
On Wed, Jan 26, 2011 at 8:24 PM, Mark <dvlhntr@gmail.com> wrote: > getting a break down in streaming rep. my current work around is to resta= rt > the PG instance on the ROHS. doesn't seem to affect the master any. doesn= 't > require a re-rsync of the base to get replication going again. has happen= ed > with 9.0.2 twice now in a month. > > > > 2011-01-26 08:35:42 MST :: (postgres@10.80.2.89) LOG: =A0could not receive > data > from client: Connection reset by peer > 2011-01-26 08:35:42 MST :: (postgres@10.80.2.89) LOG: =A0unexpected EOF on > standby connection > > this was all I have in the master's log with the level set to debug 1, I > have reset it to debug 5 and will just wait till it dies again and hopefu= lly > get a better idea of what is going on. nothing is being logged to the > standby. Maybe a break in network connectivity is leading the master to think that the slave is dead, while the slave still thinks it's connected. You might need to adjust the TCP keepalive parameters the slave uses to connect to the master. --=20 Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.
From
"mark"
Date:
When showing the setting on the slave or master all tcp_keepalive settings (idle, interval and count) are showing 0; The config file shows interval and count commented out, but idle in the config file is set to 2100.=20 Possible that "show tcp_keepalive_idle;" isn't reporting accurately ? (or a value that high isn't be accepted?) I have reloaded configs and still seeing 0's I assume you would suggest I turn that number down... a lot.=20 ..: Mark > -----Original Message----- > From: Robert Haas [mailto:robertmhaas@gmail.com] > Sent: Friday, January 28, 2011 6:48 AM > To: Mark > Cc: pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #5851: ROHS (read only hot standby) needs to be > restarted manually in somecases. >=20 > On Wed, Jan 26, 2011 at 8:24 PM, Mark <dvlhntr@gmail.com> wrote: > > getting a break down in streaming rep. my current work around is to > restart > > the PG instance on the ROHS. doesn't seem to affect the master any. > doesn't > > require a re-rsync of the base to get replication going again. has > happened > > with 9.0.2 twice now in a month. > > > > > > > > 2011-01-26 08:35:42 MST :: (postgres@10.80.2.89) LOG: =A0could not > receive > > data > > from client: Connection reset by peer > > 2011-01-26 08:35:42 MST :: (postgres@10.80.2.89) LOG: =A0unexpected EOF > on > > standby connection > > > > this was all I have in the master's log with the level set to debug > 1, I > > have reset it to debug 5 and will just wait till it dies again and > hopefully > > get a better idea of what is going on. nothing is being logged to the > > standby. >=20 > Maybe a break in network connectivity is leading the master to think > that the slave is dead, while the slave still thinks it's connected. > You might need to adjust the TCP keepalive parameters the slave uses > to connect to the master. >=20 > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.
From
Robert Haas
Date:
On Fri, Jan 28, 2011 at 1:03 PM, mark <dvlhntr@gmail.com> wrote: > When showing the setting on the slave or master all tcp_keepalive settings > (idle, interval and count) are showing 0; > > The config file shows interval and count commented out, but idle in the > config file is set to 2100. > > Possible that "show tcp_keepalive_idle;" isn't reporting accurately ? (or a > value that high isn't be accepted?) > > I have reloaded configs and still seeing 0's > > > > I assume you would suggest I turn that number down... a lot. Yeah, the defaults are way too long for our purposes. The way to get this set correctly, I think, is to set it in the primary_conninfo stream on the slave. You end up with something like this: primary_conninfo='host=blahblah user=bob keepalives_idle=XX keepalives_interval=XX keepalives_count=XX' I'm of the opinion that we really need an application-level keepalive here, but the above is certainly a lot better than nothing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.
From
"mark"
Date:
> -----Original Message----- > From: Robert Haas [mailto:robertmhaas@gmail.com] > Sent: Sunday, January 30, 2011 12:19 PM > To: mark > Cc: pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #5851: ROHS (read only hot standby) needs to be > restarted manually in somecases. > > On Fri, Jan 28, 2011 at 1:03 PM, mark <dvlhntr@gmail.com> wrote: > > When showing the setting on the slave or master all tcp_keepalive > settings > > (idle, interval and count) are showing 0; > > > > The config file shows interval and count commented out, but idle in > the > > config file is set to 2100. > > > > Possible that "show tcp_keepalive_idle;" isn't reporting accurately ? > (or a > > value that high isn't be accepted?) > > > > I have reloaded configs and still seeing 0's > > > > > > > > I assume you would suggest I turn that number down... a lot. > > Yeah, the defaults are way too long for our purposes. The way to get > this set correctly, I think, is to set it in the primary_conninfo > stream on the slave. You end up with something like this: > > primary_conninfo='host=blahblah user=bob keepalives_idle=XX > keepalives_interval=XX keepalives_count=XX' > Thanks I will try this on Monday and will report back if it fixes the problem. (however since I can't reproduce the issue on demand it might be a waiting game. Might not know for a month or so tho) -Mark > I'm of the opinion that we really need an application-level keepalive > here, but the above is certainly a lot better than nothing. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.
From
mark
Date:
On Sun, Jan 30, 2011 at 12:45 PM, mark <dvlhntr@gmail.com> wrote: > > >> -----Original Message----- >> From: Robert Haas [mailto:robertmhaas@gmail.com] >> Sent: Sunday, January 30, 2011 12:19 PM >> To: mark >> Cc: pgsql-bugs@postgresql.org >> Subject: Re: [BUGS] BUG #5851: ROHS (read only hot standby) needs to be >> restarted manually in somecases. >> >> On Fri, Jan 28, 2011 at 1:03 PM, mark <dvlhntr@gmail.com> wrote: >> > When showing the setting on the slave or master all tcp_keepalive >> settings >> > (idle, interval and count) are showing 0; >> > >> > The config file shows interval and count commented out, but idle in >> the >> > config file is set to 2100. >> > >> > Possible that "show tcp_keepalive_idle;" isn't reporting accurately ? >> (or a >> > value that high isn't be accepted?) >> > >> > I have reloaded configs and still seeing 0's >> > >> > >> > >> > I assume you would suggest I turn that number down... a lot. >> >> Yeah, the defaults are way too long for our purposes. =A0The way to get >> this set correctly, I think, is to set it in the primary_conninfo >> stream on the slave. =A0You end up with something like this: >> >> primary_conninfo=3D'host=3Dblahblah user=3Dbob keepalives_idle=3DXX >> keepalives_interval=3DXX keepalives_count=3DXX' >> > Thanks I will try this on Monday and will report back if it fixes the > problem. (however since I can't reproduce the issue on demand it might be= a > waiting game. Might not know for a month or so tho) > > -Mark > > >> I'm of the opinion that we really need an application-level keepalive >> here, but the above is certainly a lot better than nothing. my streaming replication woes continue. I made those changes in the recovery.conf file but I am still having streaming replication stay broken after any sort of network interruption until someone manaully comes along and fixes things by restarting the standby or if it's been too long resynchronizing the base. I think it's a network interruption that is triggering the break down, but I don't have anything to prove it. wal_keep_segments are set to 250, which was supposed to give us a few hours to fix the issue but it seems we blew through that many last night and such when someone got around to fixing it the standby was too far behind. my #1 problem with this right now is I can't seem to reproduce on demand with virtual machines in our development area. this is the recovery.conf file, see any problems with it? maybe I didn't do some syntax right right ? [postgres@<redacted> data9.0]$ cat recovery.conf standby_mode =3D 'on' primary_conninfo =3D 'host=3D<redacted> port=3D5432 user=3Dpostgres keepalives_idle=3D30 keepalives_interval=3D30 keepalives_count=3D30' thanks ..: Mark p.s. looking forward to 9.1 where a standby can be started with streaming from scratch. that sounds nice. >> >> -- >> Robert Haas >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company > >
Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.
From
Fujii Masao
Date:
On Wed, Feb 9, 2011 at 6:36 AM, mark <dvlhntr@gmail.com> wrote: > this is the recovery.conf file, see any problems with it? maybe I > didn't do some syntax right right ? > > [postgres@<redacted> data9.0]$ cat recovery.conf > standby_mode = 'on' > primary_conninfo = 'host=<redacted> port=5432 user=postgres > keepalives_idle=30 keepalives_interval=30 keepalives_count=30' This setting would lead TCP keepalive to take about 930 seconds (= 30 + 30 * 30) to detect the network outage. If you want to stop replication as soon as the outage happens, you need to decrease the keepalive setting values. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.
From
"mark"
Date:
> -----Original Message----- > From: Fujii Masao [mailto:masao.fujii@gmail.com] > Sent: Tuesday, February 08, 2011 4:00 PM > To: mark > Cc: Robert Haas; pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #5851: ROHS (read only hot standby) needs to be > restarted manually in somecases. > > On Wed, Feb 9, 2011 at 6:36 AM, mark <dvlhntr@gmail.com> wrote: > > this is the recovery.conf file, see any problems with it? maybe I > > didn't do some syntax right right ? > > > > [postgres@<redacted> data9.0]$ cat recovery.conf > > standby_mode = 'on' > > primary_conninfo = 'host=<redacted> port=5432 user=postgres > > keepalives_idle=30 keepalives_interval=30 keepalives_count=30' > > This setting would lead TCP keepalive to take about 930 seconds > (= 30 + 30 * 30) to detect the network outage. If you want to stop > replication as soon as the outage happens, you need to decrease > the keepalive setting values. What numbers would you suggest? I have been guessing and probably doing a very poor job of it. I am turning knobs and not getting any meaningful changes with respect to in my problem. So either I am not turning them correctly, or they are not the right knobs for my problem. Trying to fix my own ignorance here. (should I move this off the bugs list, since maybe it's not a bug?) The settings have been unspecified in the recovery file, it's been specified in the recovery file, and I have tried the following in the recovery file: (~two weeks and it died) keepalives_idle=0 keepalives_interval=0 keepalives_count=0 (~two weeks and it dies) keepalives_idle=30 keepalives_interval=30 keepalives_count=30 (this didn't work either, don't recall how long this lasted, maybe a month) keepalives_idle=2100 keepalives_interval=0 keepalives_count=0 Background is basically this: trying to do streaming replication over a WAN, probably ship about 5GB of changes per day, hardware on both ends can easily keep up with that. Running over a shared metro line and have about 3-5MBytes per second depending on the time of day that I can count on. I have wal_keep segments at 250 (I don't care about the disk overhead for this, since I wanted to not have to use wal archiving). The link is being severed more often than usually lately while some network changes are being made so while I would expect that improve in the future this isn't exactly the most reliable connection. so getting whatever as right as I can is of value to me. Typically I see the streaming replication break down for good completely a few hours after something that causes a interruption in networking. Nagios notifications lag some but not hours and has to go through a few people before I find out about it. When checking the nagios pages on their logs I don't see pages about the distance between the master and the standby getting bigger during this time, and then once I see the first unexpected EOF then the distance between the master and standby gets further and further until it gets fixed or we have to re-sync the whole base over. Again I can't seem to duplicate this problem on demand with virtual machines, I startup a master and standby, setup streaming rep, kickoff a multi hour or day pg bench run and start messing with networking. Every time I try and duplicate this synthetically the standby picks right back where it left off and catches back up. I am at a loss so I do appreciate everyone's help. Thanks in advance -Mark > > Regards, > > -- > Fujii Masao > NIPPON TELEGRAPH AND TELEPHONE CORPORATION > NTT Open Source Software Center
Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.
From
Robert Haas
Date:
On Tue, Feb 8, 2011 at 7:23 PM, mark <dvlhntr@gmail.com> wrote: > (~two weeks and it dies) > keepalives_idle=30 > keepalives_interval=30 > keepalives_count=30 Maybe something like this: keepalives_idle=60 keepalives_interval=5 keepalives_count=10 -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.
From
"mark"
Date:
> -----Original Message----- > From: Robert Haas [mailto:robertmhaas@gmail.com] > Sent: Thursday, March 03, 2011 9:04 AM > To: mark > Cc: Fujii Masao; pgsql-bugs@postgresql.org > Subject: Re: [BUGS] BUG #5851: ROHS (read only hot standby) needs to be > restarted manually in somecases. > > On Tue, Feb 8, 2011 at 7:23 PM, mark <dvlhntr@gmail.com> wrote: > > (~two weeks and it dies) > > keepalives_idle=30 > > keepalives_interval=30 > > keepalives_count=30 > > Maybe something like this: > > keepalives_idle=60 > keepalives_interval=5 > keepalives_count=10 > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company Thank you ! Things have been more stable with regards to our WAN links. I am sure that won't last for very long and expect to find out sooner than later if these changes mitigate the issue I had been seeing. Again, thank you ~Mark