Thread: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in some cases.

The following bug has been logged online:

Bug reference:      5851
Logged by:          Mark
Email address:      dvlhntr@gmail.com
PostgreSQL version: 9.0.2 x86_64
Operating system:   CentOS release 5.5 (Final) | 2.6.18-194.17.1.el5 #1 SMP x86_64
Description:        ROHS (read only hot standby) needs to be restarted
manually in some cases.
Details:

Getting a breakdown in streaming rep. My current workaround is to restart
the PG instance on the ROHS. It doesn't seem to affect the master any, and it
doesn't require a re-rsync of the base to get replication going again. This has
happened with 9.0.2 twice now in a month.



2011-01-26 08:35:42 MST :: (postgres@10.80.2.89) LOG:  could not receive data
from client: Connection reset by peer
2011-01-26 08:35:42 MST :: (postgres@10.80.2.89) LOG:  unexpected EOF on
standby connection

This is all I have in the master's log with the level set to debug 1. I
have reset it to debug 5 and will just wait until it dies again and hopefully
get a better idea of what is going on. Nothing is being logged on the
standby. I can't find anything else to grab that sheds light on this breakdown
in streaming rep that won't start back up.

This is somewhat *long* distance replication over a 100 Mbit metro line. We
have had routing issues in the past and have seen replication fall behind, but
once connectivity is restored we see it catch up without a restart of the
standby.

It probably only ships a few gigs of changes a day.

These are production machines, so I can't do too much playing around to try
and induce "issues".


 PostgreSQL 9.0.2 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC)
4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit
(1 row)




Is this a known issue? I didn't see anything that looked like this when I gave
the mailing list archives a quick glance.

Is there somewhere else I should be looking for more details on why this
is happening?


I can post the configs if you all want them, but nothing special is going on
with regard to them.

thank you,

Mark
On Wed, Jan 26, 2011 at 8:24 PM, Mark <dvlhntr@gmail.com> wrote:
> getting a break down in streaming rep. my current work around is to restart
> the PG instance on the ROHS. doesn't seem to affect the master any. doesn't
> require a re-rsync of the base to get replication going again. has happened
> with 9.0.2 twice now in a month.
>
>
>
> 2011-01-26 08:35:42 MST :: (postgres@10.80.2.89) LOG:  could not receive data
> from client: Connection reset by peer
> 2011-01-26 08:35:42 MST :: (postgres@10.80.2.89) LOG:  unexpected EOF on
> standby connection
>
> this was all I have in the master's log with the level set to debug 1, I
> have reset it to debug 5 and will just wait till it dies again and hopefully
> get a better idea of what is going on. nothing is being logged to the
> standby.

Maybe a break in network connectivity is leading the master to think
that the slave is dead, while the slave still thinks it's connected.
You might need to adjust the TCP keepalive parameters the slave uses
to connect to the master.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
When showing the settings on the slave or master, all tcp_keepalive settings
(idle, interval and count) are showing 0.

The config file has interval and count commented out, but idle is set to
2100.

Possible that "show tcp_keepalive_idle;" isn't reporting accurately ? (or a
value that high isn't be accepted?)

I have reloaded the configs and am still seeing 0's.
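
(For reference, one way to check all three server-side values at once from psql,
along with where each value comes from; note the server parameters are spelled
tcp_keepalives_* with an "s", and they are separate from the libpq keepalives_*
connection options that come up later in the thread:

SELECT name, setting, source
FROM pg_settings
WHERE name LIKE 'tcp_keepalives%';

Also, if memory serves from the docs, these server settings only apply to TCP
connections; a session connected over a Unix-domain socket reports them as zero
no matter what the config file says, which is one possible explanation for the
0's above.)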



I assume you would suggest I turn that number down... a lot.



..: Mark




> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Friday, January 28, 2011 6:48 AM
> To: Mark
> Cc: pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #5851: ROHS (read only hot standby) needs to be
> restarted manually in some cases.
>
> On Wed, Jan 26, 2011 at 8:24 PM, Mark <dvlhntr@gmail.com> wrote:
> > getting a break down in streaming rep. my current work around is to restart
> > the PG instance on the ROHS. doesn't seem to affect the master any. doesn't
> > require a re-rsync of the base to get replication going again. has happened
> > with 9.0.2 twice now in a month.
> >
> >
> >
> > 2011-01-26 08:35:42 MST :: (postgres@10.80.2.89) LOG:  could not receive data
> > from client: Connection reset by peer
> > 2011-01-26 08:35:42 MST :: (postgres@10.80.2.89) LOG:  unexpected EOF on
> > standby connection
> >
> > this was all I have in the master's log with the level set to debug 1, I
> > have reset it to debug 5 and will just wait till it dies again and hopefully
> > get a better idea of what is going on. nothing is being logged to the
> > standby.
>
> Maybe a break in network connectivity is leading the master to think
> that the slave is dead, while the slave still thinks it's connected.
> You might need to adjust the TCP keepalive parameters the slave uses
> to connect to the master.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
On Fri, Jan 28, 2011 at 1:03 PM, mark <dvlhntr@gmail.com> wrote:
> When showing the setting on the slave or master all tcp_keepalive settings
> (idle, interval and count) are showing 0;
>
> The config file shows interval and count commented out, but idle in the
> config file is set to 2100.
>
> Possible that "show tcp_keepalive_idle;" isn't reporting accurately ? (or a
> value that high isn't be accepted?)
>
> I have reloaded configs and still seeing 0's
>
>
>
> I assume you would suggest I turn that number down... a lot.

Yeah, the defaults are way too long for our purposes.  The way to get
this set correctly, I think, is to set it in the primary_conninfo
string on the slave.  You end up with something like this:

primary_conninfo='host=blahblah user=bob keepalives_idle=XX
keepalives_interval=XX keepalives_count=XX'
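
(A hypothetical filled-in version, just to make the shape concrete; the host,
user, and numbers below are placeholders for illustration, not a recommendation:

primary_conninfo = 'host=192.0.2.10 port=5432 user=replicator keepalives_idle=60 keepalives_interval=5 keepalives_count=10'
)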

I'm of the opinion that we really need an application-level keepalive
here, but the above is certainly a lot better than nothing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Sunday, January 30, 2011 12:19 PM
> To: mark
> Cc: pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #5851: ROHS (read only hot standby) needs to be
> restarted manually in some cases.
>
> On Fri, Jan 28, 2011 at 1:03 PM, mark <dvlhntr@gmail.com> wrote:
> > When showing the setting on the slave or master all tcp_keepalive settings
> > (idle, interval and count) are showing 0;
> >
> > The config file shows interval and count commented out, but idle in the
> > config file is set to 2100.
> >
> > Possible that "show tcp_keepalive_idle;" isn't reporting accurately ? (or a
> > value that high isn't be accepted?)
> >
> > I have reloaded configs and still seeing 0's
> >
> >
> >
> > I assume you would suggest I turn that number down... a lot.
>
> Yeah, the defaults are way too long for our purposes.  The way to get
> this set correctly, I think, is to set it in the primary_conninfo
> stream on the slave.  You end up with something like this:
>
> primary_conninfo='host=blahblah user=bob keepalives_idle=XX
> keepalives_interval=XX keepalives_count=XX'
>
Thanks, I will try this on Monday and will report back if it fixes the
problem. (However, since I can't reproduce the issue on demand, it might be a
waiting game; I might not know for a month or so.)

-Mark


> I'm of the opinion that we really need an application-level keepalive
> here, but the above is certainly a lot better than nothing.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
On Sun, Jan 30, 2011 at 12:45 PM, mark <dvlhntr@gmail.com> wrote:
>
>
>> -----Original Message-----
>> From: Robert Haas [mailto:robertmhaas@gmail.com]
>> Sent: Sunday, January 30, 2011 12:19 PM
>> To: mark
>> Cc: pgsql-bugs@postgresql.org
>> Subject: Re: [BUGS] BUG #5851: ROHS (read only hot standby) needs to be
>> restarted manually in some cases.
>>
>> On Fri, Jan 28, 2011 at 1:03 PM, mark <dvlhntr@gmail.com> wrote:
>> > When showing the setting on the slave or master all tcp_keepalive settings
>> > (idle, interval and count) are showing 0;
>> >
>> > The config file shows interval and count commented out, but idle in the
>> > config file is set to 2100.
>> >
>> > Possible that "show tcp_keepalive_idle;" isn't reporting accurately ? (or a
>> > value that high isn't be accepted?)
>> >
>> > I have reloaded configs and still seeing 0's
>> >
>> >
>> >
>> > I assume you would suggest I turn that number down... a lot.
>>
>> Yeah, the defaults are way too long for our purposes.  The way to get
>> this set correctly, I think, is to set it in the primary_conninfo
>> stream on the slave.  You end up with something like this:
>>
>> primary_conninfo='host=blahblah user=bob keepalives_idle=XX
>> keepalives_interval=XX keepalives_count=XX'
>>
> Thanks I will try this on Monday and will report back if it fixes the
> problem. (however since I can't reproduce the issue on demand it might be a
> waiting game. Might not know for a month or so tho)
>
> -Mark
>
>
>> I'm of the opinion that we really need an application-level keepalive
>> here, but the above is certainly a lot better than nothing.

My streaming replication woes continue.


I made those changes in the recovery.conf file, but streaming replication
still stays broken after any sort of network interruption until someone
manually comes along and fixes things by restarting the standby or, if it's
been too long, by resynchronizing the base.

I think it's a network interruption that is triggering the breakdown,
but I don't have anything to prove it.

wal_keep_segments is set to 250, which was supposed to give us a few
hours to fix the issue, but it seems we blew through that many last
night, and so when someone got around to fixing it the standby was
too far behind.


My #1 problem with this right now is that I can't seem to reproduce it on
demand with virtual machines in our development area.

This is the recovery.conf file; do you see any problems with it? Maybe I
didn't get some syntax right?

[postgres@<redacted> data9.0]$ cat recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=<redacted> port=5432 user=postgres
keepalives_idle=30 keepalives_interval=30 keepalives_count=30'



thanks
..: Mark

P.S. Looking forward to 9.1, where a standby can be started with
streaming from scratch. That sounds nice.

>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>
>
On Wed, Feb 9, 2011 at 6:36 AM, mark <dvlhntr@gmail.com> wrote:
> this is the recovery.conf file, see any problems with it? maybe I
> didn't do some syntax right right ?
>
> [postgres@<redacted> data9.0]$ cat recovery.conf
> standby_mode = 'on'
> primary_conninfo = 'host=<redacted> port=5432 user=postgres
> keepalives_idle=30 keepalives_interval=30 keepalives_count=30'

This setting would lead TCP keepalive to take about 930 seconds
(= 30 + 30 * 30) to detect the network outage. If you want to stop
replication as soon as the outage happens, you need to decrease
the keepalive setting values.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
> -----Original Message-----
> From: Fujii Masao [mailto:masao.fujii@gmail.com]
> Sent: Tuesday, February 08, 2011 4:00 PM
> To: mark
> Cc: Robert Haas; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #5851: ROHS (read only hot standby) needs to be
> restarted manually in some cases.
>
> On Wed, Feb 9, 2011 at 6:36 AM, mark <dvlhntr@gmail.com> wrote:
> > this is the recovery.conf file, see any problems with it? maybe I
> > didn't do some syntax right right ?
> >
> > [postgres@<redacted> data9.0]$ cat recovery.conf
> > standby_mode = 'on'
> > primary_conninfo = 'host=<redacted> port=5432 user=postgres
> > keepalives_idle=30 keepalives_interval=30 keepalives_count=30'
>
> This setting would lead TCP keepalive to take about 930 seconds
> (= 30 + 30 * 30) to detect the network outage. If you want to stop
> replication as soon as the outage happens, you need to decrease
> the keepalive setting values.

What numbers would you suggest? I have been guessing and probably doing a
very poor job of it.

I am turning knobs and not getting any meaningful change in my problem. So
either I am not turning them correctly, or they are not the right knobs for
my problem.

Trying to fix my own ignorance here. (Should I move this off the bugs list,
since maybe it's not a bug?)

I have tried leaving the settings unspecified in the recovery file, and I
have tried specifying the following combinations in the recovery file:

(~two weeks and it died)
keepalives_idle=0
keepalives_interval=0
keepalives_count=0

(~two weeks and it dies)
keepalives_idle=30
keepalives_interval=30
keepalives_count=30

(this didn't work either, don't recall how long this lasted, maybe a month)

keepalives_idle=2100
keepalives_interval=0
keepalives_count=0
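
(One thing that might explain the first combination: as far as I know, zero for
the libpq keepalives_* options means "use the operating system default", and on
Linux the default idle time is typically two hours, e.g.:

$ cat /proc/sys/net/ipv4/tcp_keepalive_time
7200

so the 0/0/0 case is effectively the stock kernel behaviour rather than an
aggressive keepalive.)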






Background is basically this: trying to do streaming replication over a WAN,
shipping probably about 5GB of changes per day; hardware on both ends can easily
keep up with that. It runs over a shared metro line with about 3-5 MBytes per
second that I can count on, depending on the time of day. I have
wal_keep_segments at 250 (I don't care about the disk overhead for this, since I
wanted to avoid having to use WAL archiving). The link is being severed more
often than usual lately while some network changes are being made, so while I
would expect that to improve in the future, this isn't exactly the most reliable
connection. So getting whatever I can as right as I can is of value to me.
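
(For scale: at the default 16 MB segment size, wal_keep_segments = 250 retains
roughly 250 * 16 MB = 4 GB of WAL on the master. Against ~5 GB of changes per
day that is a bit less than a day's worth at the average rate, and a bursty
night could plausibly chew through it in a few hours, which would match what
happened above.)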

Typically I see the streaming replication break down for good a few hours
after something causes an interruption in networking. Nagios notifications lag
some, but not by hours, and have to go through a few people before I find out
about it. Checking the Nagios logs, I don't see pages about the distance
between the master and the standby getting bigger during this time; then, once
I see the first unexpected EOF, the distance between the master and standby
gets further and further until it gets fixed or we have to re-sync the whole
base over.
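
(One way to watch that distance directly, using the 9.0 function names: on the
master,

SELECT pg_current_xlog_location();

and on the standby,

SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();

If the standby's receive location stops advancing while the master's current
location keeps moving, the walreceiver connection is probably dead even though
nothing is being logged on the standby.)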


Again, I can't seem to duplicate this problem on demand with virtual machines.
I start up a master and standby, set up streaming rep, kick off a multi-hour or
multi-day pgbench run, and start messing with networking. Every time I try to
duplicate this synthetically, the standby picks right back up where it left off
and catches back up.
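
(A possible way to make the synthetic test closer to the suspected real
failure: rather than downing an interface, which usually lets the kernel send a
FIN or RST so both ends notice, silently drop the replication traffic on the
standby for a while. For example, something along these lines on the standby,
where the master address and port 5432 are assumptions:

# as root on the standby: silently drop packets coming from the master
iptables -A INPUT -s <master-ip> -p tcp --sport 5432 -j DROP

and later remove the rule with the matching iptables -D command. That should
leave the standby's walreceiver holding a connection that will never deliver
data, which is exactly the situation the keepalive settings are meant to
catch.)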

I am at a loss, so I do appreciate everyone's help.

Thanks in advance

-Mark

>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
On Tue, Feb 8, 2011 at 7:23 PM, mark <dvlhntr@gmail.com> wrote:
> (~two weeks and it dies)
> keepalives_idle=30
> keepalives_interval=30
> keepalives_count=30

Maybe something like this:

keepalives_idle=60
keepalives_interval=5
keepalives_count=10
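
(By the same arithmetic as before, that works out to roughly keepalives_idle +
keepalives_interval * keepalives_count = 60 + 5 * 10 = 110 seconds of silence
before a dead connection is noticed, instead of 930.)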

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Thursday, March 03, 2011 9:04 AM
> To: mark
> Cc: Fujii Masao; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #5851: ROHS (read only hot standby) needs to be
> restarted manually in some cases.
>
> On Tue, Feb 8, 2011 at 7:23 PM, mark <dvlhntr@gmail.com> wrote:
> > (~two weeks and it dies)
> > keepalives_idle=30
> > keepalives_interval=30
> > keepalives_count=30
>
> Maybe something like this:
>
> keepalives_idle=60
> keepalives_interval=5
> keepalives_count=10
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


Thank you!


Things have been more stable with regard to our WAN links. I am sure that
won't last for very long, and I expect to find out sooner rather than later
whether these changes mitigate the issue I had been seeing.


Again, thank you

~Mark