Thread: Re: [COMMITTERS] pgsql: Make a hard state change from catchup to streaming mode.

On Fri, Feb 18, 2011 at 10:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Make a hard state change from catchup to streaming mode.
> More useful state change for monitoring purposes, plus a
> required change for synchronous replication patch.

As far as I can see, this patch was not posted or discussed before
commit, and I'm not sure it's the behavior everyone wants.  It has the
effect of preventing the system from ever going backwards from
"streaming" to "catchup".  Is that what we want?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [COMMITTERS] pgsql: Make a hard state change from catchup to streaming mode.

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> Simon Riggs <simon@2ndquadrant.com> wrote:
>> Make a hard state change from catchup to streaming mode.
>> More useful state change for monitoring purposes, plus a
>> required change for synchronous replication patch.
> 
> As far as I can see, this patch was not posted or discussed before
> commit, and I'm not sure it's the behavior everyone wants.  It has
> the effect of preventing the system from ever going backwards from
> "streaming" to "catchup".  Is that what we want?
We are looking at moving to streaming replication instead of WAL
file shipping, but we often have WAN outages.  These can last
minutes, hours, or even a few days.  What would be the impact of
this patch on us during and after such outages?
I don't know how well such experience generalizes, but my personal
experience with replication technology is that "hard state changes"
tend to make things more "clunky" and introduce odd issues at the
state transitions.  Where different message types are intermingled
without such hard state changes, I've seen more graceful behavior.
Of course, take that with a grain of salt -- I haven't read the
patch and am talking in generalities based on having written a
couple serious replication tools in the past, and having used a few
others.
-Kevin


Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 18, 2011 at 10:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Make a hard state change from catchup to streaming mode.
>> More useful state change for monitoring purposes, plus a
>> required change for synchronous replication patch.

> As far as I can see, this patch was not posted or discussed before
> commit, and I'm not sure it's the behavior everyone wants.  It has the
> effect of preventing the system from ever going backwards from
> "streaming" to "catchup".  Is that what we want?

That seems like a very bad idea from here.  Being able to go back to
catchup after loss of the streaming connection is essential for
robustness.  If we now have to restart the slave for that to happen,
it's not an improvement.
        regards, tom lane


On Fri, 2011-02-18 at 10:41 -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Feb 18, 2011 at 10:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> Make a hard state change from catchup to streaming mode.
> >> More useful state change for monitoring purposes, plus a
> >> required change for synchronous replication patch.
> 
> > As far as I can see, this patch was not posted or discussed before
> > commit, and I'm not sure it's the behavior everyone wants.  It has the
> > effect of preventing the system from ever going backwards from
> > "streaming" to "catchup".  Is that what we want?
> 
> That seems like a very bad idea from here.  Being able to go back to
> catchup after loss of the streaming connection is essential for
> robustness.  If we now have to restart the slave for that to happen,
> it's not an improvement.

If we lose the connection then we start another walsender process, so
that is a non-issue.

We need a hard state change at one particular point to avoid sync rep
requestors from hanging for hours when the standby is connected but a
long way from being caught up.

This was a part of my sync rep patch that it was easier to break out and
commit early. There has never been any comment or objection to this
concept and the same feature has existed in my patches for months.

-- Simon Riggs           http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services



On Fri, Feb 18, 2011 at 10:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Feb 18, 2011 at 10:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Make a hard state change from catchup to streaming mode.
>>> More useful state change for monitoring purposes, plus a
>>> required change for synchronous replication patch.
>
>> As far as I can see, this patch was not posted or discussed before
>> commit, and I'm not sure it's the behavior everyone wants.  It has the
>> effect of preventing the system from ever going backwards from
>> "streaming" to "catchup".  Is that what we want?
>
> That seems like a very bad idea from here.  Being able to go back to
> catchup after loss of the streaming connection is essential for
> robustness.  If we now have to restart the slave for that to happen,
> it's not an improvement.

No, that's not the case where it matters.  The state would get reset
on reconnect.  The problem is when, say, the master server is
generating WAL at a rate which exceeds the network bandwidth of the
link between the master and the standby.  The previous coding will
make us flip back into the catchup state when that happens.

Actually, the old code is awfully sensitive, and knowing that you are
not caught up is not really enough information: you need to know how
far behind you are, and how long you've been behind for.  I'm guessing
that Simon intended this patch to deal with that problem, but it's not
the right fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Fri, Feb 18, 2011 at 10:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> We need a hard state change at one particular point to avoid sync rep
> requestors from hanging for hours when the standby is connected but a
> long way from being caught up.

That's a matter of opinion.  I think there are a number of people here
who would say that what you need is a good way to know when you've
caught up, and that you shouldn't enable sync rep until that happens.
What you're proposing would be useful too, if it didn't break other
cases, but it does.  This is precisely why it's a bad idea for us to
be trying to do this kind of engineering at the very last minute.

> This was a part of my sync rep patch that it was easier to break out and
> commit early. There has never been any comment or objection to this
> concept and the same feature has existed in my patches for months.

You posted the latest version of your sync rep patch on January 15th,
after months of inactivity.  Heikki reviewed it on the 21st, and there
it sat un-updated for three weeks.  If your expectation is that any
portion of that patch to which nobody specifically objected is fair
game to commit without further discussion, I don't think that's the
way it works around here.

Post your patches and we'll review them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Fri, 2011-02-18 at 09:41 -0600, Kevin Grittner wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
> > Simon Riggs <simon@2ndquadrant.com> wrote:
>  
> >> Make a hard state change from catchup to streaming mode.
> >> More useful state change for monitoring purposes, plus a
> >> required change for synchronous replication patch.
> > 
> > As far as I can see, this patch was not posted or discussed before
> > commit, and I'm not sure it's the behavior everyone wants.  It has
> > the effect of preventing the system from ever going backwards from
> > "streaming" to "catchup".  Is that what we want?
>  
> We are looking at moving to streaming replication instead of WAL
> file shipping, but we often have WAN outages.  These can last
> minutes, hours, or even a few days.  What would be the impact of
> this patch on us during and after such outages?

None at all. The patch introduces no behavioural changes, only a useful,
but minor re-categorisation of what is already happening so that its
easier to monitor what happens following startup of a standby.
> I don't know how well such experience generalizes, but my personal
> experience with replication technology is that "hard state changes"
> tend to make things more "clunky" and introduce odd issues at the
> state transitions.  Where different message types are intermingled
> without such hard state changes, I've seen more graceful behavior.
>  
> Of course, take that with a grain of salt -- I haven't read the
> patch and am talking in generalities based on having written a
> couple serious replication tools in the past, and having used a few
> others.

I respect your experience.

-- Simon Riggs           http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services



On Fri, Feb 18, 2011 at 11:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 18, 2011 at 10:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Fri, Feb 18, 2011 at 10:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> Make a hard state change from catchup to streaming mode.
>>>> More useful state change for monitoring purposes, plus a
>>>> required change for synchronous replication patch.
>>
>>> As far as I can see, this patch was not posted or discussed before
>>> commit, and I'm not sure it's the behavior everyone wants.  It has the
>>> effect of preventing the system from ever going backwards from
>>> "streaming" to "catchup".  Is that what we want?
>>
>> That seems like a very bad idea from here.  Being able to go back to
>> catchup after loss of the streaming connection is essential for
>> robustness.  If we now have to restart the slave for that to happen,
>> it's not an improvement.
>
> No, that's not the case where it matters.  The state would get reset
> on reconnect.  The problem is when, say, the master server is
> generating WAL at a rate which exceeds the network bandwidth of the
> link between the master and the standby.  The previous coding will
> make us flip back into the catchup state when that happens.
>
> Actually, the old code is awfully sensitive, and knowing that you are
> not caught up is not really enough information: you need to know how
> far behind you are, and how long you've been behind for.  I'm guessing
> that Simon intended this patch to deal with that problem, but it's not
> the right fix.

So am I the only one who thinks this needs to be reverted?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company