Thread: Recovery test failure for recovery_min_apply_delay on hamster

Recovery test failure for recovery_min_apply_delay on hamster

From
Michael Paquier
Date:
Hi all,

I enabled the recovery test suite on hamster yesterday, and we did
not have to wait long before seeing its first failure; the machine is
slow as hell, so it is quite good at catching race conditions:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-03-01%2016%3A00%3A06
Honestly, I had done runs of the test suite on this machine without
seeing this failure, so it is quite sporadic. Yesterday's run worked
fine, for example.

In more detail, the following problem showed up:
### Running SQL command on node "standby": SELECT count(*) FROM tab_int
not ok 1 - check content with delay of 1s

#   Failed test 'check content with delay of 1s'
#   at t/005_replay_delay.pl line 39.
#          got: '20'
#     expected: '10'
### Running SQL command on node "master": SELECT pg_current_xlog_location();
### Running SQL command on node "standby": SELECT count(*) FROM tab_int
ok 2 - check content with delay of 2s

This is a timing issue caused by the use of recovery_min_apply_delay.
The test does the following:
1) Set recovery_min_apply_delay to 2 seconds
2) Start the standby
3) Apply an INSERT on the master, and save pg_current_xlog_location
from the master
4) Sleep 1s
5) Query the standby, and check that the WAL has not been applied yet
6) Wait until the required LSN from the master has been applied
7) Query the standby again, and check that the WAL has been applied

The problem is that hamster is visibly so slow that more than 2s
elapsed between steps 3 and 5, meaning that the delay had already
been reached and the WAL had been applied.
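
As a rough Perl sketch of the sequence above (using the PostgresNode
helpers of the recovery test suite; method names approximate, node
setup omitted):

    # Steps 1-2: configure the delay, then start the standby.
    $node_standby->append_conf('recovery.conf',
        "recovery_min_apply_delay = '2s'");
    $node_standby->start;

    # Step 3: INSERT on the master, and remember its WAL position.
    $node_master->safe_psql('postgres',
        'INSERT INTO tab_int VALUES (generate_series(11, 20))');
    my $until_lsn = $node_master->safe_psql('postgres',
        'SELECT pg_current_xlog_location()');

    # Steps 4-5: sleep for less than the delay, then check that the
    # standby has not applied the INSERT yet.
    sleep 1;
    is($node_standby->safe_psql('postgres',
            'SELECT count(*) FROM tab_int'),
        10, 'check content with delay of 1s');

    # Steps 6-7: wait until the master's LSN has been replayed, then
    # check that the INSERT is now visible.
    $node_standby->poll_query_until('postgres',
        "SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()");
    is($node_standby->safe_psql('postgres',
            'SELECT count(*) FROM tab_int'),
        20, 'check content with delay of 2s');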

Here are a couple of ways to address this problem:
1) Remove the check done before the delay is reached.
2) Increase recovery_min_apply_delay to a value that allows even slow
machines to see a difference. From experience with the other tests,
30s would be enough. The sleep time needs to be increased as well,
making the test take longer to run.
3) Remove 005 altogether, because doing either 1) or 2) reduces the
value of the test.
I'd prefer 1) personally; I still see value in this test.

Thoughts?
-- 
Michael



Re: Recovery test failure for recovery_min_apply_delay on hamster

From
Michael Paquier
Date:
On Wed, Mar 2, 2016 at 2:04 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Here are a couple of ways to address this problem:
> 1) Remove the check done before the delay is reached.
> 2) Increase recovery_min_apply_delay to a value that allows even slow
> machines to see a difference. From experience with the other tests,
> 30s would be enough. The sleep time needs to be increased as well,
> making the test take longer to run.
> 3) Remove 005 altogether, because doing either 1) or 2) reduces the
> value of the test.
> I'd prefer 1) personally; I still see value in this test.

So, as doing 1) would actually be equivalent to simply having a master
and checking that its standby replicates correctly, I have been
looking at 2) to see how long the delay has to be set to make the
test failure-proof. After some measurements on hamster, 10s and 15s
unfortunately proved not to be enough, though 20s has not failed in
10 attempts. Attached is a patch to bump the delay to 20s, though I
would not complain if the test were removed instead to speed up the
runs of this test suite.
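
For clarity, the gist of the attached patch is a one-line change of
this shape (a hypothetical rendering, not the patch itself):

    # Bump the standby's delay in the test from 2s to 20s, so that
    # even a slow machine cannot reach it before the first check.
    $node_standby->append_conf('recovery.conf',
        "recovery_min_apply_delay = '20s'");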
--
Michael

Attachment

Re: Recovery test failure for recovery_min_apply_delay on hamster

From
Alvaro Herrera
Date:
Michael Paquier wrote:
> On Wed, Mar 2, 2016 at 2:04 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > Here are a couple of ways to address this problem:
> > 1) Remove the check done before the delay is reached.
> > 2) Increase recovery_min_apply_delay to a value that allows even slow
> > machines to see a difference. From experience with the other tests,
> > 30s would be enough. The sleep time needs to be increased as well,
> > making the test take longer to run.
> > 3) Remove 005 altogether, because doing either 1) or 2) reduces the
> > value of the test.
> > I'd prefer 1) personally; I still see value in this test.
> 
> So, as doing 1) would actually be equivalent to simply having a master
> and checking that its standby replicates correctly, I have been
> looking at 2) to see how long the delay has to be set to make the
> test failure-proof. After some measurements on hamster, 10s and 15s
> unfortunately proved not to be enough, though 20s has not failed in
> 10 attempts. Attached is a patch to bump the delay to 20s, though I
> would not complain if the test were removed instead to speed up the
> runs of this test suite.

Is there anything we can do to short-circuit the wait in the case that
replication happens promptly?  A one-minute wait would be acceptable if
we terminate it early by checking every second.
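
Something along these lines, say (a hypothetical sketch; $until_lsn
would be the master's pg_current_xlog_location taken after the
INSERT):

    # Allow up to 60 seconds, but stop polling as soon as the standby
    # has replayed the master's LSN.
    my $applied = 0;
    for my $attempt (1 .. 60)
    {
        if ($node_standby->safe_psql('postgres',
                "SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()")
            eq 't')
        {
            $applied = 1;
            last;
        }
        sleep 1;
    }
    ok($applied, 'standby eventually applied the delayed WAL');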

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Recovery test failure for recovery_min_apply_delay on hamster

From
Michael Paquier
Date:
On Wed, Mar 9, 2016 at 12:14 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Is there anything we can do to short-circuit the wait in the case that
> replication happens promptly?  A one-minute wait would be acceptable if
> we terminate it early by checking every second.

After sleeping on it (sleep is the best debugger ever), a way popped
into my mind, and I propose the attached, which refactors 005 a bit
and checks that the LSN position of the master has been applied on the
standby after at least the wanted delay. A maximum wait of 90s is
allowed, as in poll_query_until.
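
In other words, the wait itself becomes roughly this (a sketch, not
the attached patch; poll_query_until retries the query for up to 90s):

    # Wait until the master's write position has been replayed on the
    # standby, giving up after poll_query_until's 90s limit.
    $node_standby->poll_query_until('postgres',
        "SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()")
      or die "standby never applied the master's WAL";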
--
Michael

Attachment

Re: Recovery test failure for recovery_min_apply_delay on hamster

From
Alvaro Herrera
Date:
Michael Paquier wrote:
> On Wed, Mar 9, 2016 at 12:14 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Is there anything we can do to short-circuit the wait in the case that
> > replication happens promptly?  A one-minute wait would be acceptable if
> > we terminate it early by checking every second.
> 
> After sleeping on it (sleep is the best debugger ever), a way popped
> into my mind, and I propose the attached, which refactors 005 a bit
> and checks that the LSN position of the master has been applied on the
> standby after at least the wanted delay. A maximum wait of 90s is
> allowed, as in poll_query_until.

Hmm, okay, that's great.  A question: what happens if the test itself is
slow and the servers are fast, and the test doesn't manage to run two
iterations before the two seconds have elapsed?  This may happen on
overloaded or slow servers, if you're unlucky.

At the moment I don't have any ideas on how to ensure that we don't
apply earlier than the given period.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Recovery test failure for recovery_min_apply_delay on hamster

From
Michael Paquier
Date:
On Wed, Mar 9, 2016 at 12:29 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>> After sleeping on it (sleep is the best debugger ever), a way popped
>> into my mind, and I propose the attached, which refactors 005 a bit
>> and checks that the LSN position of the master has been applied on the
>> standby after at least the wanted delay. A maximum wait of 90s is
>> allowed, as in poll_query_until.
>
> Hmm, okay, that's great.  A question: what happens if the test itself is
> slow and the servers are fast, and the test doesn't manage to run two
> iterations before the two seconds have elapsed?  This may happen on
> overloaded or slow servers, if you're unlucky.

Yes, a failure would happen. The same thought occurred to me during a
long flight. And this is why the previous patch was full of meh.

> At the moment I don't have any ideas on how to ensure that we don't
> apply earlier than the given period.

Attached is one approach, based on timestamp values queried from the
standby server. We could also use Perl's localtime call to calculate
the time delay.
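
In sketch form (hypothetical code, not the attached patch itself):

    # Take a timestamp on the standby before triggering the INSERT on
    # the master; after the LSN has been replayed, verify on the
    # standby that at least the configured delay has elapsed.
    my $start_ts = $node_standby->safe_psql('postgres', 'SELECT now()');

    # ... INSERT on the master, then wait for the LSN to be replayed ...

    is($node_standby->safe_psql('postgres',
            "SELECT now() - '$start_ts'::timestamptz >= '2s'::interval"),
        't', 'WAL applied only after the configured delay');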
--
Michael

Attachment

Re: Recovery test failure for recovery_min_apply_delay on hamster

From
Michael Paquier
Date:
On Wed, Mar 9, 2016 at 9:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Mar 9, 2016 at 12:29 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Michael Paquier wrote:
>>> After sleeping on it (sleep is the best debugger ever), a way popped
>>> into my mind, and I propose the attached, which refactors 005 a bit
>>> and checks that the LSN position of the master has been applied on the
>>> standby after at least the wanted delay. A maximum wait of 90s is
>>> allowed, as in poll_query_until.
>>
>> Hmm, okay, that's great.  A question: what happens if the test itself is
>> slow and the servers are fast, and the test doesn't manage to run two
>> iterations before the two seconds have elapsed?  This may happen on
>> overloaded or slow servers, if you're unlucky.
>
> Yes, a failure would happen. The same thought occurred to me during a
> long flight. And this is why the previous patch was full of meh.
>
>> At the moment I don't have any ideas on how to ensure that we don't
>> apply earlier than the given period.
>
> Attached is one approach, based on timestamp values queried from the
> standby server. We could also use Perl's localtime call to calculate
> the time delay.

Actually, the attached is better. This one relies on time() to perform
the delay checks, and takes care of things even for slow machines.
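
The shape of the new check is roughly this (a simplified sketch; the
attached patch has the details):

    my $delay = 3;    # recovery_min_apply_delay used by the test, in seconds

    # Remember when the INSERT was issued on the master.
    my $master_insert_time = time();
    $node_master->safe_psql('postgres',
        'INSERT INTO tab_int VALUES (generate_series(11, 20))');
    my $until_lsn = $node_master->safe_psql('postgres',
        'SELECT pg_current_xlog_location()');

    # Poll the standby until the LSN has been replayed, up to 90s.
    my $remaining = 90;
    while ($remaining-- > 0)
    {
        last
          if $node_standby->safe_psql('postgres',
            "SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()")
          eq 't';
        sleep 1;
    }
    die "standby never caught up" if $remaining < 0;

    # The test succeeds if and only if the LSN was applied after at
    # least the configured delay, whatever the machine's speed.
    ok(time() - $master_insert_time >= $delay,
        'standby applies WAL only after replication delay');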
--
Michael

Attachment

Re: Recovery test failure for recovery_min_apply_delay on hamster

From
Noah Misch
Date:
On Sun, Mar 13, 2016 at 10:18:46PM +0100, Michael Paquier wrote:
> On Wed, Mar 9, 2016 at 9:05 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Wed, Mar 9, 2016 at 12:29 PM, Alvaro Herrera
> > <alvherre@2ndquadrant.com> wrote:
> >> Michael Paquier wrote:
> >>> After sleeping on it (sleep is the best debugger ever), a way popped
> >>> into my mind, and I propose the attached, which refactors 005 a bit
> >>> and checks that the LSN position of the master has been applied on the
> >>> standby after at least the wanted delay. A maximum wait of 90s is
> >>> allowed, as in poll_query_until.
> >>
> >> Hmm, okay, that's great.  A question: what happens if the test itself is
> >> slow and the servers are fast, and the test doesn't manage to run two
> >> iterations before the two seconds have elapsed?  This may happen on
> >> overloaded or slow servers, if you're unlucky.
> >
> > Yes, a failure would happen. The same thought occurred to me during a
> > long flight. And this is why the previous patch was full of meh.
> >
> >> At the moment I don't have any ideas on how to ensure that we don't
> >> apply earlier than the given period.
> >
> > Attached is one approach, based on timestamp values queried from the
> > standby server. We could also use Perl's localtime call to calculate
> > the time delay.
> 
> Actually, the attached is better. This one relies on time() to perform
> the delay checks, and takes care of things even for slow machines.

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item.  Alvaro,
since you committed the patch believed to have created it, you own this open
item.  If that responsibility lies elsewhere, please let us know whose
responsibility it is to fix this.  Since new open items may be discovered at
any time and I want to plan to have them all fixed well in advance of the ship
date, I will appreciate your efforts toward speedy resolution.  Please
present, within 72 hours, a plan to fix the defect within seven days of this
message.  Thanks.



Re: Recovery test failure for recovery_min_apply_delay on hamster

From
Alvaro Herrera
Date:
Michael Paquier wrote:

> Actually, the attached is better. This one relies on time() to perform
> the delay checks, and takes care of things even for slow machines.

Thanks, pushed with some minor adjustments.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Recovery test failure for recovery_min_apply_delay on hamster

From
Alvaro Herrera
Date:
Noah Misch wrote:

> The above-described topic is currently a PostgreSQL 9.6 open item.  Alvaro,
> since you committed the patch believed to have created it, you own this open
> item.

That's correct.  Since we already had a patch available, I pushed it.
I'll wait for a few days before marking the open item as closed in the
wiki, to make sure that hamster reports success a few times.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Recovery test failure for recovery_min_apply_delay on hamster

From
Michael Paquier
Date:
On Fri, Apr 1, 2016 at 4:14 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Noah Misch wrote:
>
>> The above-described topic is currently a PostgreSQL 9.6 open item.  Alvaro,
>> since you committed the patch believed to have created it, you own this open
>> item.
>
> That's correct.  Since we already had a patch available, I pushed it.
> I'll wait for a few days before marking the open item as closed in the
> wiki, to make sure that hamster reports success a few times.

Thanks. I just did a couple of additional manual tests on hamster, and
the sporadic failure does not show up anymore, so the daily runs
should be in good shape now for this test.
-- 
Michael



Re: Recovery test failure for recovery_min_apply_delay on hamster

From
Michael Paquier
Date:
On Fri, Apr 1, 2016 at 10:02 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Apr 1, 2016 at 4:14 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Noah Misch wrote:
>>
>>> The above-described topic is currently a PostgreSQL 9.6 open item.  Alvaro,
>>> since you committed the patch believed to have created it, you own this open
>>> item.
>>
>> That's correct.  Since we already had a patch available, I pushed it.
>> I'll wait for a few days before marking the open item as closed in the
>> wiki, to make sure that hamster reports success a few times.
>
> Thanks. I just did a couple of additional manual tests on hamster, and
> the sporadic failure does not show up anymore, so the daily runs
> should be in good shape now for this test.

The failing test has passed 7 days in a row, so I am marking this
issue as fixed.
-- 
Michael