Thread: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

[HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Tom Lane
Date:
I've been experimenting with a change to pg_ctl, which I'll post
separately, to reduce its reaction time so that it reports success
more quickly after a wait for postmaster start/stop.  I found one
case in "make check-world" that got a failure when I reduced the
reaction time to ~1ms.  That's the very last test in 001_stream_rep.pl,
"cascaded slot xmin reset after startup with hs feedback reset", and
the cause appears to be that it's not allowing any delay time for a
replication slot's state to update after a postmaster restart.

This seems worth fixing independently of any possible code changes,
because it shows that this test could fail on a slow or overloaded
machine.  I couldn't find any instances of such a failure in the
buildfarm archives, but that may have a lot to do with the fact that
owners of slow buildfarm animals are (mostly?) not running this test.

Some experimentation says that the minimum delay needed to make it
work reliably on my workstation is about 100ms, so a simple patch
along the lines of the attached might be good enough.  I find this
approach conceptually dissatisfying, though, since it's still
potentially vulnerable to the failure under sufficient load.
I wonder if there is an easy way to improve that ... maybe convert
to something involving poll_query_until?

            regards, tom lane

diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 266d27c..8d6edd2 100644
*** a/src/test/recovery/t/001_stream_rep.pl
--- b/src/test/recovery/t/001_stream_rep.pl
*************** isnt($xmin, '', 'cascaded slot xmin non-
*** 265,270 ****
--- 265,272 ----
  # Xmin from a previous run should be cleared on startup.
  $node_standby_2->start;

+ sleep(1);  # need some delay before interrogating slot xmin
+
  ($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
  is($xmin, '',
      'cascaded slot xmin reset after startup with hs feedback reset');

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Craig Ringer
Date:
On 26 June 2017 at 05:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've been experimenting with a change to pg_ctl, which I'll post
> separately, to reduce its reaction time so that it reports success
> more quickly after a wait for postmaster start/stop.  I found one
> case in "make check-world" that got a failure when I reduced the
> reaction time to ~1ms.  That's the very last test in 001_stream_rep.pl,
> "cascaded slot xmin reset after startup with hs feedback reset", and
> the cause appears to be that it's not allowing any delay time for a
> replication slot's state to update after a postmaster restart.
>
> This seems worth fixing independently of any possible code changes,
> because it shows that this test could fail on a slow or overloaded
> machine.  I couldn't find any instances of such a failure in the
> buildfarm archives, but that may have a lot to do with the fact that
> owners of slow buildfarm animals are (mostly?) not running this test.
>
> Some experimentation says that the minimum delay needed to make it
> work reliably on my workstation is about 100ms, so a simple patch
> along the lines of the attached might be good enough.  I find this
> approach conceptually dissatisfying, though, since it's still
> potentially vulnerable to the failure under sufficient load.
> I wonder if there is an easy way to improve that ... maybe convert
> to something involving poll_query_until?

This should do the trick:

$node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);




-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Michael Paquier
Date:
On Mon, Jun 26, 2017 at 10:48 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> $node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
> from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);

+1 for avoiding a sleep call if it is not necessary. Fast platforms
would always pay a cost on that, and slow platforms would wait 1s (or
more!) when polling for the result.

Could it be possible to remove as well the second sleep(2) call in
this test please? This could be replaced by a similar poll using the
query Craig has just given as this needs to wait until the xmon is
cleared. Other tests expect a value so this poll cannot happen in
get_slot_xmins.
-- 
Michael



Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Mon, Jun 26, 2017 at 10:48 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> $node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
>> from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);

> +1 for avoiding a sleep call if it is not necessary. Fast platforms
> would always pay a cost on that, and slow platforms would wait 1s (or
> more!) when polling for the result.

> Could it be possible to remove as well the second sleep(2) call in
> this test please?

Yes, I'd like to see those fixed sleeps go away too.  Want to work
on a concrete patch?
        regards, tom lane



Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Craig Ringer
Date:
On 26 June 2017 at 10:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Mon, Jun 26, 2017 at 10:48 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> $node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
>>> from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);
>
>> +1 for avoiding a sleep call if it is not necessary. Fast platforms
>> would always pay a cost on that, and slow platforms would wait 1s (or
>> more!) when polling for the result.
>
>> Could it be possible to remove as well the second sleep(2) call in
>> this test please?
>
> Yes, I'd like to see those fixed sleeps go away too.  Want to work
> on a concrete patch?


Attached.


-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Michael Paquier
Date:
On Mon, Jun 26, 2017 at 11:44 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 26 June 2017 at 10:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Michael Paquier <michael.paquier@gmail.com> writes:
>>> On Mon, Jun 26, 2017 at 10:48 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>>>> $node_standby_1->poll_query_until('postgres', q[SELECT xmin IS NULL
>>>> from pg_replication_slots WHERE slot_name = '] . $slotname_2 . q[']);
>>
>>> +1 for avoiding a sleep call if it is not necessary. Fast platforms
>>> would always pay a cost on that, and slow platforms would wait 1s (or
>>> more!) when polling for the result.
>>
>>> Could it be possible to remove as well the second sleep(2) call in
>>> this test please?
>>
>> Yes, I'd like to see those fixed sleeps go away too.  Want to work
>> on a concrete patch?
>
>
> Attached.

Thanks for the patch.

As long as we are on it, there is this code block in the test:
my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
is($xmin,         '', 'non-cascaded slot xmin null with no hs_feedback');
is($catalog_xmin, '', 'non-cascaded slot xmin null with no hs_feedback');

($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
is($xmin,         '', 'cascaded slot xmin null with no hs_feedback');
is($catalog_xmin, '', 'cascaded slot xmin null with no hs_feedback');

This should be more verbose as the 2nd and 4th test should say
"catalog xmin" instead of xmin.

Also, wouldn't it be better to poll as well node_standby_1's
pg_replication_slot on slotname_2? It would really seem better to make
the nullness check conditional in get_slot_xmins instead. Sorry for
changing opinion here.
-- 
Michael



Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Craig Ringer
Date:
On 26 June 2017 at 11:06, Michael Paquier <michael.paquier@gmail.com> wrote:

> As long as we are on it, there is this code block in the test:
> my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
> is($xmin,         '', 'non-cascaded slot xmin null with no hs_feedback');
> is($catalog_xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
>
> ($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
> is($xmin,         '', 'cascaded slot xmin null with no hs_feedback');
> is($catalog_xmin, '', 'cascaded slot xmin null with no hs_feedback');
>
> This should be more verbose as the 2nd and 4th test should say
> "catalog xmin" instead of xmin.

Agree, should. Mind posting a revision?

> Also, wouldn't it be better to poll as well node_standby_1's
> pg_replication_slot on slotname_2? It would really seem better to make
> the nullness check conditional in get_slot_xmins instead. Sorry for
> changing opinion here.

I'm not sure I understand this.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Michael Paquier
Date:
On Mon, Jun 26, 2017 at 12:12 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 26 June 2017 at 11:06, Michael Paquier <michael.paquier@gmail.com> wrote:
>
>> As long as we are on it, there is this code block in the test:
>> my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
>> is($xmin,         '', 'non-cascaded slot xmin null with no hs_feedback');
>> is($catalog_xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
>>
>> ($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
>> is($xmin,         '', 'cascaded slot xmin null with no hs_feedback');
>> is($catalog_xmin, '', 'cascaded slot xmin null with no hs_feedback');
>>
>> This should be more verbose as the 2nd and 4th test should say
>> "catalog xmin" instead of xmin.
>
> Agree, should. Mind posting a revision?

Sure.

>> Also, wouldn't it be better to poll as well node_standby_1's
>> pg_replication_slot on slotname_2? It would really seem better to make
>> the nullness check conditional in get_slot_xmins instead. Sorry for
>> changing opinion here.
>
> I'm not sure I understand this.

The patch attached may explain that better. Your patch added 3 poll
phases. I think that a 4th is needed. At the same time I have found
cleaner to put the poll calls into a small wrapper.
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Mon, Jun 26, 2017 at 12:12 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> I'm not sure I understand this.

> The patch attached may explain that better. Your patch added 3 poll
> phases. I think that a 4th is needed. At the same time I have found
> cleaner to put the poll calls into a small wrapper.

Looks good as far as it goes, but I wonder whether any of the other
get_slot_xmins calls need polling too.  Don't feel a need to add such
calls until someone exhibits a failure there, but I won't be very
surprised if someone does.

Anyway, pushed this patch with minor editing.  Thanks!
        regards, tom lane



Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Michael Paquier
Date:
On Tue, Jun 27, 2017 at 3:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Looks good as far as it goes, but I wonder whether any of the other
> get_slot_xmins calls need polling too.  Don't feel a need to add such
> calls until someone exhibits a failure there, but I won't be very
> surprised if someone does.

I got the same thought, wondering as well if get_slot_xmins should be
renamed check_slot_xmins with the is() tests moved inside it as well.
Not sure if that's worth the API ugliness though.

> Anyway, pushed this patch with minor editing.  Thanks!

Thanks.
-- 
Michael



Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Tue, Jun 27, 2017 at 3:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Looks good as far as it goes, but I wonder whether any of the other
>> get_slot_xmins calls need polling too.  Don't feel a need to add such
>> calls until someone exhibits a failure there, but I won't be very
>> surprised if someone does.

And behold, we have here
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-08-08%2020%3A54%3A09

#   Failed test 'xmin of cascaded slot non-null with hs feedback'
#   at t/001_stream_rep.pl line 224.
#          got: ''
#     expected: anything else

That's one of only four calls lacking a preceding wait_slot_xmins call.
The ones at lines 173 and 177 seem safe because nothing has happened yet.
I think the one at line 300 should be safe because the standby_2 server is
shut down at that point, but is there any way that the standby_1's view
hasn't updated by the time that happens?

> I got the same thought, wondering as well if get_slot_xmins should be
> renamed check_slot_xmins with the is() tests moved inside it as well.
> Not sure if that's worth the API ugliness though.

Mmm, doesn't seem like that's worth doing, but I'm half tempted to merge
wait_slot_xmins into get_slot_xmins so you can't skip it ...
        regards, tom lane



Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Michael Paquier
Date:
On Tue, Aug 8, 2017 at 11:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> I got the same thought, wondering as well if get_slot_xmins should be
>> renamed check_slot_xmins with the is() tests moved inside it as well.
>> Not sure if that's worth the API ugliness though.
>
> Mmm, doesn't seem like that's worth doing, but I'm half tempted to merge
> wait_slot_xmins into get_slot_xmins so you can't skip it ...

Let's do that please. Merging both was my first feeling when
refactoring this test upthread. Should I send a patch?
-- 
Michael



Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Tue, Aug 8, 2017 at 11:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Michael Paquier <michael.paquier@gmail.com> writes:
>>> I got the same thought, wondering as well if get_slot_xmins should be
>>> renamed check_slot_xmins with the is() tests moved inside it as well.
>>> Not sure if that's worth the API ugliness though.

>> Mmm, doesn't seem like that's worth doing, but I'm half tempted to merge
>> wait_slot_xmins into get_slot_xmins so you can't skip it ...

> Let's do that please. Merging both was my first feeling when
> refactoring this test upthread. Should I send a patch?

Sure, have at it.
        regards, tom lane



Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Michael Paquier
Date:
On Wed, Aug 9, 2017 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Tue, Aug 8, 2017 at 11:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Michael Paquier <michael.paquier@gmail.com> writes:
>>>> I got the same thought, wondering as well if get_slot_xmins should be
>>>> renamed check_slot_xmins with the is() tests moved inside it as well.
>>>> Not sure if that's worth the API ugliness though.
>
>>> Mmm, doesn't seem like that's worth doing, but I'm half tempted to merge
>>> wait_slot_xmins into get_slot_xmins so you can't skip it ...
>
>> Let's do that please. Merging both was my first feeling when
>> refactoring this test upthread. Should I send a patch?
>
> Sure, have at it.

And here you go.
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Wed, Aug 9, 2017 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Michael Paquier <michael.paquier@gmail.com> writes:
>>> Let's do that please. Merging both was my first feeling when
>>> refactoring this test upthread. Should I send a patch?

>> Sure, have at it.

> And here you go.

Pushed with a bit of work on the comments.
        regards, tom lane



Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From
Michael Paquier
Date:
On Sun, Aug 13, 2017 at 1:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Wed, Aug 9, 2017 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Michael Paquier <michael.paquier@gmail.com> writes:
>>>> Let's do that please. Merging both was my first feeling when
>>>> refactoring this test upthread. Should I send a patch?
>
>>> Sure, have at it.
>
>> And here you go.
>
> Pushed with a bit of work on the comments.

Thanks.
-- 
Michael