Thread: Increasing timeout of poll_query_until for TAP tests

Increasing timeout of poll_query_until for TAP tests

From

Michael Paquier

Date:

25 July 2016, 04:22:46

Hi all,

Lately hamster is failing every 4/5 days on the recovery regression
tests in 003 covering the recovery targets, with that:
# Postmaster PID for node "standby_2" is 20510
#
Timed out while waiting for standby to catch up at
t/003_recovery_targets.pl line 36.

Which means that poll_for_query timed out for the standby to catch up..

Here is an example of test that failed:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-24%2016%3A00%3A07

This buildfarm machine is legendary known for its slowness, and I
don't see a better answer to that than increasing the max timeout of
poll_query_until to put that back to green.
Thoughts?
--
Michael

Attachment

fix-recovery-tap-failures.patch

Re: Increasing timeout of poll_query_until for TAP tests

From

Alvaro Herrera

Date:

25 July 2016, 05:38:20

Michael Paquier wrote:

> Lately hamster is failing every 4/5 days on the recovery regression
> tests in 003 covering the recovery targets, with that:
> # Postmaster PID for node "standby_2" is 20510
> #
> Timed out while waiting for standby to catch up at
> t/003_recovery_targets.pl line 36.
> 
> Which means that poll_for_query timed out for the standby to catch up..
> 
> Here is an example of test that failed:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-24%2016%3A00%3A07

Yeah, thanks, pushed.  However this doesn't explain all the failures we see:

1) In
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-14%2016%3A00%3A06
we see the pg_basebackup test failing.  I suppose that failure is also
because of slowness, though of course this patch won't fix it.

2) In
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hamster&dt=2016-06-29%2016%3A00%3A06&stg=recovery-check
we see a completely different failure:

error running SQL: 'psql:<stdin>:1: ERROR:  relation "tab_int" does not exist
LINE 1: SELECT count(*) FROM tab_int                            ^'
while running 'psql -XAtq -d port=52824 host=/tmp/or2xHglniM dbname=postgres -f - -v ON_ERROR_STOP=1' at
/home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pmline 1166.

Do we have an explanation for this one?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Increasing timeout of poll_query_until for TAP tests

From

Michael Paquier

Date:

25 July 2016, 05:52:25

On Mon, Jul 25, 2016 at 2:38 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
> Yeah, thanks, pushed.  However this doesn't explain all the failures we see:

I missed those ones, thanks for the reminder.

> 1) In
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-14%2016%3A00%3A06
> we see the pg_basebackup test failing.  I suppose that failure is also
> because of slowness, though of course this patch won't fix it.

That's from 010_pg_basebackup... And I am not sure what's behind that.
Could it be possible to add --verbose to the commands of
pg_basebackup? We may be able to catch the problem if it shows up
again.

> 2) In
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hamster&dt=2016-06-29%2016%3A00%3A06&stg=recovery-check
> we see a completely different failure:
>
> error running SQL: 'psql:<stdin>:1: ERROR:  relation "tab_int" does not exist
> LINE 1: SELECT count(*) FROM tab_int
>                              ^'
> while running 'psql -XAtq -d port=52824 host=/tmp/or2xHglniM dbname=postgres -f - -v ON_ERROR_STOP=1' at
/home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pmline 1166.
 
>
> Do we have an explanation for this one?

Ah, yes, and that's a stupid mistake. We had better use
replay_location instead of write_location. There is a risk that
records have not been replayed yet even if they have been written on
the standby, so it is possible that the query looking at tab_int may
not see this relation.
-- 
Michael

Re: Increasing timeout of poll_query_until for TAP tests

From

Michael Paquier

Date:

25 July 2016, 13:05:37

On Mon, Jul 25, 2016 at 2:52 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Ah, yes, and that's a stupid mistake. We had better use
> replay_location instead of write_location. There is a risk that
> records have not been replayed yet even if they have been written on
> the standby, so it is possible that the query looking at tab_int may
> not see this relation.

Or in short, the attached fixes 2) and will help providing input for 1)..
--
Michael

Attachment

tap-fixes.patch

Re: Increasing timeout of poll_query_until for TAP tests

From

Michael Paquier

Date:

27 July 2016, 01:01:07

On Mon, Jul 25, 2016 at 10:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Jul 25, 2016 at 2:52 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Ah, yes, and that's a stupid mistake. We had better use
>> replay_location instead of write_location. There is a risk that
>> records have not been replayed yet even if they have been written on
>> the standby, so it is possible that the query looking at tab_int may
>> not see this relation.
>
> Or in short, the attached fixes 2) and will help providing input for 1)..

Increasing the timeout had zero effect for test 003:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-26%2016%3A00%3A07
So we're facing something else. I'll dig into that deeper using
manually hamster.
-- 
Michael

Re: Increasing timeout of poll_query_until for TAP tests

From

Michael Paquier

Date:

02 August 2016, 01:28:12

On Wed, Jul 27, 2016 at 10:00 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Jul 25, 2016 at 10:05 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Mon, Jul 25, 2016 at 2:52 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Ah, yes, and that's a stupid mistake. We had better use
>>> replay_location instead of write_location. There is a risk that
>>> records have not been replayed yet even if they have been written on
>>> the standby, so it is possible that the query looking at tab_int may
>>> not see this relation.
>>
>> Or in short, the attached fixes 2) and will help providing input for 1)..
>
> Increasing the timeout had zero effect for test 003:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-26%2016%3A00%3A07
> So we're facing something else. I'll dig into that deeper using
> manually hamster.

And so I have been finally able to reproduce this one, and this is a
timing issue in the test.

First see the failure itself:
LOG:  recovery stopping after commit of transaction 548, time
2016-08-01 21:14:37.647104+09
LOG:  recovery has paused
HINT:  Execute pg_xlog_replay_resume() to continue.
LOG:  statement: SELECT '0/30122D0'::pg_lsn <= pg_last_xlog_replay_location()
[keeps waiting for this LSN to be replayed]

But by looking at the WAL records of this failed test I could see the following
rmgr: Transaction len (rec/tot):      8/    34, tx:        547, lsn:
0/03012248, prev 0/03012208, desc: COMMIT 2016-08-01 21:14:37.514805
JST
rmgr: Transaction len (rec/tot):      8/    34, tx:        548, lsn:
0/03012270, prev 0/03012248, desc: COMMIT 2016-08-01 21:14:37.647104
JST
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn:
0/03012298, prev 0/03012270, desc: RUNNING_XACTS nextXid 549
latestCompletedXid 548 oldestRunningXid 549
rmgr: Heap        len (rec/tot):      3/    59, tx:        549, lsn:
0/030122D0, prev 0/03012298, desc: INSERT off 193, blkref #0: rel
1663/12404/16384 blk 8
It is not a surprise if this keeps waiting forever. As the recovery is
paused before it cannot replay the wanted LSN.

Here using pg_xlog_replay_resume() is not the correct solution because
this would cause the node to finish recovery before we want it to, and
so is recovery_target_action = 'promote'. If we look at the test, it
is doing the following when getting the TXID that is used as recovery
target:
$node_master->safe_psql('postgres',
        "INSERT INTO tab_int VALUES (generate_series(1001,2000))");
my $recovery_txid =
  $node_master->safe_psql('postgres', "SELECT txid_current()");
my $lsn2 =
  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
What I think we had better do is reverse the calls
pg_current_xlog_location() and txid_current() so as we are sure that
the LSN we track for replay is lower than the real target LSN. The
same problem exists when defining the timestamp target.

The patch attached does that, and it fixes as well the issue with test
001 regarding write_location that should not be used. With this patch
I have let the test 003 run for two hours in a row and it did not
complain. Previously I was able to see a failure at the 3rd~4th
attempts, within 30 minutes.

There is still an issue with pg_basebackup when testing stream mode
and replication slots. I am digging into this one now..
--
Michael

Attachment

recovery-test-fixes.patch

Re: Increasing timeout of poll_query_until for TAP tests

From

Michael Paquier

Date:

02 August 2016, 06:11:36

On Tue, Aug 2, 2016 at 10:28 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> There is still an issue with pg_basebackup when testing stream mode
> and replication slots. I am digging into this one now..

After 5 hours running this test in a row and 30 attempts torturing
hamster with a script running make check in an infinite loop with set
-e I am giving up on that for the time being... I have added the patch
to make the tests more stable to next CF so as it is not forgotten:
https://commitfest.postgresql.org/10/693/
-- 
Michael

Re: Increasing timeout of poll_query_until for TAP tests

From

Alvaro Herrera

Date:

02 August 2016, 18:36:54

Michael Paquier wrote:
> On Tue, Aug 2, 2016 at 10:28 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > There is still an issue with pg_basebackup when testing stream mode
> > and replication slots. I am digging into this one now..
> 
> After 5 hours running this test in a row and 30 attempts torturing
> hamster with a script running make check in an infinite loop with set
> -e I am giving up on that for the time being... I have added the patch
> to make the tests more stable to next CF so as it is not forgotten:
> https://commitfest.postgresql.org/10/693/

Great, thanks for the effort, will push.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Increasing timeout of poll_query_until for TAP tests

From

Alvaro Herrera

Date:

02 August 2016, 22:21:12

Michael Paquier wrote:

> Here using pg_xlog_replay_resume() is not the correct solution because
> this would cause the node to finish recovery before we want it to, and
> so is recovery_target_action = 'promote'. If we look at the test, it
> is doing the following when getting the TXID that is used as recovery
> target:
> $node_master->safe_psql('postgres',
>         "INSERT INTO tab_int VALUES (generate_series(1001,2000))");
> my $recovery_txid =
>   $node_master->safe_psql('postgres', "SELECT txid_current()");
> my $lsn2 =
>   $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
> What I think we had better do is reverse the calls
> pg_current_xlog_location() and txid_current() so as we are sure that
> the LSN we track for replay is lower than the real target LSN. The
> same problem exists when defining the timestamp target.
>
> The patch attached does that,

Why not capture both items in a single select, such as in the attached
patch?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

recovery-test-fixes-2.patch

Re: Increasing timeout of poll_query_until for TAP tests

From

Michael Paquier

Date:

03 August 2016, 01:40:29

On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>
>> Here using pg_xlog_replay_resume() is not the correct solution because
>> this would cause the node to finish recovery before we want it to, and
>> so is recovery_target_action = 'promote'. If we look at the test, it
>> is doing the following when getting the TXID that is used as recovery
>> target:
>> $node_master->safe_psql('postgres',
>>         "INSERT INTO tab_int VALUES (generate_series(1001,2000))");
>> my $recovery_txid =
>>   $node_master->safe_psql('postgres', "SELECT txid_current()");
>> my $lsn2 =
>>   $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
>> What I think we had better do is reverse the calls
>> pg_current_xlog_location() and txid_current() so as we are sure that
>> the LSN we track for replay is lower than the real target LSN. The
>> same problem exists when defining the timestamp target.
>>
>> The patch attached does that,
>
> Why not capture both items in a single select, such as in the attached
> patch?

Let me test this....
[... A while after ...]
This looks to work properly. 12 runs in a row have passed.
-- 
Michael

Re: Increasing timeout of poll_query_until for TAP tests

From

Alvaro Herrera

Date:

03 August 2016, 17:34:18

Michael Paquier wrote:
> On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > Why not capture both items in a single select, such as in the attached
> > patch?
> 
> Let me test this....
> [... A while after ...]
> This looks to work properly. 12 runs in a row have passed.

Okay, applied that way.

BTW, one-line long queries look awful in that perl code.  I don't
propose to change anything now, but I propose that long queries are
split using here-docs in new code,

$node->safe_psql(<<EQ);
SELECT foo FROM bar
EQ

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Increasing timeout of poll_query_until for TAP tests

From

Michael Paquier

Date:

03 August 2016, 21:57:03

On Thu, Aug 4, 2016 at 2:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>> On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
>> > Why not capture both items in a single select, such as in the attached
>> > patch?
>>
>> Let me test this....
>> [... A while after ...]
>> This looks to work properly. 12 runs in a row have passed.
>
> Okay, applied that way.
>
> BTW, one-line long queries look awful in that perl code.  I don't
> propose to change anything now, but I propose that long queries are
> split using here-docs in new code,
>
> $node->safe_psql(<<EQ);
> SELECT foo
>   FROM bar
> EQ

Yep, that would be a good idea. I didn't know this grammar existed. Or
use qq() directly.
-- 
Michael

Re: Increasing timeout of poll_query_until for TAP tests

From

Michael Paquier

Date:

18 August 2016, 03:18:43

On Thu, Aug 4, 2016 at 6:56 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Aug 4, 2016 at 2:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Michael Paquier wrote:
>>> On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>>> > Why not capture both items in a single select, such as in the attached
>>> > patch?
>>>
>>> Let me test this....
>>> [... A while after ...]
>>> This looks to work properly. 12 runs in a row have passed.
>>
>> Okay, applied that way.
>>
>> BTW, one-line long queries look awful in that perl code.  I don't
>> propose to change anything now, but I propose that long queries are
>> split using here-docs in new code,
>>
>> $node->safe_psql(<<EQ);
>> SELECT foo
>>   FROM bar
>> EQ
>
> Yep, that would be a good idea. I didn't know this grammar existed. Or
> use qq() directly.

hamster has not failed even once for two weeks now, so I think that
we're good. Let's see if the problem with pg_basebackup pops up once
again.
-- 
Michael

Re: Increasing timeout of poll_query_until for TAP tests

From

Noah Misch

Date:

01 January 2018, 08:52:27

On Tue, Aug 02, 2016 at 06:21:03PM -0400, Alvaro Herrera wrote:
> Michael Paquier wrote:
> > $node_master->safe_psql('postgres',
> >         "INSERT INTO tab_int VALUES (generate_series(1001,2000))");
> > my $recovery_txid =
> >   $node_master->safe_psql('postgres', "SELECT txid_current()");
> > my $lsn2 =
> >   $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
> > What I think we had better do is reverse the calls
> > pg_current_xlog_location() and txid_current() so as we are sure that
> > the LSN we track for replay is lower than the real target LSN. The
> > same problem exists when defining the timestamp target.
> > 
> > The patch attached does that,
> 
> Why not capture both items in a single select, such as in the attached
> patch?

> -my $recovery_time = $node_master->safe_psql('postgres', "SELECT now()");
> -my $lsn3 =
> -  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
> +$ret =
> +  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location(), now();");
> +my ($lsn3, $recovery_time) = split /\|/, $ret;

Since now() is transaction_timestamp(), $recovery_time precedes or equals
$lsn3, and this didn't close the race.  Using clock_timestamp() here would
work, as does using separate transactions like recovery-test-fixes.patch did.
I'll shortly push a fix for this and a similar ordering problem in the
standby_5 test, which first appeared subsequent to this thread.

Re: Increasing timeout of poll_query_until for TAP tests

From

Michael Paquier

Date:

01 January 2018, 13:55:37

On Sun, Dec 31, 2017 at 09:52:27PM -0800, Noah Misch wrote:
> Since now() is transaction_timestamp(), $recovery_time precedes or equals
> $lsn3, and this didn't close the race.  Using clock_timestamp() here would
> work, as does using separate transactions like recovery-test-fixes.patch did.
> I'll shortly push a fix for this and a similar ordering problem in the
> standby_5 test, which first appeared subsequent to this thread.

As recovery_target_inclusive is true by default, my conclusion on the
matter, which was something that my tests on hamster, the now-dead
buildfarm animal seemed to confirm, is that just getting a timestamp at
least the value of the LSN from the same transaction was enough to fix
all the failures. And hamster was really slow. I can follow why
logically your patch makes sense, so I agree that this is sane. Have you
spotted failures from the buildfarm? This is hard to spot from any
people which just have access to what the buildfarm UI offers, so any
reference to failures on this thread would be welcome.
--
Michael

Attachment

signature.asc

Re: Increasing timeout of poll_query_until for TAP tests

From

Noah Misch

Date:

01 January 2018, 21:18:01

On Mon, Jan 01, 2018 at 07:55:37PM +0900, Michael Paquier wrote:
> On Sun, Dec 31, 2017 at 09:52:27PM -0800, Noah Misch wrote:
> > Since now() is transaction_timestamp(), $recovery_time precedes or equals
> > $lsn3, and this didn't close the race.  Using clock_timestamp() here would
> > work, as does using separate transactions like recovery-test-fixes.patch did.
> > I'll shortly push a fix for this and a similar ordering problem in the
> > standby_5 test, which first appeared subsequent to this thread.
> 
> As recovery_target_inclusive is true by default, my conclusion on the
> matter, which was something that my tests on hamster, the now-dead
> buildfarm animal seemed to confirm, is that just getting a timestamp at
> least the value of the LSN from the same transaction was enough to fix
> all the failures. And hamster was really slow. I can follow why
> logically your patch makes sense, so I agree that this is sane. Have you
> spotted failures from the buildfarm?

No, but I checked only the last 90 days.  Earlier master (e.g. git checkout
6078770^) with the following patch reproduces the failures on every run:

--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -71,8 +71,8 @@ my ($lsn2, $recovery_txid) = split /\|/, $ret;
 $node_master->safe_psql('postgres',
     "INSERT INTO tab_int VALUES (generate_series(2001,3000))");
 $ret =
-  $node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn(), now();");
-my ($lsn3, $recovery_time) = split /\|/, $ret;
+  $node_master->safe_psql('postgres', "SELECT pg_sleep(80), pg_current_wal_lsn(), now();");
+my ($delay_for_autovacuum, $lsn3, $recovery_time) = split /\|/, $ret;
 
 # Even more data, this time with a recovery target name
 $node_master->safe_psql('postgres',
@@ -88,6 +88,7 @@ $node_master->safe_psql('postgres',
     "INSERT INTO tab_int VALUES (generate_series(4001,5000))");
 my $recovery_lsn =
   $node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_master->safe_psql('postgres', 'VACUUM');  # write some WAL
 my $lsn5 =
   $node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn();");