Thread: Increasing timeout of poll_query_until for TAP tests
Hi all,

Lately hamster has been failing every 4-5 days on the recovery regression tests in 003 covering the recovery targets, with this:

# Postmaster PID for node "standby_2" is 20510
# Timed out while waiting for standby to catch up at t/003_recovery_targets.pl line 36.

which means that poll_query_until timed out while waiting for the standby to catch up.

Here is an example of a test run that failed:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-24%2016%3A00%3A07

This buildfarm machine is legendarily slow, and I don't see a better answer than increasing the maximum timeout of poll_query_until to put that back to green. Thoughts?
--
Michael
Attachment
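For context, poll_query_until() in src/test/perl/PostgresNode.pm simply retries a query until it returns the expected result or a timeout expires; raising that timeout is all the proposed change does. Below is a minimal, self-contained sketch of the pattern; the subroutine name, loop bound, and sleep interval are illustrative, not the actual values in PostgresNode.pm:

use strict;
use warnings;
use Time::HiRes qw(usleep);

# Retry $check until it returns true or $timeout_secs elapses.  In the real
# test suite the check runs a psql query against a node; here it is a plain
# code reference so the sketch stands alone.
sub poll_until
{
	my ($check, $timeout_secs) = @_;
	$timeout_secs = 180 unless defined $timeout_secs;   # raising this bound is the proposed fix
	my $max_attempts = $timeout_secs * 10;               # poll every 100ms
	foreach my $attempt (1 .. $max_attempts)
	{
		return 1 if $check->();
		usleep(100_000);
	}
	return 0;    # timed out; the caller reports the failure
}

# Example: wait up to 300 seconds for a flag file to appear.
poll_until(sub { -e '/tmp/standby_caught_up' }, 300)
	or die "timed out while waiting";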
Michael Paquier wrote:

> Lately hamster has been failing every 4-5 days on the recovery regression
> tests in 003 covering the recovery targets, with this:
> # Postmaster PID for node "standby_2" is 20510
> # Timed out while waiting for standby to catch up at
> t/003_recovery_targets.pl line 36.
>
> which means that poll_query_until timed out while waiting for the standby to catch up.
>
> Here is an example of a test run that failed:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-24%2016%3A00%3A07

Yeah, thanks, pushed. However, this doesn't explain all the failures we see:

1) In http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-14%2016%3A00%3A06 we see the pg_basebackup test failing. I suppose that failure is also because of slowness, though of course this patch won't fix it.

2) In http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hamster&dt=2016-06-29%2016%3A00%3A06&stg=recovery-check we see a completely different failure:

error running SQL: 'psql:<stdin>:1: ERROR:  relation "tab_int" does not exist
LINE 1: SELECT count(*) FROM tab_int
                             ^'
while running 'psql -XAtq -d port=52824 host=/tmp/or2xHglniM dbname=postgres -f - -v ON_ERROR_STOP=1' at /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1166.

Do we have an explanation for this one?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 25, 2016 at 2:38 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>
> Yeah, thanks, pushed. However, this doesn't explain all the failures we see:

I missed those ones, thanks for the reminder.

> 1) In
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-14%2016%3A00%3A06
> we see the pg_basebackup test failing. I suppose that failure is also
> because of slowness, though of course this patch won't fix it.

That's from 010_pg_basebackup, and I am not sure what's behind it. Would it be possible to add --verbose to the pg_basebackup commands? We may be able to catch the problem if it shows up again.

> 2) In
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hamster&dt=2016-06-29%2016%3A00%3A06&stg=recovery-check
> we see a completely different failure:
>
> error running SQL: 'psql:<stdin>:1: ERROR:  relation "tab_int" does not exist
> LINE 1: SELECT count(*) FROM tab_int
>                              ^'
> while running 'psql -XAtq -d port=52824 host=/tmp/or2xHglniM dbname=postgres -f - -v ON_ERROR_STOP=1' at /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1166.
>
> Do we have an explanation for this one?

Ah, yes, and that's a stupid mistake. We had better use replay_location instead of write_location. There is a risk that records have not been replayed yet even if they have been written on the standby, so it is possible that the query looking at tab_int may not see this relation.
--
Michael
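To illustrate the difference, here is roughly what the catch-up wait looks like once it compares against replay_location rather than write_location. This is a paraphrase rather than a verbatim excerpt from the tests: $node_master and $node_standby are assumed to be the usual PostgresNode handles, and the column names are those of pg_stat_replication in 9.6.

# Wait until the standby has *replayed* the master's current WAL position.
# Comparing against write_location only guarantees the records have been
# written to the standby's WAL, not that they are visible to queries yet.
my $applname = $node_standby->name;
my $caughtup_query =
    "SELECT pg_current_xlog_location() <= replay_location "
  . "FROM pg_stat_replication WHERE application_name = '$applname';";
$node_master->poll_query_until('postgres', $caughtup_query)
  or die "Timed out while waiting for standby to catch up";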
On Mon, Jul 25, 2016 at 2:52 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Ah, yes, and that's a stupid mistake. We had better use replay_location
> instead of write_location. There is a risk that records have not been
> replayed yet even if they have been written on the standby, so it is
> possible that the query looking at tab_int may not see this relation.

Or in short, the attached fixes 2) and will help provide input for 1).
--
Michael
Attachment
On Mon, Jul 25, 2016 at 10:05 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Mon, Jul 25, 2016 at 2:52 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Ah, yes, and that's a stupid mistake. We had better use replay_location
>> instead of write_location. There is a risk that records have not been
>> replayed yet even if they have been written on the standby, so it is
>> possible that the query looking at tab_int may not see this relation.
>
> Or in short, the attached fixes 2) and will help provide input for 1).

Increasing the timeout had zero effect for test 003:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-26%2016%3A00%3A07
So we're facing something else. I'll dig deeper into that by running hamster manually.
--
Michael
On Wed, Jul 27, 2016 at 10:00 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Mon, Jul 25, 2016 at 10:05 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Mon, Jul 25, 2016 at 2:52 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Ah, yes, and that's a stupid mistake. We had better use replay_location
>>> instead of write_location. There is a risk that records have not been
>>> replayed yet even if they have been written on the standby, so it is
>>> possible that the query looking at tab_int may not see this relation.
>>
>> Or in short, the attached fixes 2) and will help provide input for 1).
>
> Increasing the timeout had zero effect for test 003:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-26%2016%3A00%3A07
> So we're facing something else. I'll dig deeper into that by running
> hamster manually.

And so I have finally been able to reproduce this one: it is a timing issue in the test. First, the failure itself:

LOG:  recovery stopping after commit of transaction 548, time 2016-08-01 21:14:37.647104+09
LOG:  recovery has paused
HINT:  Execute pg_xlog_replay_resume() to continue.
LOG:  statement: SELECT '0/30122D0'::pg_lsn <= pg_last_xlog_replay_location()
[keeps waiting for this LSN to be replayed]

But by looking at the WAL records of this failed test I could see the following:

rmgr: Transaction len (rec/tot):  8/    34, tx: 547, lsn: 0/03012248, prev 0/03012208, desc: COMMIT 2016-08-01 21:14:37.514805 JST
rmgr: Transaction len (rec/tot):  8/    34, tx: 548, lsn: 0/03012270, prev 0/03012248, desc: COMMIT 2016-08-01 21:14:37.647104 JST
rmgr: Standby     len (rec/tot): 24/    50, tx:   0, lsn: 0/03012298, prev 0/03012270, desc: RUNNING_XACTS nextXid 549 latestCompletedXid 548 oldestRunningXid 549
rmgr: Heap        len (rec/tot):  3/    59, tx: 549, lsn: 0/030122D0, prev 0/03012298, desc: INSERT off 193, blkref #0: rel 1663/12404/16384 blk 8

It is no surprise that this keeps waiting forever: recovery is paused before the wanted LSN, so that LSN can never be replayed. Here using pg_xlog_replay_resume() is not the correct solution, because it would cause the node to finish recovery before we want it to, and neither is recovery_target_action = 'promote'. If we look at the test, it does the following when getting the TXID used as the recovery target:

$node_master->safe_psql('postgres',
	"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
my $recovery_txid =
	$node_master->safe_psql('postgres', "SELECT txid_current()");
my $lsn2 =
	$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");

What I think we had better do is reverse the calls to pg_current_xlog_location() and txid_current(), so that we are sure that the LSN we track for replay is lower than the real target LSN. The same problem exists when defining the timestamp target.

The attached patch does that, and it also fixes the issue in test 001 where write_location should not be used. With this patch I have let test 003 run for two hours in a row and it did not complain; previously I could see a failure at the 3rd or 4th attempt, within 30 minutes.

There is still an issue with pg_basebackup when testing stream mode and replication slots. I am digging into that one now.
--
Michael
Attachment
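A sketch of the reordering Michael describes, with a comment on why the order matters (simplified for illustration, not the exact committed hunk; $node_master is the usual PostgresNode handle):

# Capture the LSN *before* consuming the recovery-target transaction ID.
# The LSN the test later waits on with pg_last_xlog_replay_location() is then
# guaranteed to be older than the commit record of the target transaction,
# so it is always replayed before recovery pauses at the target.
my $lsn2 =
    $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
my $recovery_txid =
    $node_master->safe_psql('postgres', "SELECT txid_current()");

With the original ordering, the captured LSN could point past the target transaction's commit record, at a record recovery will never reach, as in the WAL dump above, so the replay wait hangs forever.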
On Tue, Aug 2, 2016 at 10:28 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> There is still an issue with pg_basebackup when testing stream mode
> and replication slots. I am digging into that one now.

After 5 hours of running this test in a row, and 30 attempts torturing hamster with a script running make check in an infinite loop with set -e, I am giving up on that for the time being... I have added the patch to make the tests more stable to the next CF so that it is not forgotten:
https://commitfest.postgresql.org/10/693/
--
Michael
Michael Paquier wrote:
> On Tue, Aug 2, 2016 at 10:28 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > There is still an issue with pg_basebackup when testing stream mode
> > and replication slots. I am digging into that one now.
>
> After 5 hours of running this test in a row, and 30 attempts torturing
> hamster with a script running make check in an infinite loop with set -e,
> I am giving up on that for the time being... I have added the patch to
> make the tests more stable to the next CF so that it is not forgotten:
> https://commitfest.postgresql.org/10/693/

Great, thanks for the effort, will push.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Michael Paquier wrote:

> Here using pg_xlog_replay_resume() is not the correct solution, because
> it would cause the node to finish recovery before we want it to, and
> neither is recovery_target_action = 'promote'. If we look at the test, it
> does the following when getting the TXID used as the recovery target:
>
> $node_master->safe_psql('postgres',
> 	"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
> my $recovery_txid =
> 	$node_master->safe_psql('postgres', "SELECT txid_current()");
> my $lsn2 =
> 	$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
>
> What I think we had better do is reverse the calls to
> pg_current_xlog_location() and txid_current(), so that we are sure that
> the LSN we track for replay is lower than the real target LSN. The same
> problem exists when defining the timestamp target.
>
> The attached patch does that,

Why not capture both items in a single select, such as in the attached patch?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
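The single-select variant Alvaro proposes would look roughly like this (adapted for illustration; the split on '|' relies on the unaligned psql output that safe_psql produces, and pg_current_xlog_location() was later renamed pg_current_wal_lsn() in v10):

# Capture the target txid and the LSN to wait on in one query, so neither
# value can land on the wrong side of the other in the WAL stream.
my $ret = $node_master->safe_psql('postgres',
    "SELECT pg_current_xlog_location(), txid_current();");
my ($lsn2, $recovery_txid) = split /\|/, $ret;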
On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>
>> Here using pg_xlog_replay_resume() is not the correct solution, because
>> it would cause the node to finish recovery before we want it to, and
>> neither is recovery_target_action = 'promote'. If we look at the test, it
>> does the following when getting the TXID used as the recovery target:
>>
>> $node_master->safe_psql('postgres',
>> 	"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
>> my $recovery_txid =
>> 	$node_master->safe_psql('postgres', "SELECT txid_current()");
>> my $lsn2 =
>> 	$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
>>
>> What I think we had better do is reverse the calls to
>> pg_current_xlog_location() and txid_current(), so that we are sure that
>> the LSN we track for replay is lower than the real target LSN. The same
>> problem exists when defining the timestamp target.
>>
>> The attached patch does that,
>
> Why not capture both items in a single select, such as in the attached
> patch?

Let me test this...
[... a while later ...]
This seems to work properly; 12 runs in a row have passed.
--
Michael
Michael Paquier wrote:
> On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> > Why not capture both items in a single select, such as in the attached
> > patch?
>
> Let me test this...
> [... a while later ...]
> This seems to work properly; 12 runs in a row have passed.

Okay, applied that way.

BTW, one-line long queries look awful in that perl code. I don't propose to change anything now, but I suggest that long queries be split using here-docs in new code, e.g.:

$node->safe_psql(<<EQ);
SELECT foo
FROM bar
EQ

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 4, 2016 at 2:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>> On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>>> Why not capture both items in a single select, such as in the attached
>>> patch?
>>
>> Let me test this...
>> [... a while later ...]
>> This seems to work properly; 12 runs in a row have passed.
>
> Okay, applied that way.
>
> BTW, one-line long queries look awful in that perl code. I don't
> propose to change anything now, but I suggest that long queries be
> split using here-docs in new code, e.g.:
>
> $node->safe_psql(<<EQ);
> SELECT foo
> FROM bar
> EQ

Yep, that would be a good idea. I didn't know this syntax existed. Or use qq() directly.
--
Michael
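For illustration, both spellings with safe_psql()'s database-name argument included (a sketch only, not code from the tree; $node is assumed to be a PostgresNode handle and tab_int is the table the recovery tests use):

# Here-doc form: the query keeps its own line breaks and indentation.
my $count = $node->safe_psql('postgres', <<'EOSQL');
SELECT count(*)
FROM tab_int
EOSQL

# qq() form: still a single argument, with no need to escape embedded
# double quotes, and variables interpolate as usual.
my $rows = $node->safe_psql('postgres',
    qq(SELECT count(*) FROM tab_int));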
On Thu, Aug 4, 2016 at 6:56 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Aug 4, 2016 at 2:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Michael Paquier wrote:
>>> On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>>
>>>> Why not capture both items in a single select, such as in the attached
>>>> patch?
>>>
>>> Let me test this...
>>> [... a while later ...]
>>> This seems to work properly; 12 runs in a row have passed.
>>
>> Okay, applied that way.
>>
>> BTW, one-line long queries look awful in that perl code. I don't
>> propose to change anything now, but I suggest that long queries be
>> split using here-docs in new code, e.g.:
>>
>> $node->safe_psql(<<EQ);
>> SELECT foo
>> FROM bar
>> EQ
>
> Yep, that would be a good idea. I didn't know this syntax existed. Or
> use qq() directly.

hamster has not failed even once for two weeks now, so I think we're good. Let's see if the problem with pg_basebackup pops up again.
--
Michael
On Tue, Aug 02, 2016 at 06:21:03PM -0400, Alvaro Herrera wrote:
> Michael Paquier wrote:
> > $node_master->safe_psql('postgres',
> > 	"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
> > my $recovery_txid =
> > 	$node_master->safe_psql('postgres', "SELECT txid_current()");
> > my $lsn2 =
> > 	$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
> >
> > What I think we had better do is reverse the calls to
> > pg_current_xlog_location() and txid_current(), so that we are sure that
> > the LSN we track for replay is lower than the real target LSN. The same
> > problem exists when defining the timestamp target.
> >
> > The attached patch does that,
>
> Why not capture both items in a single select, such as in the attached
> patch?

> -my $recovery_time = $node_master->safe_psql('postgres', "SELECT now()");
> -my $lsn3 =
> -  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
> +$ret =
> +  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location(), now();");
> +my ($lsn3, $recovery_time) = split /\|/, $ret;

Since now() is transaction_timestamp(), $recovery_time precedes or equals $lsn3, and this didn't close the race. Using clock_timestamp() here would work, as does using separate transactions like recovery-test-fixes.patch did. I'll shortly push a fix for this and a similar ordering problem in the standby_5 test, which first appeared subsequent to this thread.
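The approach Noah describes with separate transactions can be sketched like this (illustrative only, not necessarily the committed fix; pg_current_wal_lsn() is the v10 name of pg_current_xlog_location()):

# now()/transaction_timestamp() is frozen at transaction start, so another
# backend's commit (autovacuum, for instance) can carry a timestamp later
# than $recovery_time while its WAL position is still before $lsn3; recovery
# to $recovery_time then stops short of $lsn3 and the replay wait hangs.
# Capturing the LSN first and the timestamp in a *later* transaction
# guarantees recovery_target_time is not older than anything at or before $lsn3.
my $lsn3 =
    $node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
my $recovery_time = $node_master->safe_psql('postgres', "SELECT now()");

Alternatively, as Noah notes, clock_timestamp() could be used in the single query, since it reads the wall clock when called rather than at transaction start.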
On Sun, Dec 31, 2017 at 09:52:27PM -0800, Noah Misch wrote:
> Since now() is transaction_timestamp(), $recovery_time precedes or equals
> $lsn3, and this didn't close the race. Using clock_timestamp() here would
> work, as does using separate transactions like recovery-test-fixes.patch did.
> I'll shortly push a fix for this and a similar ordering problem in the
> standby_5 test, which first appeared subsequent to this thread.

As recovery_target_inclusive is true by default, my conclusion on the matter, which my tests on hamster (the now-dead buildfarm animal) seemed to confirm, was that getting a timestamp at least as new as the LSN, both from the same transaction, was enough to fix all the failures. And hamster was really slow. I can follow the logic of why your patch makes sense, so I agree that it is sane. Have you spotted failures in the buildfarm? This is hard to spot for people who only have access to what the buildfarm UI offers, so any reference to failures on this thread would be welcome.
--
Michael
Attachment
On Mon, Jan 01, 2018 at 07:55:37PM +0900, Michael Paquier wrote:
> On Sun, Dec 31, 2017 at 09:52:27PM -0800, Noah Misch wrote:
> > Since now() is transaction_timestamp(), $recovery_time precedes or equals
> > $lsn3, and this didn't close the race. Using clock_timestamp() here would
> > work, as does using separate transactions like recovery-test-fixes.patch did.
> > I'll shortly push a fix for this and a similar ordering problem in the
> > standby_5 test, which first appeared subsequent to this thread.
>
> As recovery_target_inclusive is true by default, my conclusion on the
> matter, which my tests on hamster (the now-dead buildfarm animal) seemed
> to confirm, was that getting a timestamp at least as new as the LSN, both
> from the same transaction, was enough to fix all the failures. And hamster
> was really slow. I can follow the logic of why your patch makes sense, so
> I agree that it is sane. Have you spotted failures in the buildfarm?

No, but I checked only the last 90 days. Earlier master (e.g. git checkout 6078770^) with the following patch reproduces the failures on every run:

--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -71,8 +71,8 @@ my ($lsn2, $recovery_txid) = split /\|/, $ret;
 $node_master->safe_psql('postgres',
 	"INSERT INTO tab_int VALUES (generate_series(2001,3000))");
 $ret =
-  $node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn(), now();");
-my ($lsn3, $recovery_time) = split /\|/, $ret;
+  $node_master->safe_psql('postgres', "SELECT pg_sleep(80), pg_current_wal_lsn(), now();");
+my ($delay_for_autovacuum, $lsn3, $recovery_time) = split /\|/, $ret;
 
 # Even more data, this time with a recovery target name
 $node_master->safe_psql('postgres',
@@ -88,6 +88,7 @@ $node_master->safe_psql('postgres',
 	"INSERT INTO tab_int VALUES (generate_series(4001,5000))");
 my $recovery_lsn =
 	$node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_master->safe_psql('postgres', 'VACUUM');
 # write some WAL
 my $lsn5 =
 	$node_master->safe_psql('postgres', "SELECT pg_current_wal_lsn();");