Re: Increasing timeout of poll_query_until for TAP tests - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: Increasing timeout of poll_query_until for TAP tests |
Date | |
Msg-id | CAB7nPqSeW9nbBaUYMARF6COWCsfU6tJZC9mac8K8zRN5DrUCbA@mail.gmail.com Whole thread Raw |
In response to | Re: Increasing timeout of poll_query_until for TAP tests (Michael Paquier <michael.paquier@gmail.com>) |
Responses |
Re: Increasing timeout of poll_query_until for TAP tests
Re: Increasing timeout of poll_query_until for TAP tests |
List | pgsql-hackers |
On Wed, Jul 27, 2016 at 10:00 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Mon, Jul 25, 2016 at 10:05 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Mon, Jul 25, 2016 at 2:52 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> Ah, yes, and that's a stupid mistake. We had better use >>> replay_location instead of write_location. There is a risk that >>> records have not been replayed yet even if they have been written on >>> the standby, so it is possible that the query looking at tab_int may >>> not see this relation. >> >> Or in short, the attached fixes 2) and will help providing input for 1).. > > Increasing the timeout had zero effect for test 003: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-26%2016%3A00%3A07 > So we're facing something else. I'll dig into that deeper using > manually hamster. And so I have been finally able to reproduce this one, and this is a timing issue in the test. First see the failure itself: LOG: recovery stopping after commit of transaction 548, time 2016-08-01 21:14:37.647104+09 LOG: recovery has paused HINT: Execute pg_xlog_replay_resume() to continue. LOG: statement: SELECT '0/30122D0'::pg_lsn <= pg_last_xlog_replay_location() [keeps waiting for this LSN to be replayed] But by looking at the WAL records of this failed test I could see the following rmgr: Transaction len (rec/tot): 8/ 34, tx: 547, lsn: 0/03012248, prev 0/03012208, desc: COMMIT 2016-08-01 21:14:37.514805 JST rmgr: Transaction len (rec/tot): 8/ 34, tx: 548, lsn: 0/03012270, prev 0/03012248, desc: COMMIT 2016-08-01 21:14:37.647104 JST rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn: 0/03012298, prev 0/03012270, desc: RUNNING_XACTS nextXid 549 latestCompletedXid 548 oldestRunningXid 549 rmgr: Heap len (rec/tot): 3/ 59, tx: 549, lsn: 0/030122D0, prev 0/03012298, desc: INSERT off 193, blkref #0: rel 1663/12404/16384 blk 8 It is not a surprise if this keeps waiting forever. As the recovery is paused before it cannot replay the wanted LSN. Here using pg_xlog_replay_resume() is not the correct solution because this would cause the node to finish recovery before we want it to, and so is recovery_target_action = 'promote'. If we look at the test, it is doing the following when getting the TXID that is used as recovery target: $node_master->safe_psql('postgres', "INSERT INTO tab_int VALUES (generate_series(1001,2000))"); my $recovery_txid = $node_master->safe_psql('postgres', "SELECT txid_current()"); my $lsn2 = $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();"); What I think we had better do is reverse the calls pg_current_xlog_location() and txid_current() so as we are sure that the LSN we track for replay is lower than the real target LSN. The same problem exists when defining the timestamp target. The patch attached does that, and it fixes as well the issue with test 001 regarding write_location that should not be used. With this patch I have let the test 003 run for two hours in a row and it did not complain. Previously I was able to see a failure at the 3rd~4th attempts, within 30 minutes. There is still an issue with pg_basebackup when testing stream mode and replication slots. I am digging into this one now.. -- Michael
Attachment
pgsql-hackers by date: