001_rep_changes.pl timeout on jacana/bowerbird - Mailing list pgsql-hackers

From Noah Misch
Subject 001_rep_changes.pl timeout on jacana/bowerbird
Date
Msg-id 20200426012748.GA1529639@rfd.leadboat.com
Whole thread Raw
List pgsql-hackers
Starting 2019-11-17, jacana and bowerbird (different compiler, same machine?)
have failed four times like this:

# poll_query_until timed out executing this query:
# SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE
application_name= 'tap_sub';
 
# expecting this output:
# t
# last actual query output:
# 
# with stderr:
# Looks like your test exited with 25 just after 8.
[11:06:11] t/001_rep_changes.pl .. 
Dubious, test returned 25 (wstat 6400, 0x1900)
Failed 9/17 subtests 

Every such run:
  sysname  │      snapshot       │ branch │                                             bfurl

───────────┼─────────────────────┼────────┼───────────────────────────────────────────────────────────────────────────────────────────────
 bowerbird │ 2019-11-17 15:22:42 │ HEAD   │
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2019-11-17%2015%3A22%3A42
 bowerbird │ 2020-01-10 17:30:49 │ HEAD   │
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2020-01-10%2017%3A30%3A49
 jacana    │ 2020-04-05 00:00:27 │ HEAD   │
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-04-05%2000%3A00%3A27
 jacana    │ 2020-04-16 00:00:27 │ HEAD   │
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-04-16%2000%3A00%3A27

The dates and branches suggest either a v13 regression (hence my concern) or a
regression in the underlying machine.  In each affected run, other tests
completed at their normal speed, but 001_rep_changes.pl failed after 397s (a
normal run takes <30s).  In the publisher log, the failed run[1] is missing
"replication connection authorized", "IDENTIFY_SYSTEM", etc.  Subscriber logs
do not differ; failed and good runs both have 'logical replication apply
worker for subscription "tap_sub" has started'.  That suggests a subscriber
stuck in ApplyWorkerMain(), somewhere between that log message and the end of
walrcv_connect.  I failed to reproduce this.  Andrew, are you interested in
attempting to reproduce it and/or identify the blockage?

Thanks,
nm


==== [1] master log with delay (failed run)
...
2020-04-15 20:56:40.334 EDT [5e97ad48.247c:3] 001_rep_changes.pl LOG:  statement: DELETE FROM tab_rep
2020-04-15 20:56:40.334 EDT [5e97ad48.247c:4] 001_rep_changes.pl LOG:  disconnection: session time: 0:00:00.015
user=pgrunnerdatabase=postgres host=127.0.0.1 port=65317
 
2020-04-15 20:56:40.381 EDT [5e97ad39.1964:4] LOG:  received fast shutdown request
2020-04-15 20:56:40.381 EDT [5e97ad39.1964:5] LOG:  aborting any active transactions
2020-04-15 20:56:40.396 EDT [5e97ad39.1964:6] LOG:  background worker "logical replication launcher" (PID 9920) exited
withexit code 1
 
2020-04-15 20:56:40.396 EDT [5e97ad39.1a7c:1] LOG:  shutting down
2020-04-15 20:56:40.538 EDT [5e97ad47.1ce8:9] tap_sub LOG:  disconnection: session time: 0:00:00.541 user=pgrunner
database=postgreshost=127.0.0.1 port=65313
 
2020-04-15 20:56:40.569 EDT [5e97ad39.1964:7] LOG:  database system is shut down
2020-04-15 20:56:40.780 EDT [5e97ad48.c0:1] LOG:  starting PostgreSQL 13devel on x86_64-w64-mingw32, compiled by
x86_64-w64-mingw32-gcc.exe(x86_64-win32-sjlj-rev0, Built by MinGW-W64 project) 7.3.0, 64-bit
 
2020-04-15 20:56:40.780 EDT [5e97ad48.c0:2] LOG:  listening on IPv4 address "127.0.0.1", port 58418
2020-04-15 20:56:40.811 EDT [5e97ad48.36c:1] LOG:  database system was shut down at 2020-04-15 20:56:40 EDT
2020-04-15 20:56:40.873 EDT [5e97ad48.c0:3] LOG:  database system is ready to accept connections
2020-04-15 20:56:41.015 EDT [5e97ad49.1888:1] [unknown] LOG:  connection received: host=127.0.0.1 port=65319
2020-04-15 20:56:41.029 EDT [5e97ad49.1888:2] [unknown] LOG:  connection authorized: user=pgrunner database=postgres
application_name=001_rep_changes.pl
2020-04-15 20:56:41.044 EDT [5e97ad49.1888:3] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <=
replay_lsnAND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
 
... repeated a total of 1800 times ...
2020-04-15 21:02:58.960 EDT [5e97aec2.21f8:1] [unknown] LOG:  connection received: host=127.0.0.1 port=50920
2020-04-15 21:02:58.960 EDT [5e97aec2.21f8:2] [unknown] LOG:  connection authorized: user=pgrunner database=postgres
application_name=001_rep_changes.pl
2020-04-15 21:02:58.975 EDT [5e97aec2.21f8:3] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <=
replay_lsnAND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
 
2020-04-15 21:02:58.975 EDT [5e97aec2.21f8:4] 001_rep_changes.pl LOG:  disconnection: session time: 0:00:00.015
user=pgrunnerdatabase=postgres host=127.0.0.1 port=50920
 
<test script gives up hope>
2020-04-15 21:02:59.148 EDT [5e97ad48.c0:4] LOG:  received immediate shutdown request
2020-04-15 21:02:59.148 EDT [5e97ad48.2148:1] WARNING:  terminating connection because of crash of another server
process
2020-04-15 21:02:59.148 EDT [5e97ad48.2148:2] DETAIL:  The postmaster has commanded this server process to roll back
thecurrent transaction and exit, because another server process exited abnormally and possibly corrupted shared
memory.
2020-04-15 21:02:59.148 EDT [5e97ad48.2148:3] HINT:  In a moment you should be able to reconnect to the database and
repeatyour command.
 
2020-04-15 21:02:59.164 EDT [5e97ad48.c0:5] LOG:  database system is shut down

==== master log without delay (good run)
...
2020-04-15 11:17:24.868 EDT [5e972584.19a0:3] 001_rep_changes.pl LOG:  statement: DELETE FROM tab_rep
2020-04-15 11:17:24.868 EDT [5e972584.19a0:4] 001_rep_changes.pl LOG:  disconnection: session time: 0:00:00.015
user=pgrunnerdatabase=postgres host=127.0.0.1 port=60986
 
2020-04-15 11:17:24.931 EDT [5e972575.1594:4] LOG:  received fast shutdown request
2020-04-15 11:17:24.931 EDT [5e972575.1594:5] LOG:  aborting any active transactions
2020-04-15 11:17:24.947 EDT [5e972575.1594:6] LOG:  background worker "logical replication launcher" (PID 9988) exited
withexit code 1
 
2020-04-15 11:17:24.947 EDT [5e972575.185c:1] LOG:  shutting down
2020-04-15 11:17:25.072 EDT [5e972584.560:9] tap_sub LOG:  disconnection: session time: 0:00:00.636 user=pgrunner
database=postgreshost=127.0.0.1 port=60983
 
2020-04-15 11:17:25.104 EDT [5e972575.1594:7] LOG:  database system is shut down
2020-04-15 11:17:25.312 EDT [5e972585.164:1] LOG:  starting PostgreSQL 13devel on x86_64-w64-mingw32, compiled by
x86_64-w64-mingw32-gcc.exe(x86_64-win32-sjlj-rev0, Built by MinGW-W64 project) 7.3.0, 64-bit
 
2020-04-15 11:17:25.312 EDT [5e972585.164:2] LOG:  listening on IPv4 address "127.0.0.1", port 60213
2020-04-15 11:17:25.343 EDT [5e972585.c7c:1] LOG:  database system was shut down at 2020-04-15 11:17:25 EDT
2020-04-15 11:17:25.390 EDT [5e972585.164:3] LOG:  database system is ready to accept connections
2020-04-15 11:17:25.531 EDT [5e972585.d40:1] [unknown] LOG:  connection received: host=127.0.0.1 port=60988
2020-04-15 11:17:25.552 EDT [5e972585.d40:2] [unknown] LOG:  connection authorized: user=pgrunner database=postgres
application_name=001_rep_changes.pl
2020-04-15 11:17:25.567 EDT [5e972585.d40:3] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <=
replay_lsnAND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
 
2020-04-15 11:17:25.583 EDT [5e972585.d40:4] 001_rep_changes.pl LOG:  disconnection: session time: 0:00:00.051
user=pgrunnerdatabase=postgres host=127.0.0.1 port=60988
 
2020-04-15 11:17:25.646 EDT [5e972585.1044:1] [unknown] LOG:  connection received: host=127.0.0.1 port=60987
2020-04-15 11:17:25.646 EDT [5e972585.1044:2] [unknown] LOG:  replication connection authorized: user=pgrunner
application_name=tap_sub
2020-04-15 11:17:25.661 EDT [5e972585.1044:3] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-04-15 11:17:25.661 EDT [5e972585.1044:4] tap_sub LOG:  received replication command: START_REPLICATION SLOT
"tap_sub"LOGICAL 0/15C1DF8 (proto_version '1', publication_names '"tap_pub_ins_only"')
 
2020-04-15 11:17:25.661 EDT [5e972585.1044:5] tap_sub LOG:  starting logical decoding for slot "tap_sub"
2020-04-15 11:17:25.661 EDT [5e972585.1044:6] tap_sub DETAIL:  Streaming transactions committing after 0/15A7678,
readingWAL from 0/15A7678.
 
2020-04-15 11:17:25.661 EDT [5e972585.1044:7] tap_sub LOG:  logical decoding found consistent point at 0/15A7678
2020-04-15 11:17:25.661 EDT [5e972585.1044:8] tap_sub DETAIL:  There are no running transactions.
2020-04-15 11:17:25.787 EDT [5e972585.1878:1] [unknown] LOG:  connection received: host=127.0.0.1 port=60993
2020-04-15 11:17:25.803 EDT [5e972585.1878:2] [unknown] LOG:  connection authorized: user=pgrunner database=postgres
application_name=001_rep_changes.pl
2020-04-15 11:17:25.803 EDT [5e972585.1878:3] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <=
replay_lsnAND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
 
2020-04-15 11:17:25.819 EDT [5e972585.1878:4] 001_rep_changes.pl LOG:  disconnection: session time: 0:00:00.031
user=pgrunnerdatabase=postgres host=127.0.0.1 port=60993
 
2020-04-15 11:17:26.102 EDT [5e972586.fcc:1] [unknown] LOG:  connection received: host=127.0.0.1 port=60996
2020-04-15 11:17:26.102 EDT [5e972586.fcc:2] [unknown] LOG:  connection authorized: user=pgrunner database=postgres
application_name=001_rep_changes.pl
2020-04-15 11:17:26.117 EDT [5e972586.fcc:3] 001_rep_changes.pl LOG:  statement: ALTER PUBLICATION tap_pub_ins_only SET
(publish= 'insert, delete')
 
...



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Next
From: Jesse Zhang
Date:
Subject: Fix compilation failure against LLVM 11