001_rep_changes.pl timeout on jacana/bowerbird - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | 001_rep_changes.pl timeout on jacana/bowerbird |
Date | |
Msg-id | 20200426012748.GA1529639@rfd.leadboat.com Whole thread Raw |
List | pgsql-hackers |
Starting 2019-11-17, jacana and bowerbird (different compiler, same machine?) have failed four times like this: # poll_query_until timed out executing this query: # SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name= 'tap_sub'; # expecting this output: # t # last actual query output: # # with stderr: # Looks like your test exited with 25 just after 8. [11:06:11] t/001_rep_changes.pl .. Dubious, test returned 25 (wstat 6400, 0x1900) Failed 9/17 subtests Every such run: sysname │ snapshot │ branch │ bfurl ───────────┼─────────────────────┼────────┼─────────────────────────────────────────────────────────────────────────────────────────────── bowerbird │ 2019-11-17 15:22:42 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2019-11-17%2015%3A22%3A42 bowerbird │ 2020-01-10 17:30:49 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2020-01-10%2017%3A30%3A49 jacana │ 2020-04-05 00:00:27 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-04-05%2000%3A00%3A27 jacana │ 2020-04-16 00:00:27 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-04-16%2000%3A00%3A27 The dates and branches suggest either a v13 regression (hence my concern) or a regression in the underlying machine. In each affected run, other tests completed at their normal speed, but 001_rep_changes.pl failed after 397s (a normal run takes <30s). In the publisher log, the failed run[1] is missing "replication connection authorized", "IDENTIFY_SYSTEM", etc. Subscriber logs do not differ; failed and good runs both have 'logical replication apply worker for subscription "tap_sub" has started'. That suggests a subscriber stuck in ApplyWorkerMain(), somewhere between that log message and the end of walrcv_connect. I failed to reproduce this. Andrew, are you interested in attempting to reproduce it and/or identify the blockage? Thanks, nm ==== [1] master log with delay (failed run) ... 2020-04-15 20:56:40.334 EDT [5e97ad48.247c:3] 001_rep_changes.pl LOG: statement: DELETE FROM tab_rep 2020-04-15 20:56:40.334 EDT [5e97ad48.247c:4] 001_rep_changes.pl LOG: disconnection: session time: 0:00:00.015 user=pgrunnerdatabase=postgres host=127.0.0.1 port=65317 2020-04-15 20:56:40.381 EDT [5e97ad39.1964:4] LOG: received fast shutdown request 2020-04-15 20:56:40.381 EDT [5e97ad39.1964:5] LOG: aborting any active transactions 2020-04-15 20:56:40.396 EDT [5e97ad39.1964:6] LOG: background worker "logical replication launcher" (PID 9920) exited withexit code 1 2020-04-15 20:56:40.396 EDT [5e97ad39.1a7c:1] LOG: shutting down 2020-04-15 20:56:40.538 EDT [5e97ad47.1ce8:9] tap_sub LOG: disconnection: session time: 0:00:00.541 user=pgrunner database=postgreshost=127.0.0.1 port=65313 2020-04-15 20:56:40.569 EDT [5e97ad39.1964:7] LOG: database system is shut down 2020-04-15 20:56:40.780 EDT [5e97ad48.c0:1] LOG: starting PostgreSQL 13devel on x86_64-w64-mingw32, compiled by x86_64-w64-mingw32-gcc.exe(x86_64-win32-sjlj-rev0, Built by MinGW-W64 project) 7.3.0, 64-bit 2020-04-15 20:56:40.780 EDT [5e97ad48.c0:2] LOG: listening on IPv4 address "127.0.0.1", port 58418 2020-04-15 20:56:40.811 EDT [5e97ad48.36c:1] LOG: database system was shut down at 2020-04-15 20:56:40 EDT 2020-04-15 20:56:40.873 EDT [5e97ad48.c0:3] LOG: database system is ready to accept connections 2020-04-15 20:56:41.015 EDT [5e97ad49.1888:1] [unknown] LOG: connection received: host=127.0.0.1 port=65319 2020-04-15 20:56:41.029 EDT [5e97ad49.1888:2] [unknown] LOG: connection authorized: user=pgrunner database=postgres application_name=001_rep_changes.pl 2020-04-15 20:56:41.044 EDT [5e97ad49.1888:3] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsnAND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; ... repeated a total of 1800 times ... 2020-04-15 21:02:58.960 EDT [5e97aec2.21f8:1] [unknown] LOG: connection received: host=127.0.0.1 port=50920 2020-04-15 21:02:58.960 EDT [5e97aec2.21f8:2] [unknown] LOG: connection authorized: user=pgrunner database=postgres application_name=001_rep_changes.pl 2020-04-15 21:02:58.975 EDT [5e97aec2.21f8:3] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsnAND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 2020-04-15 21:02:58.975 EDT [5e97aec2.21f8:4] 001_rep_changes.pl LOG: disconnection: session time: 0:00:00.015 user=pgrunnerdatabase=postgres host=127.0.0.1 port=50920 <test script gives up hope> 2020-04-15 21:02:59.148 EDT [5e97ad48.c0:4] LOG: received immediate shutdown request 2020-04-15 21:02:59.148 EDT [5e97ad48.2148:1] WARNING: terminating connection because of crash of another server process 2020-04-15 21:02:59.148 EDT [5e97ad48.2148:2] DETAIL: The postmaster has commanded this server process to roll back thecurrent transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2020-04-15 21:02:59.148 EDT [5e97ad48.2148:3] HINT: In a moment you should be able to reconnect to the database and repeatyour command. 2020-04-15 21:02:59.164 EDT [5e97ad48.c0:5] LOG: database system is shut down ==== master log without delay (good run) ... 2020-04-15 11:17:24.868 EDT [5e972584.19a0:3] 001_rep_changes.pl LOG: statement: DELETE FROM tab_rep 2020-04-15 11:17:24.868 EDT [5e972584.19a0:4] 001_rep_changes.pl LOG: disconnection: session time: 0:00:00.015 user=pgrunnerdatabase=postgres host=127.0.0.1 port=60986 2020-04-15 11:17:24.931 EDT [5e972575.1594:4] LOG: received fast shutdown request 2020-04-15 11:17:24.931 EDT [5e972575.1594:5] LOG: aborting any active transactions 2020-04-15 11:17:24.947 EDT [5e972575.1594:6] LOG: background worker "logical replication launcher" (PID 9988) exited withexit code 1 2020-04-15 11:17:24.947 EDT [5e972575.185c:1] LOG: shutting down 2020-04-15 11:17:25.072 EDT [5e972584.560:9] tap_sub LOG: disconnection: session time: 0:00:00.636 user=pgrunner database=postgreshost=127.0.0.1 port=60983 2020-04-15 11:17:25.104 EDT [5e972575.1594:7] LOG: database system is shut down 2020-04-15 11:17:25.312 EDT [5e972585.164:1] LOG: starting PostgreSQL 13devel on x86_64-w64-mingw32, compiled by x86_64-w64-mingw32-gcc.exe(x86_64-win32-sjlj-rev0, Built by MinGW-W64 project) 7.3.0, 64-bit 2020-04-15 11:17:25.312 EDT [5e972585.164:2] LOG: listening on IPv4 address "127.0.0.1", port 60213 2020-04-15 11:17:25.343 EDT [5e972585.c7c:1] LOG: database system was shut down at 2020-04-15 11:17:25 EDT 2020-04-15 11:17:25.390 EDT [5e972585.164:3] LOG: database system is ready to accept connections 2020-04-15 11:17:25.531 EDT [5e972585.d40:1] [unknown] LOG: connection received: host=127.0.0.1 port=60988 2020-04-15 11:17:25.552 EDT [5e972585.d40:2] [unknown] LOG: connection authorized: user=pgrunner database=postgres application_name=001_rep_changes.pl 2020-04-15 11:17:25.567 EDT [5e972585.d40:3] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsnAND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 2020-04-15 11:17:25.583 EDT [5e972585.d40:4] 001_rep_changes.pl LOG: disconnection: session time: 0:00:00.051 user=pgrunnerdatabase=postgres host=127.0.0.1 port=60988 2020-04-15 11:17:25.646 EDT [5e972585.1044:1] [unknown] LOG: connection received: host=127.0.0.1 port=60987 2020-04-15 11:17:25.646 EDT [5e972585.1044:2] [unknown] LOG: replication connection authorized: user=pgrunner application_name=tap_sub 2020-04-15 11:17:25.661 EDT [5e972585.1044:3] tap_sub LOG: received replication command: IDENTIFY_SYSTEM 2020-04-15 11:17:25.661 EDT [5e972585.1044:4] tap_sub LOG: received replication command: START_REPLICATION SLOT "tap_sub"LOGICAL 0/15C1DF8 (proto_version '1', publication_names '"tap_pub_ins_only"') 2020-04-15 11:17:25.661 EDT [5e972585.1044:5] tap_sub LOG: starting logical decoding for slot "tap_sub" 2020-04-15 11:17:25.661 EDT [5e972585.1044:6] tap_sub DETAIL: Streaming transactions committing after 0/15A7678, readingWAL from 0/15A7678. 2020-04-15 11:17:25.661 EDT [5e972585.1044:7] tap_sub LOG: logical decoding found consistent point at 0/15A7678 2020-04-15 11:17:25.661 EDT [5e972585.1044:8] tap_sub DETAIL: There are no running transactions. 2020-04-15 11:17:25.787 EDT [5e972585.1878:1] [unknown] LOG: connection received: host=127.0.0.1 port=60993 2020-04-15 11:17:25.803 EDT [5e972585.1878:2] [unknown] LOG: connection authorized: user=pgrunner database=postgres application_name=001_rep_changes.pl 2020-04-15 11:17:25.803 EDT [5e972585.1878:3] 001_rep_changes.pl LOG: statement: SELECT pg_current_wal_lsn() <= replay_lsnAND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 2020-04-15 11:17:25.819 EDT [5e972585.1878:4] 001_rep_changes.pl LOG: disconnection: session time: 0:00:00.031 user=pgrunnerdatabase=postgres host=127.0.0.1 port=60993 2020-04-15 11:17:26.102 EDT [5e972586.fcc:1] [unknown] LOG: connection received: host=127.0.0.1 port=60996 2020-04-15 11:17:26.102 EDT [5e972586.fcc:2] [unknown] LOG: connection authorized: user=pgrunner database=postgres application_name=001_rep_changes.pl 2020-04-15 11:17:26.117 EDT [5e972586.fcc:3] 001_rep_changes.pl LOG: statement: ALTER PUBLICATION tap_pub_ins_only SET (publish= 'insert, delete') ...
pgsql-hackers by date: