RE: 001_rep_changes.pl succeeds after a long time - Mailing list pgsql-hackers
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: 001_rep_changes.pl succeeds after a long time |
Date | |
Msg-id | OS0PR01MB571689717B7139AAD7F62DE694D32@OS0PR01MB5716.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | 001_rep_changes.pl succeeds after a long time (Álvaro Herrera <alvherre@alvh.no-ip.org>) |
List | pgsql-hackers |
On Thursday, March 13, 2025 12:29 AM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: Hi, > I noticed that the test file 001_repo_changes.pl finished successfully after > having taken 180s to run. This seems pretty suspicious -- normally that step > takes around one second. > > The problem is seen in this step: > > [19:44:49.572](0.262s) ok 24 - update works with dropped subscriber column > ### Stopping node "publisher" using mode fast # Running: pg_ctl -D > /home/alvherre/Code/pgsql-build/master/src/test/subscription/tmp_check > /t_001_rep_changes_publisher_data/pgdata -m fast stop waiting for server to > shut down.... done server stopped # No postmaster PID for node "publisher" > ### Starting node "publisher" > # Running: pg_ctl -w -D > /home/alvherre/Code/pgsql-build/master/src/test/subscription/tmp_check > /t_001_rep_changes_publisher_data/pgdata -l > /home/alvherre/Code/pgsql-build/master/src/test/subscription/tmp_check > /log/001_rep_changes_publisher.log -o --cluster-name=publisher start > waiting for server to start.... done server started # Postmaster PID for node > "publisher" is 3502228 Waiting for replication conn tap_sub's replay_lsn to > pass 0/17DEBD8 on publisher done > [19:47:50.689](181.116s) ok 25 - check replicated inserts after subscription > publication change > > > Note that test 25 succeeded but took 3 minutes. If I look at the publisher logs, > I see this: > > 2025-03-11 19:44:49.885 CET [3501892] 001_rep_changes.pl LOG: statement: > DELETE FROM tab_rep > 2025-03-11 19:44:50.052 CET [3494694] LOG: received fast shutdown request > 2025-03-11 19:44:50.052 CET [3494694] LOG: aborting any active > transactions > 2025-03-11 19:44:50.053 CET [3494694] LOG: background worker "logical > replication launcher" (PID 3494772) exited with exit code 1 > 2025-03-11 19:44:50.053 CET [3494755] LOG: shutting down > 2025-03-11 19:44:50.480 CET [3501776] tap_sub LOG: released logical > replication slot "tap_sub" > 2025-03-11 19:44:50.484 CET [3494755] LOG: checkpoint starting: shutdown > immediate > 2025-03-11 19:44:50.485 CET [3494755] LOG: checkpoint complete: wrote 9 > buffers (7.0%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 > recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, > longest=0.000 s, average=0.000 s; distance=579 kB, estimate=579 kB; > lsn=0/17DEB60, redo lsn=0/17DEB60 > 2025-03-11 19:44:50.488 CET [3494694] LOG: database system is shut down > 2025-03-11 19:44:50.884 CET [3502228] LOG: starting PostgreSQL 18devel on > x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit > 2025-03-11 19:44:50.884 CET [3502228] LOG: listening on Unix socket > "/run/user/1000/alvherre-tmp/i4fQURzady/.s.PGSQL.26529" > 2025-03-11 19:44:50.886 CET [3502349] LOG: database system was shut > down at 2025-03-11 19:44:50 CET > 2025-03-11 19:44:50.889 CET [3502228] LOG: database system is ready to > accept connections > 2025-03-11 19:44:50.930 CET [3502395] 001_rep_changes.pl LOG: statement: > SELECT pg_is_in_recovery() > 2025-03-11 19:44:50.945 CET [3502428] 001_rep_changes.pl LOG: statement: > SELECT pg_current_wal_lsn() > 2025-03-11 19:44:50.960 CET [3502440] 001_rep_changes.pl LOG: statement: > SELECT '0/17DEBD8' <= replay_lsn AND state = 'streaming' > FROM pg_catalog.pg_stat_replication > WHERE application_name IN ('tap_sub', 'walreceiver') > > The last query is repeated a number of times for 3 minutes, and finally we get > this > > 2025-03-11 19:47:50.569 CET [3540762] 001_rep_changes.pl LOG: statement: > SELECT '0/17DEBD8' <= replay_lsn AND state = 'streaming' > FROM pg_catalog.pg_stat_replication > WHERE application_name IN ('tap_sub', 'walreceiver') > 2025-03-11 19:47:50.596 CET [3540764] tap_sub LOG: statement: SELECT > pg_catalog.set_config('search_path', '', false); > 2025-03-11 19:47:50.597 CET [3540764] tap_sub LOG: received replication > command: IDENTIFY_SYSTEM > 2025-03-11 19:47:50.597 CET [3540764] tap_sub STATEMENT: > IDENTIFY_SYSTEM > > > Meanwhile, the subscriber has this: > > 2025-03-11 19:44:49.777 CET [3501647] LOG: logical replication worker for > subscription "tap_sub" will restart because of a parameter change > 2025-03-11 19:44:49.785 CET [3501757] LOG: logical replication apply worker > for subscription "tap_sub" has started > 2025-03-11 19:44:50.481 CET [3501757] LOG: data stream from publisher has > ended > 2025-03-11 19:44:50.481 CET [3501757] ERROR: could not send > end-of-streaming message to primary: server closed the connection > unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > no COPY in progress > 2025-03-11 19:44:50.484 CET [3495251] LOG: background worker "logical > replication apply worker" (PID 3501757) exited with exit code 1 > 2025-03-11 19:44:50.490 CET [3502088] LOG: logical replication apply worker > for subscription "tap_sub" has started > 2025-03-11 19:44:50.490 CET [3502088] ERROR: apply worker for > subscription "tap_sub" could not connect to the publisher: connection to > server on socket "/run/user/1000/alvherre-tmp/i4fQURzady/.s.PGSQL.26529" > failed: No such file or directory > Is the server running locally and accepting connections on that socket? > 2025-03-11 19:44:50.493 CET [3495251] LOG: background worker "logical > replication apply worker" (PID 3502088) exited with exit code 1 > 2025-03-11 19:47:50.593 CET [3540763] LOG: logical replication apply worker > for subscription "tap_sub" has started > > > Is the problem that the subscriber never tries to establish the connection > again? > > I have tried to reproduce it, but I haven't seen it again. I suspect that the launcher may have missed the latch ping from the exiting apply worker due to the bug reported in [1]. [1] https://www.postgresql.org/message-id/flat/CALDaNm3fvCg00GPWtjagDNP1cSBAW-gMKTUZWFTg%2B8x6tUa-%3Dw%40mail.gmail.com#e3ac171e017cc88bf5f86541ecd91a5a Best Regards, Hou zj
pgsql-hackers by date: