RE: 001_rep_changes.pl succeeds after a long time - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: 001_rep_changes.pl succeeds after a long time
Date
Msg-id OS0PR01MB571689717B7139AAD7F62DE694D32@OS0PR01MB5716.jpnprd01.prod.outlook.com
Whole thread Raw
In response to 001_rep_changes.pl succeeds after a long time  (Álvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers
On Thursday, March 13, 2025 12:29 AM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hi,

> I noticed that the test file 001_repo_changes.pl finished successfully after
> having taken 180s to run.  This seems pretty suspicious -- normally that step
> takes around one second.
> 
> The problem is seen in this step:
> 
> [19:44:49.572](0.262s) ok 24 - update works with dropped subscriber column
> ### Stopping node "publisher" using mode fast # Running: pg_ctl -D
> /home/alvherre/Code/pgsql-build/master/src/test/subscription/tmp_check
> /t_001_rep_changes_publisher_data/pgdata -m fast stop waiting for server to
> shut down.... done server stopped # No postmaster PID for node "publisher"
> ### Starting node "publisher"
> # Running: pg_ctl -w -D
> /home/alvherre/Code/pgsql-build/master/src/test/subscription/tmp_check
> /t_001_rep_changes_publisher_data/pgdata -l
> /home/alvherre/Code/pgsql-build/master/src/test/subscription/tmp_check
> /log/001_rep_changes_publisher.log -o --cluster-name=publisher start
> waiting for server to start.... done server started # Postmaster PID for node
> "publisher" is 3502228 Waiting for replication conn tap_sub's replay_lsn to
> pass 0/17DEBD8 on publisher done
> [19:47:50.689](181.116s) ok 25 - check replicated inserts after subscription
> publication change
> 
> 
> Note that test 25 succeeded but took 3 minutes.  If I look at the publisher logs,
> I see this:
> 
> 2025-03-11 19:44:49.885 CET [3501892] 001_rep_changes.pl LOG:  statement:
> DELETE FROM tab_rep
> 2025-03-11 19:44:50.052 CET [3494694] LOG:  received fast shutdown request
> 2025-03-11 19:44:50.052 CET [3494694] LOG:  aborting any active
> transactions
> 2025-03-11 19:44:50.053 CET [3494694] LOG:  background worker "logical
> replication launcher" (PID 3494772) exited with exit code 1
> 2025-03-11 19:44:50.053 CET [3494755] LOG:  shutting down
> 2025-03-11 19:44:50.480 CET [3501776] tap_sub LOG:  released logical
> replication slot "tap_sub"
> 2025-03-11 19:44:50.484 CET [3494755] LOG:  checkpoint starting: shutdown
> immediate
> 2025-03-11 19:44:50.485 CET [3494755] LOG:  checkpoint complete: wrote 9
> buffers (7.0%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 0
> recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0,
> longest=0.000 s, average=0.000 s; distance=579 kB, estimate=579 kB;
> lsn=0/17DEB60, redo lsn=0/17DEB60
> 2025-03-11 19:44:50.488 CET [3494694] LOG:  database system is shut down
> 2025-03-11 19:44:50.884 CET [3502228] LOG:  starting PostgreSQL 18devel on
> x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
> 2025-03-11 19:44:50.884 CET [3502228] LOG:  listening on Unix socket
> "/run/user/1000/alvherre-tmp/i4fQURzady/.s.PGSQL.26529"
> 2025-03-11 19:44:50.886 CET [3502349] LOG:  database system was shut
> down at 2025-03-11 19:44:50 CET
> 2025-03-11 19:44:50.889 CET [3502228] LOG:  database system is ready to
> accept connections
> 2025-03-11 19:44:50.930 CET [3502395] 001_rep_changes.pl LOG:  statement:
> SELECT pg_is_in_recovery()
> 2025-03-11 19:44:50.945 CET [3502428] 001_rep_changes.pl LOG:  statement:
> SELECT pg_current_wal_lsn()
> 2025-03-11 19:44:50.960 CET [3502440] 001_rep_changes.pl LOG:  statement:
> SELECT '0/17DEBD8' <= replay_lsn AND state = 'streaming'
>              FROM pg_catalog.pg_stat_replication
>              WHERE application_name IN ('tap_sub', 'walreceiver')
> 
> The last query is repeated a number of times for 3 minutes, and finally we get
> this
> 
> 2025-03-11 19:47:50.569 CET [3540762] 001_rep_changes.pl LOG:  statement:
> SELECT '0/17DEBD8' <= replay_lsn AND state = 'streaming'
>              FROM pg_catalog.pg_stat_replication
>              WHERE application_name IN ('tap_sub', 'walreceiver')
> 2025-03-11 19:47:50.596 CET [3540764] tap_sub LOG:  statement: SELECT
> pg_catalog.set_config('search_path', '', false);
> 2025-03-11 19:47:50.597 CET [3540764] tap_sub LOG:  received replication
> command: IDENTIFY_SYSTEM
> 2025-03-11 19:47:50.597 CET [3540764] tap_sub STATEMENT:
> IDENTIFY_SYSTEM
> 
> 
> Meanwhile, the subscriber has this:
> 
> 2025-03-11 19:44:49.777 CET [3501647] LOG:  logical replication worker for
> subscription "tap_sub" will restart because of a parameter change
> 2025-03-11 19:44:49.785 CET [3501757] LOG:  logical replication apply worker
> for subscription "tap_sub" has started
> 2025-03-11 19:44:50.481 CET [3501757] LOG:  data stream from publisher has
> ended
> 2025-03-11 19:44:50.481 CET [3501757] ERROR:  could not send
> end-of-streaming message to primary: server closed the connection
> unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
>     no COPY in progress
> 2025-03-11 19:44:50.484 CET [3495251] LOG:  background worker "logical
> replication apply worker" (PID 3501757) exited with exit code 1
> 2025-03-11 19:44:50.490 CET [3502088] LOG:  logical replication apply worker
> for subscription "tap_sub" has started
> 2025-03-11 19:44:50.490 CET [3502088] ERROR:  apply worker for
> subscription "tap_sub" could not connect to the publisher: connection to
> server on socket "/run/user/1000/alvherre-tmp/i4fQURzady/.s.PGSQL.26529"
> failed: No such file or directory
>         Is the server running locally and accepting connections on that socket?
> 2025-03-11 19:44:50.493 CET [3495251] LOG:  background worker "logical
> replication apply worker" (PID 3502088) exited with exit code 1
> 2025-03-11 19:47:50.593 CET [3540763] LOG:  logical replication apply worker
> for subscription "tap_sub" has started
> 
> 
> Is the problem that the subscriber never tries to establish the connection
> again?
> 
> I have tried to reproduce it, but I haven't seen it again.

I suspect that the launcher may have missed the latch ping from the exiting
apply worker due to the bug reported in [1].

[1]
https://www.postgresql.org/message-id/flat/CALDaNm3fvCg00GPWtjagDNP1cSBAW-gMKTUZWFTg%2B8x6tUa-%3Dw%40mail.gmail.com#e3ac171e017cc88bf5f86541ecd91a5a

Best Regards,
Hou zj

pgsql-hackers by date:

Previous
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: long-standing data loss bug in initial sync of logical replication
Next
From: Steven Niu
Date:
Subject: Re: [Patch] remove duplicated smgrclose