Re: Recent 027_streaming_regress.pl hangs - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Recent 027_streaming_regress.pl hangs
Date:
Msg-id: 20240321063953.oyfojyq3wbc77xxb@awork3.anarazel.de
In response to: Re: Recent 027_streaming_regress.pl hangs (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On 2024-03-20 17:41:45 -0700, Andres Freund wrote:
> 2024-03-20 22:14:01.904 UTC [56343][client backend][6/1925:0] LOG:  connection authorized: user=bf database=postgres application_name=027_stream_regress.pl
> 2024-03-20 22:14:01.930 UTC [56343][client backend][6/1926:0] LOG:  statement: SELECT '0/15BA21B0' <= replay_lsn AND state = 'streaming'
>         FROM pg_catalog.pg_stat_replication
>         WHERE application_name IN ('standby_1', 'walreceiver')
> 2024-03-20 22:14:01.958 UTC [56343][client backend][:0] LOG:  disconnection: session time: 0:00:00.063 user=bf database=postgres host=[local]
> 2024-03-20 22:14:02.083 UTC [3729516][postmaster][:0] LOG:  received immediate shutdown request
> 2024-03-20 22:14:04.970 UTC [3729516][postmaster][:0] LOG:  database system is shut down
>
> There was no activity for 7 minutes.
>
> System statistics show relatively low CPU and IO load for the period from
> 22:00 - 22:10.
>
> I suspect we have some more fundamental instability at our hands; there have
> been failures like this going back a while, and on various machines.

I've reproduced something like this scenario locally, although I am not sure
it is precisely what is happening on the buildfarm. At least here it looks
like the problem is that apply is lagging substantially:

2024-03-20 22:43:11.024 PDT [1023505][walreceiver][:0][] DEBUG:  sendtime 2024-03-20 22:43:11.024348-07 receipttime 2024-03-20 22:43:11.02437-07 replication apply delay 285322 ms transfer latency 1 ms

Which then means that we'll wait for a long time for apply to finish:

Waiting for replication conn standby_1's replay_lsn to pass 0/14385E20 on primary
[22:41:34.521](0.221s) # state before polling:
# pid                | 1023508
# application_name   | standby_1
# sent_lsn           | 0/14385E20
# primary_wal_lsn    | 0/14385E20
# standby_write_lsn  | 0/14385E20
# primary_flush_lsn  | 0/14385E20
# standby_flush_lsn  | 0/14385E20
# standby_replay_lsn | 0/126D5C58
...
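As an aside, the size of that replay gap is easy to quantify from the state
dump: PostgreSQL prints an LSN as two hex words ("hi/lo") of a 64-bit WAL
byte position, so the sent vs. replay distance can be computed directly. A
minimal sketch (the helper name is mine, the LSN values are from the dump):

```python
def parse_lsn(lsn: str) -> int:
    """Decode PostgreSQL's textual LSN 'hi/lo' (two hex words) into a
    64-bit byte position in the WAL stream: (hi << 32) | lo."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Gap between what the standby has received and what it has applied,
# using sent_lsn and standby_replay_lsn from the "state before polling" dump.
lag = parse_lsn("0/14385E20") - parse_lsn("0/126D5C58")
print(f"replay lag: {lag} bytes (~{lag / 1024 / 1024:.1f} MiB)")
# → replay lag: 30081480 bytes (~28.7 MiB)
```

So at the point polling started, the standby still had roughly 29 MiB of
already-received WAL left to apply.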
[22:43:16.376](0.161s) # running query, attempt 679/1800
[22:43:16.627](0.251s) # state now:
# pid                | 1023508
# application_name   | standby_1
# sent_lsn           | 0/14778468
# primary_wal_lsn    | 0/14778468
# standby_write_lsn  | 0/14778468
# primary_flush_lsn  | 0/14778468
# standby_flush_lsn  | 0/14778468
# standby_replay_lsn | 0/14778468

I am not sure I have debugged why exactly, but it sure looks like one part is
the startup process being busy unlinking files synchronously. This appears to
be exacerbated by mdunlinkfork() first truncating and then separately
unlinking the file - that looks to trigger a lot of filesystem journal
flushes (on xfs).

We also spend a fair bit of time updating the control file, because we flush
the WAL when replaying a transaction commit with a relation unlink. That also
badly interacts with doing metadata operations...

Thirdly, we flush received WAL extremely granularly at times, which requires
another fsync:

2024-03-20 23:30:21.469 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BB0000
2024-03-20 23:30:21.473 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BB0170
2024-03-20 23:30:21.479 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BB2528
2024-03-20 23:30:21.480 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BB58C8
2024-03-20 23:30:21.487 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BB7DA0
2024-03-20 23:30:21.490 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BB92B0
2024-03-20 23:30:21.494 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BBBAC0
2024-03-20 23:30:21.496 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BBCCC0
2024-03-20 23:30:21.499 PDT [1525084][walreceiver][:0][] LOG:  flushed received WAL up to 0/13BBCE18

This all when we're quite far behind with apply...

Greetings,

Andres Freund
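Just how granular those flushes are is clearer as byte deltas between
consecutive flush positions; a small sketch (self-contained, decoding the
textual "hi/lo" LSN format, with the positions taken from the log lines
above):

```python
def parse_lsn(lsn: str) -> int:
    """Decode PostgreSQL's textual LSN 'hi/lo' into a 64-bit WAL position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Flush positions from the walreceiver log lines above, in order.
flushes = ["0/13BB0000", "0/13BB0170", "0/13BB2528", "0/13BB58C8",
           "0/13BB7DA0", "0/13BB92B0", "0/13BBBAC0", "0/13BBCCC0",
           "0/13BBCE18"]
positions = [parse_lsn(f) for f in flushes]
# Bytes covered by each individual flush (and thus each fsync).
deltas = [b - a for a, b in zip(positions, positions[1:])]
print(deltas)
# → [368, 9144, 13216, 9432, 5392, 10256, 4608, 344]
```

I.e. eight fsyncs in ~30 ms, several of them covering well under a page's
worth of WAL.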