Re: Race conditions in 019_replslot_limit.pl - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Race conditions in 019_replslot_limit.pl
Date:
Msg-id: 20220215235157.jknqi73c5scietxe@alap3.anarazel.de
In response to: Race conditions in 019_replslot_limit.pl (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: Race conditions in 019_replslot_limit.pl
List: pgsql-hackers
Hi,

On 2022-02-15 23:29:20 +0200, Heikki Linnakangas wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-02-14%2006%3A30%3A04
>
> [07:42:23] t/018_wal_optimize.pl ................ ok 12403 ms ( 0.00 usr 0.00 sys + 1.40 cusr 0.63 csys = 2.03 CPU)
> # poll_query_until timed out executing this query:
> # SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
> # expecting this output:
> # lost
> # last actual query output:
> # unreserved

The relevant bit from the log is:

2022-02-14 07:42:27.817 CET [6209f9d3.68bab:3] LOG:  database system is ready to accept read-only connections
2022-02-14 07:42:27.819 CET [6209f9d3.68bb7:1] LOG:  started streaming WAL from primary at 0/1B00000 on timeline 1
2022-02-14 07:42:27.819 CET [6209f9d3.68bb7:2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000001B has already been removed
2022-02-14 07:42:27.822 CET [6209f9d3.68bb9:1] LOG:  started streaming WAL from primary at 0/1B00000 on timeline 1
2022-02-14 07:42:27.822 CET [6209f9d3.68bb9:2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000001B has already been removed

I think what happened is that there was no WAL to receive between the start of the primary and the

    $node_primary3->wait_for_catchup($node_standby3);

Because the slot is created without reserving WAL, the primary is free to remove the WAL segments without ever creating a slot-based conflict. I think that should be fixable by reserving WAL when the slot is created (see the sketch further down)?

> and:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-02-15%2011%3A00%3A08
>
> # Failed test 'have walsender pid 3682154
> # 3682136'
> # at t/019_replslot_limit.pl line 335.
> #   '3682154
> # 3682136'
> #   doesn't match '(?^:^[0-9]+$)'
>
> The latter looks like there are two walsenders active, which confuses the
> test.

Too bad there's no plain pid in the log_line_prefix. Looks like that used to be the buildfarm default, and I haven't fixed that animal's configuration... %c apparently is hex(process startup time).hex(pid), so we're looking for 382f58...
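To illustrate the proposed fix for the first failure, a minimal sketch, assuming rep3 is created via pg_create_physical_replication_slot (the actual code in 019_replslot_limit.pl may set the slot up differently):

    # Hypothetical sketch, not the actual test code: pass
    # immediately_reserve = true (second argument) so the slot holds back
    # WAL from the moment it is created, instead of leaving the primary
    # free to recycle segments before the standby attaches.
    $node_primary3->safe_psql('postgres',
        "SELECT pg_create_physical_replication_slot('rep3', true)");

And a quick conversion of the extra walsender pid from the failed test gives the %c value to look for in the log:

    $ perl -e 'printf "%x\n", 3682136'
    382f58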
Confirmed by the slot name:

2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG:  received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)

which pg_basebackup builds using the backend pid:

    replication_slot = psprintf("pg_basebackup_%d", (int) PQbackendPID(param->bgconn));

The logs for that pid are:

2022-02-15 12:10:20.873 CET [620b8a1c.382f58:1] LOG:  connection received: host=[local]
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:2] LOG:  replication connection authorized: user=bf application_name=019_replslot_limit.pl
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:3] LOG:  received replication command: SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:4] STATEMENT:  SHOW data_directory_mode
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:5] LOG:  received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.874 CET [620b8a1c.382f58:6] STATEMENT:  CREATE_REPLICATION_SLOT "pg_basebackup_3682136" TEMPORARY PHYSICAL ( RESERVE_WAL)
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:7] LOG:  received replication command: IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:8] STATEMENT:  IDENTIFY_SYSTEM
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:9] LOG:  received replication command: START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1
2022-02-15 12:10:20.875 CET [620b8a1c.382f58:10] STATEMENT:  START_REPLICATION SLOT "pg_basebackup_3682136" 0/600000 TIMELINE 1

Even though the node has log_disconnections = true, and other processes indeed log their disconnection, there's no disconnect for the above session until the server is shut down. Even though pg_basebackup clearly finished? Uh, huh?

I guess it's conceivable that the backend was still working through process shutdown? But it doesn't seem too likely, given that several other connections manage to get through entire connect / disconnect cycles.

> Has anyone looked into these yet?

Hadn't yet...

Greetings,

Andres Freund