Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag |
Date | |
Msg-id | 3365937.1696801735@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag (Michael Paquier <michael@paquier.xyz>) |
Responses |
Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag
|
List | pgsql-hackers |
Michael Paquier <michael@paquier.xyz> writes: > On Fri, Oct 06, 2023 at 03:29:28PM +0900, Michael Paquier wrote: >> Andres, are there logs for this TAP test on serinus? Or perhaps there >> is a core file that could be looked at? The other animals are not >> showing anything for the moment. > Well, it looks OK. Still that's itching a bit. There have been intermittent failures on various buildfarm machines since this went in. After seeing one on my own animal mamba [1], I tried to reproduce it manually on that machine, and it does indeed fail about one time in two. The buildfarm script is not managing to capture the relevant log files, but what I see in a manual run is that 001_worker_spi.pl logs this: ... # Postmaster PID for node "mynode" is 21897 [01:19:53.931](2.663s) ok 5 - bgworkers all launched [01:19:54.711](0.780s) ok 6 - dynamic bgworkers all launched error running SQL: 'psql:<stdin>:1: ERROR: could not start background process HINT: More details may be available in the server log.' while running 'psql -XAtq -d port=56393 host=/tmp/PETPK0Stwi dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'SELECTworker_spi_launch(12, 16394, 16395);' at /home/tgl/pgsql/src/test/modules/worker_spi/../../../../src/test/perl/PostgreSQL/Test/Cluster.pmline 2009. # Postmaster PID for node "mynode" is 21897 ### Stopping node "mynode" using mode immediate # Running: pg_ctl -D /home/tgl/pgsql/src/test/modules/worker_spi/tmp_check/t_001_worker_spi_mynode_data/pgdata -m immediatestop waiting for server to shut down.... done server stopped # No postmaster PID for node "mynode" [01:19:55.032](0.321s) # Tests were run but no plan was declared and done_testing() was not seen. [01:19:55.035](0.002s) # Looks like your test exited with 29 just after 6. and in the postmaster log 2023-10-08 01:19:54.265 EDT [5820] LOG: worker_spi dynamic worker 10 initialized with schema10.counted 2023-10-08 01:19:54.378 EDT [27776] 001_worker_spi.pl LOG: statement: SELECT worker_spi_launch(11, 5, 16395); 2023-10-08 01:19:54.476 EDT [18120] 001_worker_spi.pl LOG: statement: SELECT datname, usename, wait_event FROM pg_stat_activity WHERE backend_type = 'worker_spi dynamic' AND pid IN (5820, 428) ORDER BY datname; 2023-10-08 01:19:54.548 EDT [428] LOG: worker_spi dynamic worker 11 initialized with schema11.counted 2023-10-08 01:19:54.680 EDT [152] 001_worker_spi.pl LOG: statement: SELECT datname, usename, wait_event FROM pg_stat_activity WHERE backend_type = 'worker_spi dynamic' AND pid IN (5820, 428) ORDER BY datname; 2023-10-08 01:19:54.779 EDT [1675] 001_worker_spi.pl LOG: statement: ALTER DATABASE mydb ALLOW_CONNECTIONS false; 2023-10-08 01:19:54.854 EDT [26562] 001_worker_spi.pl LOG: statement: SELECT worker_spi_launch(12, 16394, 16395); 2023-10-08 01:19:54.878 EDT [23636] FATAL: database "mydb" is not currently accepting connections 2023-10-08 01:19:54.888 EDT [21897] LOG: background worker "worker_spi dynamic" (PID 23636) exited with exit code 1 2023-10-08 01:19:54.888 EDT [26562] 001_worker_spi.pl ERROR: could not start background process 2023-10-08 01:19:54.888 EDT [26562] 001_worker_spi.pl HINT: More details may be available in the server log. 2023-10-08 01:19:54.888 EDT [26562] 001_worker_spi.pl STATEMENT: SELECT worker_spi_launch(12, 16394, 16395); 2023-10-08 01:19:54.912 EDT [21897] LOG: received immediate shutdown request 2023-10-08 01:19:55.014 EDT [21897] LOG: database system is shut down What it looks like to me is that there is a code path by which "could not start background process" is reported as a failure of the SELECT worker_spi_launch() query itself. The test script is not expecting that, because it executes that query with # bgworker cannot be launched with connection restriction. my $worker3_pid = $node->safe_psql('postgres', qq[SELECT worker_spi_launch(12, $mydb_id, $myrole_id);]); $node->wait_for_log( qr/database "mydb" is not currently accepting connections/, $log_offset); so safe_psql bails out and we get no further. Since this only seems to happen on slow machines, I'd call it a timing problem or race condition. Unless you want to argue that the race should not happen, probably the fix is to make the test script cope with this worker_spi_launch() call failing. As long as we see the expected result from wait_for_log, we can be pretty sure the right thing happened. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2023-10-08%2001%3A00%3A22
pgsql-hackers by date: