Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag
Date
Msg-id 3365937.1696801735@sss.pgh.pa.us
Whole thread Raw
In response to Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Add a new BGWORKER_BYPASS_ROLELOGINCHECK flag
List pgsql-hackers
Michael Paquier <michael@paquier.xyz> writes:
> On Fri, Oct 06, 2023 at 03:29:28PM +0900, Michael Paquier wrote:
>> Andres, are there logs for this TAP test on serinus?  Or perhaps there
>> is a core file that could be looked at?  The other animals are not
>> showing anything for the moment.

> Well, it looks OK.  Still that's itching a bit.

There have been intermittent failures on various buildfarm machines
since this went in.  After seeing one on my own animal mamba [1],
I tried to reproduce it manually on that machine, and it does
indeed fail about one time in two.  The buildfarm script is not
managing to capture the relevant log files, but what I see in a
manual run is that 001_worker_spi.pl logs this:

...
# Postmaster PID for node "mynode" is 21897
[01:19:53.931](2.663s) ok 5 - bgworkers all launched
[01:19:54.711](0.780s) ok 6 - dynamic bgworkers all launched
error running SQL: 'psql:<stdin>:1: ERROR:  could not start background process
HINT:  More details may be available in the server log.'
while running 'psql -XAtq -d port=56393 host=/tmp/PETPK0Stwi dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql
'SELECTworker_spi_launch(12, 16394, 16395);' at
/home/tgl/pgsql/src/test/modules/worker_spi/../../../../src/test/perl/PostgreSQL/Test/Cluster.pmline 2009. 
# Postmaster PID for node "mynode" is 21897
### Stopping node "mynode" using mode immediate
# Running: pg_ctl -D /home/tgl/pgsql/src/test/modules/worker_spi/tmp_check/t_001_worker_spi_mynode_data/pgdata -m
immediatestop 
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "mynode"
[01:19:55.032](0.321s) # Tests were run but no plan was declared and done_testing() was not seen.
[01:19:55.035](0.002s) # Looks like your test exited with 29 just after 6.

and in the postmaster log

2023-10-08 01:19:54.265 EDT [5820] LOG:  worker_spi dynamic worker 10 initialized with schema10.counted
2023-10-08 01:19:54.378 EDT [27776] 001_worker_spi.pl LOG:  statement: SELECT worker_spi_launch(11, 5, 16395);
2023-10-08 01:19:54.476 EDT [18120] 001_worker_spi.pl LOG:  statement: SELECT datname, usename, wait_event FROM
pg_stat_activity
                    WHERE backend_type = 'worker_spi dynamic' AND
                    pid IN (5820, 428) ORDER BY datname;
2023-10-08 01:19:54.548 EDT [428] LOG:  worker_spi dynamic worker 11 initialized with schema11.counted
2023-10-08 01:19:54.680 EDT [152] 001_worker_spi.pl LOG:  statement: SELECT datname, usename, wait_event FROM
pg_stat_activity
                    WHERE backend_type = 'worker_spi dynamic' AND
                    pid IN (5820, 428) ORDER BY datname;
2023-10-08 01:19:54.779 EDT [1675] 001_worker_spi.pl LOG:  statement: ALTER DATABASE mydb ALLOW_CONNECTIONS false;
2023-10-08 01:19:54.854 EDT [26562] 001_worker_spi.pl LOG:  statement: SELECT worker_spi_launch(12, 16394, 16395);
2023-10-08 01:19:54.878 EDT [23636] FATAL:  database "mydb" is not currently accepting connections
2023-10-08 01:19:54.888 EDT [21897] LOG:  background worker "worker_spi dynamic" (PID 23636) exited with exit code 1
2023-10-08 01:19:54.888 EDT [26562] 001_worker_spi.pl ERROR:  could not start background process
2023-10-08 01:19:54.888 EDT [26562] 001_worker_spi.pl HINT:  More details may be available in the server log.
2023-10-08 01:19:54.888 EDT [26562] 001_worker_spi.pl STATEMENT:  SELECT worker_spi_launch(12, 16394, 16395);
2023-10-08 01:19:54.912 EDT [21897] LOG:  received immediate shutdown request
2023-10-08 01:19:55.014 EDT [21897] LOG:  database system is shut down

What it looks like to me is that there is a code path by which "could
not start background process" is reported as a failure of the SELECT
worker_spi_launch() query itself.  The test script is not expecting
that, because it executes that query with

# bgworker cannot be launched with connection restriction.
my $worker3_pid = $node->safe_psql('postgres',
   qq[SELECT worker_spi_launch(12, $mydb_id, $myrole_id);]);
$node->wait_for_log(
   qr/database "mydb" is not currently accepting connections/, $log_offset);

so safe_psql bails out and we get no further.

Since this only seems to happen on slow machines, I'd call it a timing
problem or race condition.  Unless you want to argue that the race
should not happen, probably the fix is to make the test script cope
with this worker_spi_launch() call failing.  As long as we see the
expected result from wait_for_log, we can be pretty sure the right
thing happened.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2023-10-08%2001%3A00%3A22



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Does anyone ever use OPTIMIZER_DEBUG?
Next
From: Peter Smith
Date:
Subject: Re: Synchronizing slots from primary to standby