Re: [BUGS] 9.6rc1 Background worker starting multiple times - Mailing list pgsql-bugs

From Keith Fiske
Subject Re: [BUGS] 9.6rc1 Background worker starting multiple times
Date
Msg-id CAG1_KcCXZUoksYi23n7R0KP5DAm-F3BaTFa4ddp6jmCz9DdmiQ@mail.gmail.com
Whole thread Raw
In response to Re: [BUGS] 9.6rc1 Background worker starting multiple times  (Keith Fiske <keith@omniti.com>)
Responses Re: [BUGS] 9.6rc1 Background worker starting multiple times
List pgsql-bugs


-On Tue, Dec 20, 2016 at 10:26 AM, Keith Fiske <keith@omniti.com> wrote:


On Tue, Dec 20, 2016 at 1:42 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Tue, Dec 20, 2016 at 6:47 AM, Keith Fiske <keith@omniti.com> wrote:
> I'm not quite sure exactly how to debug this further to see what is actually
> setting the latch in 9.6.

The commit you are complaining about here is likely this one:
commit: db0f6cad4884bd4c835156d3a720d9a79dbd63a9
author: Robert Haas <rhaas@postgresql.org>
date: Fri, 9 Oct 2015 14:31:04 -0400
Remove set_latch_on_sigusr1 flag.

This flag has proven to be a recipe for bugs, and it doesn't seem like
it can really buy anything in terms of performance.  So let's just
*always* set the process latch when we receive SIGUSR1 instead of
trying to do it only when needed.

So your process gets a SIGUSR1 and its latch is set, causing the loop
to repeat... I would discard that as a backend bug, because the
removal of set_latch_on_sigusr1 was a good move.
--
Michael

Anything I can do in the mean time to fix this, or have to wait for a bugfix patch?

Keith

So, attached is my naive attempt to try and handle SIGUSR1. Can at least confirm this is the source of the problem, but I'm unable to figure out how to work around it. Handling it does in fact stop the background worker from going nuts anymore. But it also completely stops the maintenance loop from working properly. It runs once, then never again. It seems to get stuck after the call to WaitForBackgroundWorkerStartup() and never even gets to my WaitLatch() call. There is logging that it successfully unregisters the dynamic BGW, but never seems to go further than that. I tried adding latch_sigusr1_handler() similar to how the handler in procsignal.c does, but didn't seem to fix it. With some additional debugging I added, it doesn't even appear to be calling my custom signal handler function at all. got_sigusr1 is never set to true.

The simple act of registering to listen for the SIGUSR1 seems to stop the original bug. If I comment out just this line in pg_partman_bgw_main() (the one in pg_partman_bgw_run_maint() seems to make no difference):

    pqsignal(SIGUSR1, pg_partman_bgw_sigusr1);

it goes back to doing what it was doing before.

Guessing there's a lot more to SIGUSR1 than I can see right now. But figured I'd at least try to provide some additional investigation. I noticed in the BGW code, it mentions using SIGUSR1 on start/stop.

https://doxygen.postgresql.org/bgworker_8h_source.html#l00099

Perhaps that's part of the issue when I try and handle it instead?

partman maintenance timer is set for 10 seconds so the timestamp on the log entries shows it's not running on time.

2016-12-20 11:56:46 EST [] [14486]: [7-1] user=,db=,e=00000 LOG:  database system is ready to accept connections
2016-12-20 11:56:46 EST [] [14495]: [1-1] user=,db=,e=00000 LOG:  pg_partman master background worker master process initialized with role keith
2016-12-20 11:56:46 EST [] [14495]: [2-1] user=,db=,e=00000 DEBUG:  Latch status just before for loop: 1
2016-12-20 11:56:46 EST [] [14495]: [3-1] user=,db=,e=00000 DEBUG:  Latch status inside for loop before reset: 1
2016-12-20 11:56:46 EST [] [14495]: [4-1] user=,db=,e=00000 DEBUG:  Latch status just after reset before interupt check: 0
2016-12-20 11:56:46 EST [] [14495]: [5-1] user=,db=,e=00000 DEBUG:  Latch status just after reset after interupt check: 0
2016-12-20 11:56:46 EST [] [14495]: [6-1] user=,db=,e=00000 DEBUG:  After sighup check: 0
2016-12-20 11:56:46 EST [] [14495]: [7-1] user=,db=,e=00000 DEBUG:  After sigterm check: 0
2016-12-20 11:56:46 EST [] [14495]: [8-1] user=,db=,e=00000 DEBUG:  Before sigusr1 check: 0
2016-12-20 11:56:46 EST [] [14495]: [9-1] user=,db=,e=00000 DEBUG:  After sigusr1 check: 0
2016-12-20 11:56:46 EST [] [14495]: [10-1] user=,db=,e=00000 DEBUG:  Dynamic bgw launch begun for keith (0)
2016-12-20 11:56:46 EST [] [14495]: [11-1] user=,db=,e=00000 DEBUG:  Registering dynamic background worker...
2016-12-20 11:56:46 EST [] [14495]: [12-1] user=,db=,e=00000 DEBUG:  Waiting for BGW startup...
2016-12-20 11:56:46 EST [] [14486]: [8-1] user=,db=,e=00000 DEBUG:  registering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-20 11:56:46 EST [] [14486]: [9-1] user=,db=,e=00000 DEBUG:  starting background worker process "pg_partman dynamic background worker (dbname=keith)"
2016-12-20 11:56:46 EST [] [14493]: [1-1] user=,db=,e=00000 LOG:  autovacuum launcher started
2016-12-20 11:56:46 EST [] [14496]: [1-1] user=,db=,e=00000 DEBUG:  Before parsing dbname GUC in dynamic main func: keith
2016-12-20 11:56:46 EST [] [14496]: [2-1] user=,db=,e=00000 DEBUG:  GUC rawstring copy: keith
2016-12-20 11:56:46 EST [] [14496]: [3-1] user=,db=,e=00000 DEBUG:  Entered foreach loop: name (keith), db_main_counter (0), dbcounter (0)
2016-12-20 11:56:46 EST [] [14496]: [4-1] user=,db=,e=00000 DEBUG:  Parsing list: keith (0)
2016-12-20 11:56:46 EST [] [14496]: [5-1] user=,db=,e=00000 DEBUG:  Before run_maint initialize connection for db keith
2016-12-20 11:56:46 EST [] [14496]: [6-1] user=,db=,e=00000 DEBUG:  After run_maint initialize connection for db keith
2016-12-20 11:56:46 EST [] [14496]: [7-1] user=,db=,e=00000 DEBUG:  Checking if pg_partman extension is installed in database: keith
2016-12-20 11:56:46 EST [] [14496]: [8-1] user=,db=,e=00000 LOG:  pg_partman dynamic background worker (dbname=keith) dynamic background worker initialized with role keith on database keith
2016-12-20 11:56:46 EST [] [14496]: [9-1] user=,db=,e=00000 LOG:  pg_partman dynamic background worker (dbname=keith): SELECT "partman".run_maintenance(p_analyze := true, p_jobmon := true) called by role keith on database keith
2016-12-20 11:56:46 EST [] [14496]: [10-1] user=,db=,e=00000 DEBUG:  pg_partman dynamic BGW shutting down gracefully for database keith.
2016-12-20 11:56:46 EST [] [14486]: [10-1] user=,db=,e=00000 DEBUG:  worker process: pg_partman dynamic background worker (dbname=keith) (PID 14496) exited with exit code 0
2016-12-20 11:56:46 EST [] [14486]: [11-1] user=,db=,e=00000 DEBUG:  unregistering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-20 11:57:16 EST [] [14506]: [1-1] user=,db=,e=00000 DEBUG:  autovacuum: processing database "keith"
2016-12-20 11:57:46 EST [] [14515]: [1-1] user=,db=,e=00000 DEBUG:  autovacuum: processing database "postgres"
2

Attachment

pgsql-bugs by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: [BUGS] Postgres8.3 replication issue
Next
From: Mark Kirkwood
Date:
Subject: Re: [BUGS] pg_dump's results have quite different size