Re: Current master hangs under the debugger after Parallel Seq Scan (Linux, MacOS) - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Current master hangs under the debugger after Parallel Seq Scan (Linux, MacOS)
Msg-id 3kp64koynvdzepbyddpkel7dugnku7ksfevkovx3rrrsle4dcp@ah7gla44mxjh
In response to Current master hangs under the debugger after Parallel Seq Scan (Linux, MacOS)  (Vladlen Popolitov <v.popolitov@postgrespro.ru>)
Responses Re: Current master hangs under the debugger after Parallel Seq Scan (Linux, MacOS)
List pgsql-hackers
Hi,

On 2025-03-26 21:53:35 +0700, Vladlen Popolitov wrote:
> During a debug session I found that queries with Parallel Seq Scan hang
> in the current master: the leader waits indefinitely for the signal
> from the parallel workers. The query cannot be interrupted; the leader
> does not check interrupt status in the waiting loop.
> 
> 1. How to reproduce:
> a) Create table:
> 
> CREATE DATABASE expr;
> \c expr
> CREATE TABLE testexpr(
> id INT,
> val INT
> );
> INSERT INTO testexpr (id, val)
> SELECT serie as id , MOD(serie, 10) as val
> FROM generate_series(1,1000000) as serie;
> EXPLAIN (ANALYZE) SELECT * FROM testexpr
> WHERE val=1 AND id<30;
> 
> b) start debugger for this connection
> 
> c) Run the command (parallel workers should be enabled, as they are in the
> default configuration)
> EXPLAIN (ANALYZE) SELECT * FROM testexpr
> WHERE val=1 AND id<30;
> 
> d) The above query will start parallel worker(s). When the worker(s) finish,
> they send SIGUSR1, which is caught by the debugger. When you dismiss
> the signal message, the query appears to continue running, but it is really
> waiting (in latch.c or in waiteventset.c, depending on the commit).

Isn't that to be expected? If I understand correctly, your gdb is configured
to intercept SIGUSR1 signals *without* passing them on to the application
(i.e. postgres).  We rely on the signal being delivered, which it isn't. Thus
a hang.
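
To illustrate why a swallowed signal turns into a hang, here is a minimal
standalone sketch. It is not PostgreSQL's latch code; wait_for_wakeup() and
its parameters are hypothetical names made up for this example. The waiter
sleeps in sigsuspend() until a SIGUSR1 handler sets a flag.
signal_is_delivered = false stands in for a debugger configured with "nopass",
and the SIGALRM timeout (which must be nonzero) exists only so the demo
terminates; a real waiter without a timeout would sleep forever.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Flags written only from signal handlers. */
static volatile sig_atomic_t wakeup_seen = 0;
static volatile sig_atomic_t timed_out = 0;

static void on_sigusr1(int signo) { (void) signo; wakeup_seen = 1; }
static void on_sigalrm(int signo) { (void) signo; timed_out = 1; }

static void install_handler(int signo, void (*fn)(int))
{
    struct sigaction sa;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = fn;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(signo, &sa, NULL);
    assert(rc == 0);
    (void) rc;
}

/*
 * Hypothetical helper: returns true if the SIGUSR1 wakeup arrived, false if
 * the wait ended only because the demo timeout fired. timeout_secs must be
 * greater than zero.
 */
bool wait_for_wakeup(bool signal_is_delivered, unsigned timeout_secs)
{
    sigset_t block, old, wait_mask;

    wakeup_seen = 0;
    timed_out = 0;
    install_handler(SIGUSR1, on_sigusr1);
    install_handler(SIGALRM, on_sigalrm);

    /* Block both signals so they can only be handled inside sigsuspend(). */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigaddset(&block, SIGALRM);
    sigprocmask(SIG_BLOCK, &block, &old);

    /* "Worker finished": the signal stays pending until we start waiting. */
    if (signal_is_delivered)
        raise(SIGUSR1);

    alarm(timeout_secs);

    /* Wait with both signals unblocked until one of the handlers has run. */
    wait_mask = old;
    sigdelset(&wait_mask, SIGUSR1);
    sigdelset(&wait_mask, SIGALRM);
    while (!wakeup_seen && !timed_out)
        sigsuspend(&wait_mask);

    alarm(0);
    sigprocmask(SIG_SETMASK, &old, NULL);
    return wakeup_seen != 0;
}
```

Calling wait_for_wakeup(true, 1) returns true right away, while
wait_for_wakeup(false, 1) returns false only after the one-second timeout,
which is the stand-in for the indefinite wait described above.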

At least my gdb doesn't intercept SIGUSR1 by default. It's a newer gdb though,
so that could have been different in the past (although I don't remember a
different behaviour).

(gdb) handle SIGUSR1
Signal        Stop      Print   Pass to program Description
SIGUSR1       No        No      Yes             User defined signal 1

If I change the configuration to not pass it, but print it, I can reproduce a
hang:
(gdb) handle SIGUSR1 print nopass


What does your gdb show for "handle SIGUSR1"? If it isn't what I reported, is
it possible that you set that in your .gdbinit or such?
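
If the setting does turn out to differ, something along these lines, typed at
the gdb prompt or placed in .gdbinit, should restore pass-through. This is a
sketch using gdb's standard "handle" keywords; "pass" is the part that matters
for the hang, while "nostop noprint" merely keeps gdb quiet about the signal.

```
handle SIGUSR1 nostop noprint pass
```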



> 2. Original commit with reproducible behaviour.
> I tracked this behaviour down to commit
> > commit 7202d72787d3b93b692feae62ee963238580c877
> > Date:   Fri Feb 21 08:03:33 2025 +0100
> > backend launchers void * arguments for binary data
> > Change backend launcher functions to take void * for binary data
> > instead of char *.  This removes the need for numerous casts.
> > Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org

I also find it very hard to believe that this commit introduced this problem -
it doesn't sound like a postgres issue to me.  I can reproduce it in PG 16,
after doing "handle SIGUSR1 print nopass".

Greetings,

Andres Freund


