Thread: Current master hangs under the debugger after Parallel Seq Scan (Linux, MacOS)

Hi!

During a debugging session I found that queries with a Parallel Seq Scan hang
on current master: the leader process waits indefinitely for the signal
from the parallel workers. The query cannot be interrupted, because the leader
does not check the interrupt status in the waiting loop.

1. How to reproduce:
a) Create table:

CREATE DATABASE expr;
\c expr
CREATE TABLE testexpr(
id INT,
val INT
);
INSERT INTO testexpr (id, val)
SELECT serie as id , MOD(serie, 10) as val
FROM generate_series(1,1000000) as serie;
EXPLAIN (ANALYZE) SELECT * FROM testexpr
WHERE val=1 AND id<30;

b) Attach a debugger to the backend process serving this connection.

c) Run the query (parallel workers must be enabled, as they are in the
default configuration):
EXPLAIN (ANALYZE) SELECT * FROM testexpr
WHERE val=1 AND id<30;

d) The query above starts parallel worker(s). When the worker(s) finish,
they send SIGUSR1, which is caught by the debugger. After you dismiss
the signal message, the query appears to keep running, but it is actually
waiting (in latch.c or in waiteventset.c, depending on the commit).

2. Original commit with reproducible behaviour.
I tracked this behaviour down to commit
> commit 7202d72787d3b93b692feae62ee963238580c877
> Date:   Fri Feb 21 08:03:33 2025 +0100
> backend launchers void * arguments for binary data
> Change backend launcher functions to take void * for binary data
> instead of char *.  This removes the need for numerous casts.
> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org


It could be that this patch exposed a side problem that was already
in the system. I searched for the first commit with this problem starting
from 6 Jan 2025; two earlier commits hung the same way, but neither
reproduced the hang when I repeated the test. Starting from the patch
above, the hang reproduces on Linux and macOS.

I am also afraid that the same behaviour will occur with other kinds of
parallel workers under a debugger (Parallel Hash etc.).

-- 
Best regards,

Vladlen Popolitov.



Vladlen Popolitov <v.popolitov@postgrespro.ru> writes:
> d) The query above starts parallel worker(s). When the worker(s) finish,
> they send SIGUSR1, which is caught by the debugger. After you dismiss
> the signal message, the query appears to keep running, but it is actually
> waiting (in latch.c or in waiteventset.c, depending on the commit).

I'm fairly skeptical of this.  IME, when you see something like that,
the actual problem is that the debugger has failed to pass the signal
on to the program-under-test.

> I tracked this behaviour down to commit
> commit 7202d72787d3b93b692feae62ee963238580c877

... and that raises my skepticism to stratospheric levels, because
that commit did exactly nothing that would have changed runtime
behavior.

            regards, tom lane



Tom Lane wrote on 2025-03-26 22:38:
> Vladlen Popolitov <v.popolitov@postgrespro.ru> writes:
>> d) The query above starts parallel worker(s). When the worker(s) finish,
>> they send SIGUSR1, which is caught by the debugger. After you dismiss
>> the signal message, the query appears to keep running, but it is actually
>> waiting (in latch.c or in waiteventset.c, depending on the commit).
> 
> I'm fairly skeptical of this.  IME, when you see something like that,
> the actual problem is that the debugger has failed to pass the signal
> on to the program-under-test.
> 
>> I tracked this behaviour down to commit
>> commit 7202d72787d3b93b692feae62ee963238580c877
> 
> ... and that raises my skepticism to stratospheric levels, because
> that commit did exactly nothing that would have changed runtime
> behavior.
> 
>             regards, tom lane

Hi Tom,

I had no problems with the debugger and parallel workers
before this patch. I am on a Mac, with VS Code as the debugging environment.
I asked a colleague to check it on Linux, and he reproduced it
immediately. As far as I remember, he usually uses gdb.

Normally a parallel worker informs the leader about its status
through shared memory. I am not sure a debugger can
affect this. I think it introduces an additional pause, and the leader
then does something it would not do without the pause.

I also did not find anything suspicious in the commit, but I checked
dozens of commits (30-40) before and after it, and the binary search
stopped at this patch. Every commit after it reproduces the behaviour.

-- 
Best regards,

Vladlen Popolitov.



Hi!

Vladlen, I've checked your example and also got this behavior
on the current master under Gdb (Ubuntu 22.04 LTS).

On the query above, the server waits in this state (as shown by gdb):
Program received signal SIGUSR1, User defined signal 1.
0x0000562beb21c6ed in ExecInterpExpr (state=0x562bec781858, econtext=<optimized out>, isnull=<optimized out>) at execExprInterp.c:625
625                             return state->resvalue;

However, the process continues to work normally after gdb is detached from it.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company



On Wed, Mar 26, 2025 at 12:32 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
> Vladlen, I've checked your example and also got this behavior
> on the current master under Gdb (Ubuntu 22.04 LTS).
>
> On the query above, the server waits in this state (as shown by gdb):
> Program received signal SIGUSR1, User defined signal 1.
> 0x0000562beb21c6ed in ExecInterpExpr (state=0x562bec781858, econtext=<optimized out>, isnull=<optimized out>) at execExprInterp.c:625
> 625                             return state->resvalue;
>
> However, the process continues to work normally after gdb is detached from it.

Do either of you have any theory as to how removing a cast could cause
this behavior?

The only idea that occurs to me is a bug in gdb.

--
Robert Haas
EDB: http://www.enterprisedb.com



Hi,

On 2025-03-26 21:53:35 +0700, Vladlen Popolitov wrote:
> During a debugging session I found that queries with a Parallel Seq Scan hang
> on current master: the leader process waits indefinitely for the signal
> from the parallel workers. The query cannot be interrupted, because the leader
> does not check the interrupt status in the waiting loop.
> 
> 1. How to reproduce:
> a) Create table:
> 
> CREATE DATABASE expr;
> \c expr
> CREATE TABLE testexpr(
> id INT,
> val INT
> );
> INSERT INTO testexpr (id, val)
> SELECT serie as id , MOD(serie, 10) as val
> FROM generate_series(1,1000000) as serie;
> EXPLAIN (ANALYZE) SELECT * FROM testexpr
> WHERE val=1 AND id<30;
> 
> b) Attach a debugger to the backend process serving this connection.
>
> c) Run the query (parallel workers must be enabled, as they are in the
> default configuration):
> EXPLAIN (ANALYZE) SELECT * FROM testexpr
> WHERE val=1 AND id<30;
> 
> d) The query above starts parallel worker(s). When the worker(s) finish,
> they send SIGUSR1, which is caught by the debugger. After you dismiss
> the signal message, the query appears to keep running, but it is actually
> waiting (in latch.c or in waiteventset.c, depending on the commit).

Isn't that to be expected? If I understand correctly, the way your gdb is
configured is that it intercepts SIGUSR1 signals *without* passing them on to
the application (i.e. postgres).  We rely on the signal to be delivered. Which
it isn't. Thus a hang.
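
To illustrate why an undelivered signal turns into a hang, here is a minimal Python sketch. It is not PostgreSQL code: the names wait_for_worker and deliver_signal are hypothetical, the self-pipe is only a rough stand-in for a latch, and setting SIGUSR1 to SIG_IGN is an assumed approximation of a debugger that swallows the signal. The timeout exists only so that the demonstration terminates; a waiter without one would simply block.

```python
import os
import select
import signal
import time


def wait_for_worker(deliver_signal: bool, timeout: float = 2.0) -> bool:
    """Return True if the 'leader' was woken by SIGUSR1 before the timeout."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)

    def on_sigusr1(signum, frame):
        # Make the pipe readable so the select() below returns.
        os.write(write_fd, b"x")

    # Assumption: SIG_IGN models a debugger configured with "nopass",
    # i.e. the signal never reaches the program.
    handler = on_sigusr1 if deliver_signal else signal.SIG_IGN
    previous = signal.signal(signal.SIGUSR1, handler)

    leader_pid = os.getpid()
    child = os.fork()
    if child == 0:
        # "Worker": pretend to work, then notify the leader and exit.
        time.sleep(0.2)
        os.kill(leader_pid, signal.SIGUSR1)
        os._exit(0)

    try:
        # "Leader": sleep until the handler writes to the pipe or we time out.
        ready, _, _ = select.select([read_fd], [], [], timeout)
        return bool(ready)
    finally:
        os.waitpid(child, 0)
        signal.signal(
            signal.SIGUSR1,
            previous if previous is not None else signal.SIG_DFL,
        )
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("signal delivered:", wait_for_worker(deliver_signal=True))
    print("signal swallowed:", wait_for_worker(deliver_signal=False, timeout=1.0))
```

In the second call the worker has finished and sent its notification, yet the leader never wakes up, which matches the symptom described above.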

At least my gdb doesn't intercept SIGUSR1 by default. It's a newer gdb though,
so that could have been different in the past (although I don't remember a
different behaviour).

(gdb) handle SIGUSR1
Signal        Stop      Print   Pass to program Description
SIGUSR1       No        No      Yes             User defined signal 1

If I change the configuration to not pass it, but print it, I can reproduce a
hang:
handle SIGUSR1 print nopass


What does your gdb show for "handle SIGUSR1"? If it isn't what I reported, is
it possible that you set that in your .gdbinit or such?
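
For anyone who wants to check this from a script, the following is a small sketch that parses one data row of the "handle SIGUSR1" table. It assumes the whitespace-separated column layout shown above (Signal, Stop, Print, Pass to program, Description); SignalHandling and parse_handle_line are hypothetical names, and other gdb versions may format the table differently.

```python
from typing import NamedTuple


class SignalHandling(NamedTuple):
    name: str
    stop: bool
    print_: bool
    pass_to_program: bool
    description: str


def parse_handle_line(line: str) -> SignalHandling:
    """Parse one data row of gdb's `handle SIGNAL` output table."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"unexpected handle output: {line!r}")
    name, stop, print_, pass_ = fields[:4]
    return SignalHandling(
        name=name,
        stop=(stop == "Yes"),
        print_=(print_ == "Yes"),
        pass_to_program=(pass_ == "Yes"),
        description=" ".join(fields[4:]),
    )


if __name__ == "__main__":
    # Row copied from the gdb session quoted in this thread.
    row = parse_handle_line("SIGUSR1       No    No    Yes        User defined signal 1")
    if not row.pass_to_program:
        print(f"{row.name} is not passed to the program; expect a hang")
    else:
        print(f"{row.name} is passed to the program")
```

A row whose "Pass to program" column is "No" corresponds to the "nopass" setting that reproduces the hang.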



> 2. Original commit with reproducible behaviour.
> I tracked this behaviour down to commit
> > commit 7202d72787d3b93b692feae62ee963238580c877
> > Date:   Fri Feb 21 08:03:33 2025 +0100
> > backend launchers void * arguments for binary data
> > Change backend launcher functions to take void * for binary data
> > instead of char *.  This removes the need for numerous casts.
> > Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org

I also find it very hard to believe that this commit introduced this problem -
it doesn't sound like a postgres issue to me.  I can reproduce it in PG 16,
after doing "handle SIGUSR1 print nopass".

Greetings,

Andres Freund



Andres Freund wrote on 2025-03-27 01:22:
> Hi,
> 
> 
> Isn't that to be expected? If I understand correctly, the way your gdb is
> configured is that it intercepts SIGUSR1 signals *without* passing them on to
> the application (i.e. postgres).  We rely on the signal to be delivered. Which
> it isn't. Thus a hang.
>
> At least my gdb doesn't intercept SIGUSR1 by default. It's a newer gdb though,
> so that could have been different in the past (although I don't remember a
> different behaviour).
>
> (gdb) handle SIGUSR1
> Signal        Stop      Print   Pass to program Description
> SIGUSR1       No        No      Yes             User defined signal 1
>
> If I change the configuration to not pass it, but print it, I can reproduce a
> hang:
> handle SIGUSR1 print nopass
>
>
> What does your gdb show for "handle SIGUSR1"? If it isn't what I reported, is
> it possible that you set that in your .gdbinit or such?
> 
> 
> 
>> 2. Original commit with reproducible behaviour.
>> I tracked this behaviour down to commit
>> > commit 7202d72787d3b93b692feae62ee963238580c877
>> > Date:   Fri Feb 21 08:03:33 2025 +0100
>> > backend launchers void * arguments for binary data
>> > Change backend launcher functions to take void * for binary data
>> > instead of char *.  This removes the need for numerous casts.
>> > Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org
> 
> I also find it very hard to believe that this commit introduced this problem -
> it doesn't sound like a postgres issue to me.  I can reproduce it in PG 16,
> after doing "handle SIGUSR1 print nopass".
> 
> Greetings,
> 
> Andres Freund

Dear colleagues,

I changed the debugging extension in VS Code (from C_Cpp to CodeLLDB) and
the hang disappeared. Thank you for pointing out that it is related to
the debugger rather than to Postgres.

Both extensions process signals internally and do not expose the
handlers in their settings.

Thank you for your support and time.

-- 
Best regards,

Vladlen Popolitov.