Re: Performance issues with parallelism and LIMIT - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Performance issues with parallelism and LIMIT
Msg-id: c7a3cfcc-81a6-4dac-b064-7582ef9754bb@vondra.me
In response to: Re: Performance issues with parallelism and LIMIT (David Geier <geidav.pg@gmail.com>)
List: pgsql-hackers

On 11/19/25 13:28, David Geier wrote:
> On 18.11.2025 20:37, Tomas Vondra wrote:
>>
>>
>> On 11/18/25 19:35, David Geier wrote:
>>>
>>> On 18.11.2025 18:31, Tomas Vondra wrote:
>>>> On 11/18/25 17:51, Tom Lane wrote:
>>>>> David Geier <geidav.pg@gmail.com> writes:
>>>>>> On 18.11.2025 16:40, Tomas Vondra wrote:
>>>>>>> It'd need code in the parallel-aware scans, i.e. seqscan, bitmap, index.
>>>>>>> I don't think you'd need code in other plans, because all parallel plans
>>>>>>> have one "driving" table.
>>>>>
>>>>> You're assuming that the planner will insert Gather nodes at arbitrary
>>>>> places in the plan, which isn't true.  If it does generate plans that
>>>>> are problematic from this standpoint, maybe the answer is "don't
>>>>> parallelize in exactly that way".
>>>>>
>>>>
>>>> I think David has a point that nodes that "buffer" tuples (like Sort or
>>>> HashAgg) would break the approach making this the responsibility of the
>>>> parallel-aware scan. I don't see anything particularly wrong with such
>>>> plans - plans with partial aggregation often look like that.
>>>>
>>>> Maybe this should be the responsibility of execProcnode.c, not the
>>>> various nodes?
> I hadn't realized that this approach has the same limitation:
> ExecProcNode() is only called once e.g. heap_getnextslot() or
> index_getnext_slot() has found a qualifying tuple. Otherwise, they
> just keep crunching without returning.
> 

Right, that's why I suggested having a function the nodes would call in
suitable places.
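
To make that concrete, here's a minimal sketch of what I have in mind.
All the names (ExecStopRequested(), es_stop_requested) are made up -
nothing like this exists in core today:

    /*
     * Hypothetical helper; the es_stop_requested flag would be set
     * once the leader tells the workers to stop early.
     */
    static inline bool
    ExecStopRequested(PlanState *node)
    {
        return node->state->es_stop_requested;
    }

    /* ... called from a scan's inner loop, next to the existing
     * interrupt check, so we also bail while "crunching": */
    for (;;)
    {
        CHECK_FOR_INTERRUPTS();

        if (ExecStopRequested(&scanstate->ss.ps))
            break;      /* pretend we've reached the end of the scan */

        /* ... fetch and qual-check the next tuple ... */
    }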

>>> I like that idea, even though it would still not work while a node
>>> is doing the crunching, i.e. after it has pulled all rows but before
>>> it can return the first row. During that time the node won't call
>>> ExecProcNode().
>>>
>> True. Perhaps we could provide a function nodes could call in suitable
>> places to check whether to end?
> This function would then also be required by the base relation scans.
> And we would have to call it more or less in the same places
> CHECK_FOR_INTERRUPTS() is called today.
> 

Yes, but I don't think CHECK_FOR_INTERRUPTS() would be a good place to
manipulate the executor state. Maybe you could do some magic with
siglongjmp(), but I have a "funny" feeling about that - I wouldn't be
surprised if it interfered with elog(), which is the only other place
using siglongjmp() AFAICS.
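
For reference, the machinery this would have to coexist with looks
roughly like this (heavily simplified from PostgresMain() in
src/backend/tcop/postgres.c):

    /*
     * elog(ERROR) ends up in pg_re_throw(), which does
     * siglongjmp(*PG_exception_stack, 1) and lands back here.
     */
    sigjmp_buf  local_sigjmp_buf;

    if (sigsetjmp(local_sigjmp_buf, 1) != 0)
    {
        /* error recovery: report the error, abort the transaction */
        EmitErrorReport();
        AbortCurrentTransaction();
        /* ... more cleanup, then back to the command loop ... */
    }

    /* from here on, elog(ERROR) jumps to the sigsetjmp above */
    PG_exception_stack = &local_sigjmp_buf;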

Which is why I suggested maybe it should be handled in execProcnode.c
(which would take care of the cases where we produce a tuple), and then
let nodes optionally check the flag too (through a new function).
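
In execProcnode.c terms, roughly this (the ParallelStopPending flag is
hypothetical; the surrounding function is the real ExecProcNode() from
src/include/executor/executor.h, simplified):

    static inline TupleTableSlot *
    ExecProcNode(PlanState *node)
    {
        if (node->chgParam != NULL) /* something changed? */
            ExecReScan(node);       /* let ReScan handle this */

        /*
         * Hypothetical early-stop check: returning NULL makes the
         * parent believe the subtree is exhausted, so the plan shuts
         * down through the regular paths.
         */
        if (ParallelStopPending)
            return NULL;

        return node->ExecProcNode(node);
    }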

I haven't tried doing this, so maybe I'm missing something ...

> Beyond that, code such as heap_getnextslot() or index_getnext_slot()
> doesn't have access to the PlanState, which would contain the stop
> flag. So that would have to be propagated downwards as well.
> 
> All of that would make for a fairly big patch, which the initial patch
> avoids.
> 

Right. I don't think we can set the flag in the plan/executor state,
because that's not available in the signal handler / ProcessInterrupts()
... It'd need to be a global variable, I guess.
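
Something modeled on QueryCancelPending, say (made-up names again):

    /*
     * Hypothetical global, modeled on QueryCancelPending in
     * src/backend/utils/init/globals.c.  sig_atomic_t, so it's safe
     * to set from a signal handler.
     */
    volatile sig_atomic_t ParallelStopPending = false;

    /* invoked via the ProcSignal machinery when the leader asks the
     * worker to stop; the executor checks the flag later */
    void
    HandleParallelStopInterrupt(void)
    {
        ParallelStopPending = true;
        SetLatch(MyLatch);      /* wake the worker if it's waiting */
    }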

>>
>> Actually, how does canceling queries with parallel workers work? Is that
>> done similarly to what your patch did?
> Parallel workers use the same mechanism as normal backends, except
> that they quit instead of waiting for the next query.
> 
> The flow is as follows: parallel workers catch SIGINT via
> StatementCancelHandler() which sets QueryCancelPending = true. When
> ProcessInterrupts() is called the next time, it will elog(ERROR) out.
> BackgroundWorkerMain() will catch the error and proc_exit().
> 
> This mechanism is very similar to what I have in my patch, with the
> difference that (1) I use SendProcSignal() to tell the workers to stop
> and (2) I added another sigsetjmp target around ExecutorRun() to be
> able to bail out but still run all the shutdown code.
> 

OK
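
For anyone following along, the SIGINT side of that flow is short
(slightly simplified from src/backend/tcop/postgres.c):

    /*
     * SIGINT handler used by regular backends and parallel workers.
     * It only sets flags; the actual elog(ERROR) happens later, when
     * ProcessInterrupts() runs at the next CHECK_FOR_INTERRUPTS().
     */
    void
    StatementCancelHandler(SIGNAL_ARGS)
    {
        int         save_errno = errno;

        InterruptPending = true;
        QueryCancelPending = true;
        SetLatch(MyLatch);

        errno = save_errno;
    }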

> (1) is necessary to be able to distinguish an early stop from a query
> cancel.
> 
> (2) is necessary because the parallel shutdown code needs to run
> before exiting. I tried to piggyback on the existing error handling
> mechanism via siglongjmp(*PG_exception_stack, 1), skipping
> EmitErrorReport() there if got_stopped == true. That gave me the
> following error:
> 
> ERROR:  lost connection to parallel worker
> 

Not sure. I have my doubts about using siglongjmp() for this.
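
Just so we're talking about the same thing, I assume the patch does
something like this around ExecutorRun() (a sketch; got_stopped and the
extra jump target are from your patch, not from core, and the
ExecutorRun() argument list varies across versions):

    sigjmp_buf  stop_jmp_buf;
    sigjmp_buf *save_exception_stack = PG_exception_stack;

    if (sigsetjmp(stop_jmp_buf, 0) == 0)
    {
        PG_exception_stack = &stop_jmp_buf;
        ExecutorRun(queryDesc, ForwardScanDirection, 0);
    }
    else if (!got_stopped)
    {
        /* a genuine error: hand it over to the outer handler */
        PG_exception_stack = save_exception_stack;
        PG_RE_THROW();
    }
    PG_exception_stack = save_exception_stack;

    /* both the normal and the early-stop path run the shutdown code */
    ExecutorFinish(queryDesc);
    ExecutorEnd(queryDesc);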

>>> Inspectability on that end seems useful. Maybe only with VERBOSE,
>>> similarly to the extended per-worker information.
>>>
>>
>> Maybe, no opinion. But it probably needs to apply to all nodes in the
>> parallel worker, right? Or maybe it's even a per-worker detail.
> I thought to make it a per-worker detail such as time or number of rows
> returned. Let's discuss this again, once we've settled on a solution.
> 

OK


regards

-- 
Tomas Vondra



