Re: Consider parallel for lateral subqueries with limit - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Consider parallel for lateral subqueries with limit |
Date | |
Msg-id | 2888169.1646174122@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Consider parallel for lateral subqueries with limit (James Coleman <jtc331@gmail.com>) |
Responses |
Re: Consider parallel for lateral subqueries with limit
Re: Consider parallel for lateral subqueries with limit |
List | pgsql-hackers |
James Coleman <jtc331@gmail.com> writes: > On Tue, Jan 4, 2022 at 9:59 PM James Coleman <jtc331@gmail.com> wrote: >> On Tue, Jan 4, 2022 at 5:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I don't really see why this patch is even a little bit safe. > Suppose lateral is not in play. Then if we have a plan like: > Gather > NestLoop > Scan X > Limit > Scan Y > Because we have the result "X join Limit(Y)" we need "Limit(Y)" to be > consistent across all of the possible executions of "Limit(Y)" (i.e., > in each worker it executes in). That means (absent infrastructure for > guaranteeing a unique ordering) we obviously can't parallelize the > inner side of the join as the limit may be applied in different ways > in each worker's execution. > Now suppose lateral is in play. Then (given the same plan) instead of > our result being "X join Limit(Y)" the result is "X join Limit(Y sub > X)", that is, each row in X is joined to a unique invocation of > "Limit(Y)". This argument seems to be assuming that Y is laterally dependent on X, but the patch as written will take *any* lateral dependency as a get-out-of-jail-free card. If what we have is "Limit(Y sub Z)" where Z is somewhere else in the query tree, it's not apparent to me that your argument holds. But more generally, I don't think you've addressed the fundamental concern, which is that a query involving Limit is potentially nondeterministic (if it lacks a fully-deterministic ORDER BY), so that different workers could get different answers from it if they're using a plan type that permits that to happen. (See the original discussion that led to 75f9c4ca5, at [1].) I do not see how a lateral dependency removes that hazard. The bug report that started the original discussion hit the problem because it generated a plan like Gather -> Hash Semi Join -> Parallel Seq Scan -> Hash -> Limit -> Seq Scan We didn't make the submitter drill down far enough to verify exactly why he got nondeterministic results from the Limit, but I suppose the reason was that the table was big enough to trigger "synchronize_seqscans" behavior, allowing different workers to read different parts of that table. Now, that particular case didn't have any lateral dependency, but if there was one it'd just have resulted in changing the hash join to a nestloop join, and the nondeterminism hazard would be exactly the same AFAICS. > In this case we are already conceivably getting different > results for each execution of the subquery "Limit(Y)" even if we're > not running those executions across multiple workers. That seems to be about the same argument Andres made initially in the old thread, but we soon shot that down as not being the level of guarantee we want to provide. There's nothing in the SQL standard that says that select * from events where account in (select account from events where data->>'page' = 'success.html' limit 3); (the original problem query) shall execute the sub-query only once, but people expect it to act that way. If you want to improve this area, my feeling is that it'd be better to look into what was speculated about in the old thread: LIMIT doesn't create nondeterminism if the query has an ORDER BY that imposes a unique row ordering, ie order-by-primary-key. We didn't have planner infrastructure that would allow checking that cheaply in 2018, but maybe there is some now? regards, tom lane [1] https://www.postgresql.org/message-id/flat/153417684333.10284.11356259990921828616%40wrigleys.postgresql.org
pgsql-hackers by date: