Re: Consider parallel for lateral subqueries with limit - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Consider parallel for lateral subqueries with limit |
Date | |
Msg-id | CA+TgmoZw+mbjyoeBhTFp_Za8dUT4i0=4j3WN_fByfqy-7uzsUA@mail.gmail.com Whole thread Raw |
In response to | Re: Consider parallel for lateral subqueries with limit (James Coleman <jtc331@gmail.com>) |
Responses |
Re: Consider parallel for lateral subqueries with limit
|
List | pgsql-hackers |
On Thu, Sep 22, 2022 at 5:19 PM James Coleman <jtc331@gmail.com> wrote: > > Your sample query gets a plan like this: > > > > Nested Loop Left Join (cost=0.00..1700245.00 rows=10000 width=8) > > -> Seq Scan on foo (cost=0.00..145.00 rows=10000 width=4) > > -> Limit (cost=0.00..170.00 rows=1 width=4) > > -> Seq Scan on bar (cost=0.00..170.00 rows=1 width=4) > > Filter: (foo.a = a) > > > > If this were to occur inside a larger plan tree someplace, it would be > > OK to insert a Gather node above the Nested Loop node without doing > > anything further, because then the parameter that stores foo.a would > > be both set and used in the worker. If you wanted to insert a Gather > > at any other place in this plan, things get more complicated. But just > > because you have LATERAL doesn't mean that you have this problem, > > because if you delete the "limit 1" then the subqueries get flattened > > together and the parameter disappears, > > For future reference in this email thread when you remove the "limit > 1" this is the plan you get: > > Merge Right Join (cost=372.18..815.71 rows=28815 width=8) > Merge Cond: (bar.a = foo.a) > -> Sort (cost=158.51..164.16 rows=2260 width=8) > Sort Key: bar.a > -> Seq Scan on bar (cost=0.00..32.60 rows=2260 width=8) > -> Sort (cost=179.78..186.16 rows=2550 width=4) > Sort Key: foo.a > -> Seq Scan on foo (cost=0.00..35.50 rows=2550 width=4) > > Just to make sure I'm following: by "doesn't mean that you have this > problem" you mean "doesn't mean you have this limitation on parallel > query"? I'm talking specifically about whether there's a parameter. The problem here isn't created by LATERAL, but by parameters. In the nested loop plan, there's a parameter that's storing foo.a, and the storage location that holds that parameter value is in backend-private memory, so you can't set the value in the leader and then use it in a worker, and that restricts where you can insert a Gather node. But in the Merge Join plan (or if you got a Hash Join plan) there is no parameter. So the fact that parameter storage isn't shared between leaders and workers does not matter. > Yes, that's a good point too. I need to play with these examples and > confirm whether lateral_relids gets set in that case. IIRC when that's > set isn't exactly the same as whether or not the LATERAL keyword is > used, and I should clarify that my claims here are meant to be about > when we execute it that way regardless of the keyword usage. The > keyword usage I'd assumed just made it easier to talk about, but maybe > you're implying that it's actually generating confusion. Yes, I think so. Stepping back a bit, commit 75f9c4ca5a8047d7a9cfbc7d51a610933d04dc7f introduced the code that is at issue here, and it took the position that limit/offset should be marked parallel-restricted because they're nondeterministic. I'm not sure that's really quite the right thing. The original idea behind having things be parallel-restricted was that there might be stuff which depends on state in the leader that is not replicated to or shared with the workers, and thus it must be executed in the leader to get the right answer. Here, however, there is no such problem. Something like LIMIT/OFFSET that is nondeterministic is perfectly safe to execute in a worker, or in the leader. It's just not safe to execute the same computation more than once and assume that we will get the same answer each time. But that's really a different problem. And the problem that you've run into here, I think, is that if a Limit node is getting fed a parameter from higher up in the query plan, then it's not really the same computation being performed every time, and thus the non-determinism doesn't matter, and thus the parallel-restriction goes away, but the code doesn't know that. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: