Re: Consider parallel for lateral subqueries with limit - Mailing list pgsql-hackers
From | James Coleman |
---|---|
Subject | Re: Consider parallel for lateral subqueries with limit |
Date | |
Msg-id | CAAaqYe8yfUkgYsG418a=_PjVdBnX4dxv+qq5bWcYy+KiampuPw@mail.gmail.com Whole thread Raw |
In response to | Re: Consider parallel for lateral subqueries with limit (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Consider parallel for lateral subqueries with limit
Re: Consider parallel for lateral subqueries with limit |
List | pgsql-hackers |
On Mon, Sep 19, 2022 at 4:29 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Sep 19, 2022 at 3:58 PM James Coleman <jtc331@gmail.com> wrote: > > But in the case where there's correlation via LATERAL we already don't > > guarantee unique executions for a given set of params into the lateral > > subquery execution, right? For example, suppose we have: > > > > select * > > from foo > > left join lateral ( > > select n > > from bar > > where bar.a = foo.a > > limit 1 > > ) on true > > > > and suppose that foo.a (in the table scan) returns these values in > > order: 1, 2, 1. In that case we'll execute the lateral subquery 3 > > separate times rather than attempting to order the results of foo.a > > such that we can re-execute the subquery only when the param changes > > to a new unique value (and we definitely don't cache the results to > > guarantee unique subquery executions). > > I think this is true, but I don't really understand why we should > focus on LATERAL here. What we really need, and I feel like we've > talked about this before, is a way to reason about where parameters > are set and used. Yes, over in the thread "Parallelize correlated subqueries that execute within each worker" [1] which was meant to build on top of this work (though is technically separate). Your bringing it up here too is making me wonder if we can combine the two and instead of always allowing subqueries with LIMIT instead (like the other patch does) delay final determination of parallel safety of rels and paths (perhaps, as that thread is discussing, until gather/gather merge path creation). Side note: I'd kinda like a way (maybe a development GUC or even a compile time option) to have EXPLAIN show where params are set. If something like that exists, egg on my face I guess, but I'd love to know about it. > Your sample query gets a plan like this: > > Nested Loop Left Join (cost=0.00..1700245.00 rows=10000 width=8) > -> Seq Scan on foo (cost=0.00..145.00 rows=10000 width=4) > -> Limit (cost=0.00..170.00 rows=1 width=4) > -> Seq Scan on bar (cost=0.00..170.00 rows=1 width=4) > Filter: (foo.a = a) > > If this were to occur inside a larger plan tree someplace, it would be > OK to insert a Gather node above the Nested Loop node without doing > anything further, because then the parameter that stores foo.a would > be both set and used in the worker. If you wanted to insert a Gather > at any other place in this plan, things get more complicated. But just > because you have LATERAL doesn't mean that you have this problem, > because if you delete the "limit 1" then the subqueries get flattened > together and the parameter disappears, For future reference in this email thread when you remove the "limit 1" this is the plan you get: Merge Right Join (cost=372.18..815.71 rows=28815 width=8) Merge Cond: (bar.a = foo.a) -> Sort (cost=158.51..164.16 rows=2260 width=8) Sort Key: bar.a -> Seq Scan on bar (cost=0.00..32.60 rows=2260 width=8) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: foo.a -> Seq Scan on foo (cost=0.00..35.50 rows=2550 width=4) Just to make sure I'm following: by "doesn't mean that you have this problem" you mean "doesn't mean you have this limitation on parallel query"? The "flattening out" (conversion to join between two table scans executed a single time each) is an interesting case because I consider that to be "just" a performance optimization, and therefore I don't think anything about the guarantees a user expects should change. But interestingly: it does end up providing stronger guarantees about the query results than it would if the conversion didn't happen (the subquery results in only a single table scan whereas without the change a scan per outer row means it's *possible* to get different results across different outer rows even with the same join key value). > and if you delete the lateral > reference (i.e. WHERE foo.a = bar.a) then there's still a subquery but > it no longer refers to an outer parameter. And on the flip side just > because you don't have LATERAL doesn't mean that you don't have this > problem. e.g. the query could instead be: > > select *, (select n from bar where bar.a = foo.a limit 1) from foo; > > ...which I think is pretty much equivalent to your formulation and has > the same problem as far as parallel query as your formulation but does > not involve the LATERAL keyword. Yes, that's a good point too. I need to play with these examples and confirm whether lateral_relids gets set in that case. IIRC when that's set isn't exactly the same as whether or not the LATERAL keyword is used, and I should clarify that my claims here are meant to be about when we execute it that way regardless of the keyword usage. The keyword usage I'd assumed just made it easier to talk about, but maybe you're implying that it's actually generating confusion. James Coleman 1: https://www.postgresql.org/message-id/CA%2BTgmoYXm2NCLt1nikWfYj1_r3%3DfsoNCHCtDVdN7X1uX_xuXgw%40mail.gmail.com
pgsql-hackers by date: