Thread: nested query vs left join: query planner very confused
I've got two tables, sigs and mags. It's a one-to-one relationship; mags is
just split out because we store a big, less-often-used field there. "signum"
is the key field.

Sometimes I want to know if I have any orphans in mags, so I do a query like
this:

    select signum from lp.Mags where signum is not null and signum not
    in (select lp.Sigs.signum from lp.Sigs)

(I do this as a subquery because we originally had an old Sybase DB where
outer joins were a non-standard pain and this way works the same and is
DBMS-agnostic.)

At my location, this query runs very fast (~50ms on a ~100k row table) and
'explain' shows a plan with this structure:

    Seq scan on mags
      Filter:
        SubPlan 1
          Seq scan on sigs

At my client's location, the query is very slow (same table size, similar
hardware/config, although they are running 9.0.x and I'm on 9.2.x). By
"slow" I mean like an *hour*. 'explain' has this structure:

    Seq scan on mags
      Filter:
        SubPlan 1
          Materialize
            Seq scan on sigs

I'd never heard of Materialize before, so I looked into it. Seems to make a
virtual table of the subquery so repetitions of the parent query don't have
to re-do the work. Sounds like it should only help, right?

The client's 'explain analyze' shows this:

    Seq Scan on mags  (cost=0.00..187700750.56 rows=47476 width=4) (actual time=3004851.889..3004851.889 rows=0 loops=1)
      Filter: ((signum IS NOT NULL) AND (NOT (SubPlan 1)))
      SubPlan 1
        ->  Materialize  (cost=0.00..3713.93 rows=95862 width=4) (actual time=0.011..16.145 rows=48139 loops=94951)
              ->  Seq Scan on sigs  (cost=0.00..2906.62 rows=95862 width=4) (actual time=0.010..674.201 rows=95862 loops=1)
    Total runtime: 3004852.005 ms

If I'm reading this correctly, the Materialize is running ~95k times and
taking the majority of the time. Why? The only thing I can think of is this
scenario:

    1) server thinks it has a LOT of RAM
    2) decides to Materialize subquery to take advantage
    3) machine does not actually have that RAM, so it gets swapped
    4) server notices it was swapped and decides to re-run rather than unswap
    5) goto 2

I don't know if that's a realistic scenario, but it's all I got. I'm already
well into unknown territory, performance-tuning-wise.

I also decided to try doing the query a different way:

    select lp.mags.signum from lp.mags left join lp.sigs on
    lp.mags.signum = lp.sigs.signum where lp.mags.signum is not null
    and lp.sigs.signum is null;

This one runs fast for both of us. So I guess my second question is: why
can't the query planner tell these are the same query?
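(For concreteness, the relevant layout is roughly the following. This is an
illustrative sketch only; the exact column types and the name "big_field"
are placeholders, and all that matters for the question is that signum is
the shared key and that nothing forces every mags row to have a match in
sigs, or there could be no orphans.)

    create table lp.sigs (
        signum integer primary key
        -- ... other columns ...
    );

    create table lp.mags (
        signum    integer,  -- should correspond to a row in lp.sigs
        big_field text      -- the large, less-often-used field
    );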
On Wed, Nov 27, 2013 at 7:56 AM, David Rysdam <drysdam@ll.mit.edu> wrote:
> At my client's location, the query is very slow (same table size,
> similar hardware/config, although they are running 9.0.x and I'm on
> 9.2.x). By "slow" I mean like an *hour*. 'explain' has this structure:
>
>     Seq scan on mags
>       Filter:
>         SubPlan 1
>           Materialize
>             Seq scan on sigs
>
> I'd never heard of Materialize before, so I looked into it. Seems to
> make a virtual table of the subquery so repetitions of the parent query
> don't have to re-do the work. Sounds like it should only help, right?
>
> The client's 'explain analyze' shows this:
>
>     Seq Scan on mags  (cost=0.00..187700750.56 rows=47476 width=4) (actual time=3004851.889..3004851.889 rows=0 loops=1)
>       Filter: ((signum IS NOT NULL) AND (NOT (SubPlan 1)))
>       SubPlan 1
>         ->  Materialize  (cost=0.00..3713.93 rows=95862 width=4) (actual time=0.011..16.145 rows=48139 loops=94951)
>               ->  Seq Scan on sigs  (cost=0.00..2906.62 rows=95862 width=4) (actual time=0.010..674.201 rows=95862 loops=1)
>     Total runtime: 3004852.005 ms
Has the client ANALYZEd recently? What happens if the client issues the following commands before executing the query?
VACUUM ANALYZE lp.sigs;
VACUUM ANALYZE lp.mags;
If that doesn't change the plan, could you post the values for effective_cache_size, shared_buffers, random_page_cost, cpu_tuple_cost, work_mem and how much RAM is in the client machine?
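If it's easier to grab them all at once, something along these lines works;
pg_settings also shows whether each value is a default or came from the
config file:

    select name, setting, unit, source
      from pg_settings
     where name in ('effective_cache_size', 'shared_buffers',
                    'random_page_cost', 'cpu_tuple_cost', 'work_mem');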
On 11/27/2013 04:56 PM, David Rysdam wrote:
> I've got two tables, sigs and mags. It's a one-to-one relationship, mags
> is just split out because we store a big, less-often-used field
> there. "signum" is the key field.
>
> Sometimes I want to know if I have any orphans in mags, so I do a query
> like this:
>
>     select signum from lp.Mags where signum is not null and signum not
>     in (select lp.Sigs.signum from lp.Sigs)

[...]

> I also decided to try doing the query a different way:
>
>     select lp.mags.signum from lp.mags left join lp.sigs on
>     lp.mags.signum = lp.sigs.signum where lp.mags.signum is not null
>     and lp.sigs.signum is null;
>
> This one runs fast for both of us. So I guess my second question is: why
> can't the query planner tell these are the same query?

Because they're not the same query. NOT IN has a bunch of semantics issues
regarding nulls which the anti-join in the second query does not have.

--
Vik
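A self-contained illustration of that difference, with a two-row VALUES list
standing in for lp.Sigs:

    -- returns no rows: with a NULL in the list, NOT IN evaluates to
    -- "unknown" rather than true, so the WHERE clause rejects the row
    select 1 where 2 not in (select x from (values (1), (null::int)) as t(x));

    -- returns one row: NOT EXISTS simply finds no matching row
    select 1 where not exists
        (select 1 from (values (1), (null::int)) as t(x) where t.x = 2);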
David Rysdam <drysdam@ll.mit.edu> writes:
> Sometimes I want to know if I have any orphans in mags, so I do a query
> like this:
>     select signum from lp.Mags where signum is not null and signum not
>     in (select lp.Sigs.signum from lp.Sigs)
> (I do this as a subquery because we originally had an old Sybase DB where
> outer joins were a non-standard pain and this way works the same and is
> DBMS-agnostic.)

DBMS-agnostic frequently means "sucks equally badly on all DBMSes" :-(

This query is hard to optimize because of the weird behavior of NOT IN when
nulls are involved. Since you aren't complaining that the query fails
entirely, I'm supposing that lp.Sigs.signum contains no nulls, but the
planner doesn't know that. If you can transform it to a NOT EXISTS, you'll
likely get a much better plan:

    select signum from lp.Mags where signum is not null and
    not exists (select 1 from lp.Sigs where lp.Sigs.signum = lp.Mags.signum)

What you want is an "anti join" plan, or at least a plan that mentions a
"hashed subplan". Plain subplans are death performance-wise, because they
amount to being nestloop joins rather than anything smarter. (In this case
it's likely not choosing a hashed subplan because work_mem is too small to
allow that.)

> I also decided to try doing the query a different way:
>     select lp.mags.signum from lp.mags left join lp.sigs on
>     lp.mags.signum = lp.sigs.signum where lp.mags.signum is not null
>     and lp.sigs.signum is null;

That's another way to get an anti-join (at least on recent PGs, I forget if
9.0 recognizes it).

> This one runs fast for both of us. So I guess my second question is: why
> can't the query planner tell these are the same query?

They aren't. See comment about behavior with NULLs.

			regards, tom lane
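(A quick way to check whether the rewrite gets the better plan, using the
NOT EXISTS form above verbatim: run a plain EXPLAIN on it and look for a
Hash Anti Join node, or at least "hashed SubPlan", instead of the plain
SubPlan. The exact node name can vary by version and settings.)

    explain
    select signum from lp.Mags where signum is not null and
    not exists (select 1 from lp.Sigs where lp.Sigs.signum = lp.Mags.signum);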
David Rysdam wrote
> I'd never heard of Materialize before, so I looked into it. Seems to
> make a virtual table of the subquery so repetitions of the parent query
> don't have to re-do the work. Sounds like it should only help, right?

Forgive any inaccuracies, but I'm pretty sure about the following:

Materialize in this sense means what you need doesn't fit in memory (likely
the work_mem setting) and needs to be saved to disk and streamed from there.
Since I/O is expensive, this kills. The virtual-table concept is mostly
implemented by hash (tables), not Materialize.

The Materialize is only running once, creating a ~95k record table; that
table is then scanned 95k times to locate a potential match for each input
row. Since Materialize does not index, it has to sequential-scan, which
takes forever.

As for the other question, why the difference: IN has to accommodate NULLs
in the lookup table; a join does not, and neither does EXISTS. If you can
replace the NOT IN with NOT EXISTS and write a correlated sub-query, you
should get the same plan as the LEFT JOIN version, IIRC.

David J.
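(The numbers in the posted plan bear that out: the Materialize node averages
about 16.1 ms per pass over 94,951 passes, which is on the order of 1,500
seconds spent just re-scanning the materialized sigs list, roughly half of
the ~3,005-second total, before counting the per-row NOT IN evaluation in
the outer scan.)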
On Wed, 27 Nov 2013 11:21:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> DBMS-agnostic frequently means "sucks equally badly on all DBMSes" :-(

We've generally been OK (cf the ~50ms runtime for the same query at our
site), but we also notice problems sooner than our client sometimes does and
can make algorithm improvements where we don't know how to make DB ones.

> This query is hard to optimize because of the weird behavior of NOT IN
> when nulls are involved. Since you aren't complaining that the query
> fails entirely, I'm supposing that lp.Sigs.signum contains no nulls,
> but the planner doesn't know that. If you can transform it to a NOT
> EXISTS, you'll likely get a much better plan:
>
>     select signum from lp.Mags where signum is not null and
>     not exists (select 1 from lp.Sigs where lp.Sigs.signum = lp.Mags.signum)

We've already shipped to the client, but I'm looking at how extensive a
patch would have to be. Very surprising we haven't hit this issue before.

> What you want is an "anti join" plan, or at least a plan that mentions
> a "hashed subplan". Plain subplans are death performance-wise, because
> they amount to being nestloop joins rather than anything smarter. (In
> this case it's likely not choosing a hashed subplan because work_mem is
> too small to allow that.)

I've got the client looking for this and other values already. We'll soon
know...
On Wed, 27 Nov 2013 11:06:51 -0500, bricklen <bricklen@gmail.com> wrote:
> Has the client ANALYZEd recently? What happens if the client issues
> the following commands before executing the query?
> VACUUM ANALYZE lp.sigs;
> VACUUM ANALYZE lp.mags;
>
> If that doesn't change the plan, could you post the values for
> effective_cache_size, shared_buffers, random_page_cost,
> cpu_tuple_cost, work_mem and how much RAM is in the client machine?

Yes, I did have them do a vacuum analyze with no result. Here's their reply
on configuration:

    It is 24 Gig.
    effective_cache_size - 12000MB
    shared_buffers - 1024MB
    random_page_cost - is commented out
    cpu_tuple_cost - commented out
    work_mem - commented out

I assume you guys already know the default values for those last 3 on a
9.0.x server...
David Rysdam <drysdam@ll.mit.edu> writes:
> effective_cache_size - 12000MB
> shared_buffers - 1024MB
> random_page_cost - is commented out
> cpu_tuple_cost - commented out
> work_mem - commented out

> I assume you guys already know the default values for those last 3 on a
> 9.0.x server...

Default work_mem is only 1MB, so that probably explains why you're not
getting a hashed subplan here. Have them knock it up some, say on the order
of 10MB. (If none of your queries are any more complicated than this one,
you could go higher. But keep in mind that a backend can use work_mem per
sort/hash/materialize step, not per query --- so complex queries can use
many times work_mem. Multiply that by the number of backends, and you can
end up in swap hell pretty quickly with an over optimistic value.)

			regards, tom lane
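(For testing, the change doesn't have to be server-wide. A sketch of trying
it per-session before touching the config file; 10MB here is just the
suggested starting point:)

    -- in a psql session at the client site, for this session only:
    set work_mem = '10MB';

    -- then check whether the plan now shows a hashed SubPlan instead of
    -- the Materialize, without risking another hour-long run:
    explain
    select signum from lp.Mags where signum is not null and signum not
    in (select lp.Sigs.signum from lp.Sigs);

    -- to make it permanent, set  work_mem = '10MB'  in postgresql.conf and
    -- reload the server (pg_ctl reload, or: select pg_reload_conf();)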
On Wed, 27 Nov 2013 13:02:20 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rysdam <drysdam@ll.mit.edu> writes:
> > effective_cache_size - 12000MB
> > shared_buffers - 1024MB
> > random_page_cost - is commented out
> > cpu_tuple_cost - commented out
> > work_mem - commented out
>
> > I assume you guys already know the default values for those last 3 on a
> > 9.0.x server...
>
> Default work_mem is only 1MB, so that probably explains why you're not
> getting a hashed subplan here. Have them knock it up some, say on the
> order of 10MB. (If none of your queries are any more complicated than
> this one, you could go higher. But keep in mind that a backend can use
> work_mem per sort/hash/materialize step, not per query --- so complex
> queries can use many times work_mem. Multiply that by the number of
> backends, and you can end up in swap hell pretty quickly with an over
> optimistic value.)

We deliberately try to keep our queries fairly simple for several reasons.
This isn't the most complicated, but they don't get much more than this.
I'll have them start with 10MB and see what they get.
On Wed, 27 Nov 2013 13:04:54 -0500, David Rysdam <drysdam@ll.mit.edu> wrote:
> We deliberately try to keep our queries fairly simple for several
> reasons. This isn't the most complicated, but they don't get much more
> than this. I'll have them start with 10MB and see what they get.

10MB was enough to get that query to come back instantly. The same query on
some larger tables was still slow, so we ended up bumping up to 50MB to get
the entire job done. That probably sounds like a lot to you guys, but now
that we know the behavior and what kind of queries we have, I think we're
OK.

(And if you want a shocker: when I looked at our own DB, we've had work_mem
set to 250MB on a lot of our servers and 1GB on our main production machine.
Heh. I've got some PG tuning books here next to me now...)

Thanks!
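(A note for the archives: one way to avoid raising work_mem server-wide is
to bump it only for the job that needs it. A sketch, assuming the orphan
check runs as its own transaction and using the NOT EXISTS form suggested
above:)

    begin;
    set local work_mem = '50MB';  -- reverts automatically at commit/rollback
    select signum from lp.mags
     where signum is not null
       and not exists (select 1 from lp.sigs
                        where lp.sigs.signum = lp.mags.signum);
    commit;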