Thread: subselect requires offset 0 for good performance.
Howdy. I seem to have inherited this problem: http://www.postgresql.org/message-id/EF524E81-B815-4122-A337-7E50BCED514E@silentmedia.com

Basically, a subselect with no offset results in really poor performance (120 s queries), but adding an OFFSET 0 to the inner subselect brings query times down to 0.5 s, with the same output.

In that original thread, Robert Haas asked for a self-contained test case. I am running 8.4.15 and can try 8.4.17 if some patch has been applied to it that addresses this issue. I just want to know whether I should

A: upgrade to 8.4.17
or
B: create a self-contained test case.

--
To understand recursion, one must first understand recursion.
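For readers who haven't seen the trick before: OFFSET 0 in a subquery acts as an optimization fence in PostgreSQL, stopping the planner from flattening the subquery into the outer query. A minimal sketch of the pattern, using a made-up table ("items") rather than the poster's real schema:

-- Without the fence, the planner may flatten the subquery and pick a
-- parameterized nested loop against it:
SELECT *
FROM items i
LEFT JOIN (SELECT item_id FROM items WHERE flag IS NOT NULL) sub
       ON i.item_id = sub.item_id
WHERE sub.item_id IS NULL;

-- OFFSET 0 keeps the subquery as its own step, so it can be computed
-- (and possibly materialized) once instead of being re-run per outer row:
SELECT *
FROM items i
LEFT JOIN (SELECT item_id FROM items WHERE flag IS NOT NULL OFFSET 0) sub
       ON i.item_id = sub.item_id
WHERE sub.item_id IS NULL;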
On Thu, Aug 1, 2013 at 2:40 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Howdy. I seem to have inherited this problem:
>
> http://www.postgresql.org/message-id/EF524E81-B815-4122-A337-7E50BCED514E@silentmedia.com
>
> Basically a subselect with no offset is resulting in really poor
> performance with 120s queries but adding an offset 0 to the inner sub
> select results in 0.5s query times, and I get the same output.
>
> The original answer Robert Haas asks for a self contained test case.
>
> I am running 8.4.15 and can try 8.4.17 if some patch has been applied
> to it to address this issue. I just want to know should I
>
> A: upgrade to 8.4.17
> or
> B: create a self contained test case.

IMNSHO, I would pursue both (unless A solves your problem, in which case B is moot).

merlin
Scott Marlowe <scott.marlowe@gmail.com> writes:
> I am running 8.4.15 and can try 8.4.17 if some patch has been applied
> to it to address this issue. I just want to know should I
> A: upgrade to 8.4.17
> or
> B: create a self contained test case.

A quick look at the release notes shows no planner fixes in 8.4.16 or 8.4.17, so it would be rather surprising if (A) helps.

			regards, tom lane
On Thu, Aug 1, 2013 at 5:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Scott Marlowe <scott.marlowe@gmail.com> writes:
>> I am running 8.4.15 and can try 8.4.17 if some patch has been applied
>> to it to address this issue. I just want to know should I
>
>> A: upgrade to 8.4.17
>> or
>> B: create a self contained test case.
>
> A quick look at the release notes shows no planner fixes in 8.4.16 or
> 8.4.17, so it would be rather surprising if (A) helps.

OK. I was doing some initial testing, and if I select out the 4 columns into a test table the query runs fast. If I select all the columns into a test table it runs slow, so it appears table width affects this. Will have more to report tomorrow on it.
On 08/02/2013 03:22 AM, Scott Marlowe wrote:
> On Thu, Aug 1, 2013 at 5:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Scott Marlowe <scott.marlowe@gmail.com> writes:
>>> I am running 8.4.15 and can try 8.4.17 if some patch has been applied
>>> to it to address this issue. I just want to know should I
>>> A: upgrade to 8.4.17
>>> or
>>> B: create a self contained test case.
>> A quick look at the release notes shows no planner fixes in 8.4.16 or
>> 8.4.17, so it would be rather surprising if (A) helps.
> OK. I was doing some initial testing and if I select out the 4 columns
> into a test table the query runs fast. If I select all the columns
> into a test table it runs slow, so it appears table width affects
> this. Will have more to report tomorrow on it.

I don't know what your query is, but here's one I was working on yesterday that shows the problem. It may not be the smallest test case possible, but it works.

EXPLAIN ANALYZE
WITH RECURSIVE
x (start_time) AS
(
    SELECT generate_series(1, 1000000)
),
t (time, timeround) AS
(
    SELECT time, time - time % 900000 AS timeround
    FROM (SELECT min(start_time) AS time FROM x) AS tmp
    UNION ALL
    SELECT time, time - time % 900000
    FROM (SELECT (SELECT min(start_time) AS time
                  FROM x
                  WHERE start_time >= t.timeround + 900000)
          FROM t
          WHERE t.time IS NOT NULL
          OFFSET 0
         ) tmp
)
SELECT count(*) FROM t WHERE time IS NOT NULL;

If you remove the OFFSET 0, you'll see two more subplans (because "time" is referenced three times). The difference is much more noticeable if you make the x CTE its own table.

Vik

PS: This query is emulating a Loose Index Scan.
http://wiki.postgresql.org/wiki/Loose_indexscan
On Thu, Aug 1, 2013 at 7:22 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Thu, Aug 1, 2013 at 5:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Scott Marlowe <scott.marlowe@gmail.com> writes:
>>> I am running 8.4.15 and can try 8.4.17 if some patch has been applied
>>> to it to address this issue. I just want to know should I
>>
>>> A: upgrade to 8.4.17
>>> or
>>> B: create a self contained test case.
>>
>> A quick look at the release notes shows no planner fixes in 8.4.16 or
>> 8.4.17, so it would be rather surprising if (A) helps.
>
> OK. I was doing some initial testing and if I select out the 4 columns
> into a test table the query runs fast. If I select all the columns
> into a test table it runs slow, so it appears table width affects
> this. Will have more to report tomorrow on it.

Here's the query:

SELECT *
FROM dba.pp_test_wide p
LEFT JOIN (
    SELECT tree_sortkey
    FROM dba.pp_test_wide
    WHERE tree_sortkey BETWEEN '00000000000101010000010001010100'::VARBIT
                           AND public.tree_right('00000000000101010000010001010100'::VARBIT)
      AND product_name IS NOT NULL
      AND tree_sortkey <> '00000000000101010000010001010100'::VARBIT
) pp
  ON p.tree_sortkey BETWEEN pp.tree_sortkey AND public.tree_right(pp.tree_sortkey)
WHERE p.tree_sortkey BETWEEN '00000000000101010000010001010100'::VARBIT
                         AND public.tree_right('00000000000101010000010001010100'::VARBIT)
  AND p.tree_sortkey BETWEEN '00000000'::VARBIT AND public.tree_right('00000000'::VARBIT)
  AND p.deleted_at IS NULL
  AND pp.tree_sortkey IS NULL

I extracted all the data like so:

select * into dba.pp_test_wide from original table;

and get this query plan from explain analyze: http://explain.depesz.com/s/EPx which takes 20 minutes to run.

If I extract it this way:

select tree_sortkey, product_name, deleted_at into db.pp_test_3col from original table;

I get this plan: http://explain.depesz.com/s/gru which gets a materialize in it, and suddenly takes 106 ms. The factor of performance increase is therefore ~11,342, which is pretty huge. I'll try to make a self-contained test case now. Hopefully that at least points in the right direction, though, to a bug of some kind.
Scott Marlowe <scott.marlowe@gmail.com> writes:
> I extracted all the data like so:
> select * into dba.pp_test_wide from original table;
> and get this query plan from explain analyze:
> http://explain.depesz.com/s/EPx which takes 20 minutes to run.
> If I extract it this way:
> select tree_sortkey, product_name, deleted_at into db.pp_test_3col
> from original table;
> I get this plan: http://explain.depesz.com/s/gru which gets a
> materialize in it, and suddenly takes 106 ms.

There's no reason why suppressing some unrelated columns would change the rowcount estimates, but those two plans show different rowcount estimates.

I suspect the *actual* reason for the plan change was that autovacuum had had a chance to update statistics for the one table, and not yet for the other. Please do a manual ANALYZE on both tables and see if there's still a plan difference.

			regards, tom lane
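For anyone following along, the suggested step is just a pair of statements (table names taken from the message quoted above):

-- Refresh planner statistics so row-count estimates come from a fresh
-- ANALYZE rather than whatever autovacuum last collected.
ANALYZE dba.pp_test_wide;
ANALYZE db.pp_test_3col;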
On Fri, Aug 2, 2013 at 1:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Scott Marlowe <scott.marlowe@gmail.com> writes:
>> I extracted all the data like so:
>> select * into dba.pp_test_wide from original table;
>> and get this query plan from explain analyze:
>> http://explain.depesz.com/s/EPx which takes 20 minutes to run.
>> If I extract it this way:
>> select tree_sortkey, product_name, deleted_at into db.pp_test_3col
>> from original table;
>> I get this plan: http://explain.depesz.com/s/gru which gets a
>> materialize in it, and suddenly takes 106 ms.
>
> There's no reason why suppressing some unrelated columns would change the
> rowcount estimates, but those two plans show different rowcount estimates.
>
> I suspect the *actual* reason for the plan change was that autovacuum had
> had a chance to update statistics for the one table, and not yet for the
> other. Please do a manual ANALYZE on both tables and see if there's
> still a plan difference.

Interesting. I ran analyze on both tables, and sure enough the new test table runs fast. Ran analyze on the old table and it still runs slow. The only thing the old table and its plan are missing is the materialize. So what is likely to change from the old table to the new one?

Here's the explain analyze output from the old table and the same query against it: http://explain.depesz.com/s/CtZ and here's the plan with offset 0 in it: http://explain.depesz.com/s/Gug

Note that while the estimates are a bit off, the really huge difference here says to me that some suboptimal method is getting deployed in the background somewhere. Do we need a stack trace?
On Fri, Aug 2, 2013 at 1:58 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Fri, Aug 2, 2013 at 1:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Scott Marlowe <scott.marlowe@gmail.com> writes:
>>> I extracted all the data like so:
>>> select * into dba.pp_test_wide from original table;
>>> and get this query plan from explain analyze:
>>> http://explain.depesz.com/s/EPx which takes 20 minutes to run.
>>> If I extract it this way:
>>> select tree_sortkey, product_name, deleted_at into db.pp_test_3col
>>> from original table;
>>> I get this plan: http://explain.depesz.com/s/gru which gets a
>>> materialize in it, and suddenly takes 106 ms.
>>
>> There's no reason why suppressing some unrelated columns would change the
>> rowcount estimates, but those two plans show different rowcount estimates.
>>
>> I suspect the *actual* reason for the plan change was that autovacuum had
>> had a chance to update statistics for the one table, and not yet for the
>> other. Please do a manual ANALYZE on both tables and see if there's
>> still a plan difference.
>
> Interesting. I ran analyze on both tables and sure enough the new test
> table runs fast. Ran analyze on the old table and it runs slow. The
> only thing the old table and its plan are missing is the materialize.
> So what is likely to change from the old table to the new one? Here's
> the explain analyze output from the old table and the same query
> against it: http://explain.depesz.com/s/CtZ and here's the plan with
> offset 0 in it: http://explain.depesz.com/s/Gug note that while the
> estimates are a bit off, the really huge difference here says to me
> some suboptimal method is getting deployed in the background
> somewhere. Do we need a stack trace?

So as a followup, I ran vacuum verbose analyze on the original table, thinking it might be bloated, but it wasn't. Out of 320k or so rows there were 4k dead tuples recovered, and none that it couldn't recover.

So next I tried to recreate the original table with a select into, with an order by random() on the end. Nope, it gets a materialize in it and runs fast. Well, it's danged hard to make a test case when copying the table with random ordering results in a much faster query against the same data. I was at a loss on how to reproduce this. Are the indexes on the master table leading it astray, maybe?

Yep. Added the indexes and performance went right into the dumper. The new plan on the new table, with the old data added in random order, now looks like the old table's, only worse because it's on a slower drive. Just to be complete, here's the plan: http://explain.depesz.com/s/PYH Note that I created the new table with order by random() and created indexes. Ran analyze on it, and the select plan looks similar now: http://explain.depesz.com/s/bsE

So maybe I can make a test case now. But to summarize: when it can use indexes, this query gets REAL slow because it lacks a materialize step. Does that seem about right?
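As a rough sketch, the reproduction described above amounts to something like the following; the copy-table and index names here are invented for illustration, and only pp_test_wide and tree_sortkey come from the thread:

-- Copy the data in random physical order.
SELECT *
INTO dba.pp_test_wide_copy
FROM dba.pp_test_wide
ORDER BY random();

-- Adding an index like the one on the master table is what flipped the
-- plan from the fast materialize version back to the slow nested loop.
CREATE INDEX pp_test_wide_copy_tree_sortkey_idx
    ON dba.pp_test_wide_copy (tree_sortkey);

ANALYZE dba.pp_test_wide_copy;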
Scott Marlowe <scott.marlowe@gmail.com> writes:
> Yep. Added the indexes and performance went right into the dumper. New
> plan on new table with old data added in random order now looks like
> the old table, only worse because it's on a slower drive. Just to be
> complete here's the plan: http://explain.depesz.com/s/PYH Note that I
> created new table with order by random() and created indexes. Ran
> analyze on it, and the select plan looks similar now:
> http://explain.depesz.com/s/bsE

> So maybe I can make a test case now. But to summarize, when it can use
> indexes this query gets REAL slow because it lacks a materialize step.
> That seem about right?

Well, the plans shown here could *not* use a materialize step because the inner scan makes use of a value from the current outer row. The materialized plan has to omit one of the index conditions from the inner scan and then apply it as a join condition.

I suspect the real reason that the fast case is fast is that the inner relation, even without the p.tree_sortkey >= pro_partners.tree_sortkey condition, is empty, and thus the join runs very quickly. But the planner doesn't know that. Its estimate of the row count isn't very large, but it's definitely not zero, plus it thinks that adding the additional index condition reduces the rowcount by a factor of 3 from there. So it comes to the wrong conclusion about the value of materializing a fixed inner relation as opposed to using a parameterized indexscan.

Have you tried increasing the statistics targets for these join columns? It's also possible that what you need to do is adjust the planner's cost parameters ...

			regards, tom lane
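Spelled out as SQL, the two suggestions look roughly like this; the column name and the values are examples only, not recommendations:

-- Raise the statistics target on the join column, then re-ANALYZE so the
-- planner has a finer-grained histogram to estimate from.
ALTER TABLE dba.pp_test_wide ALTER COLUMN tree_sortkey SET STATISTICS 1000;
ANALYZE dba.pp_test_wide;

-- Planner cost parameters can be changed per session to see whether they
-- move the plan (placeholder values).
SET random_page_cost = 2.0;
SET cpu_tuple_cost = 0.01;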
On Fri, Aug 2, 2013 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Scott Marlowe <scott.marlowe@gmail.com> writes:
>> Yep. Added the indexes and performance went right into the dumper. New
>> plan on new table with old data added in random order now looks like
>> the old table, only worse because it's on a slower drive. Just to be
>> complete here's the plan: http://explain.depesz.com/s/PYH Note that I
>> created new table with order by random() and created indexes. Ran
>> analyze on it, and the select plan looks similar now:
>> http://explain.depesz.com/s/bsE
>
>> So maybe I can make a test case now. But to summarize, when it can use
>> indexes this query gets REAL slow because it lacks a materialize step.
>> That seem about right?
>
> Well, the plans shown here could *not* use a materialize step because the
> inner scan makes use of a value from the current outer row. The
> materialized plan has to omit one of the index conditions from the inner
> scan and then apply it as a join condition.
>
> I suspect the real reason that the fast case is fast is that the inner
> relation, even without the p.tree_sortkey >= pro_partners.tree_sortkey
> condition, is empty, and thus the join runs very quickly. But the planner
> doesn't know that. Its estimate of the row count isn't very large, but
> it's definitely not zero, plus it thinks that adding the additional index
> condition reduces the rowcount by a factor of 3 from there. So it comes
> to the wrong conclusion about the value of materializing a fixed inner
> relation as opposed to using a parameterized indexscan.
>
> Have you tried increasing the statistics targets for these join columns?
> It's also possible that what you need to do is adjust the planner's
> cost parameters ...

I've tried changing random_page_cost, seq_page_cost, the cpu* costs, and setting effective_cache_size all over the place, and it stays just as slow.

Our default stats target is 100. Did a stats target = 1000 on the three cols we access. Same terrible performance. Plan here: http://explain.depesz.com/s/XVt
With stats target = 10000, same bad performance, plan: http://explain.depesz.com/s/kJ54 -- pretty much the same. Setting effective_cache_size='1000GB' makes no difference, still slow.

If I set random_page_cost to 75 it works, i.e. a materialize shows up. Note that we run on FusionIO cards, and the whole db fits in memory, so a very large effective cache size and a random page cost of 1.0 is actually accurate for our hardware.
On Fri, Aug 2, 2013 at 3:27 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Fri, Aug 2, 2013 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Scott Marlowe <scott.marlowe@gmail.com> writes:
>>> Yep. Added the indexes and performance went right into the dumper. New
>>> plan on new table with old data added in random order now looks like
>>> the old table, only worse because it's on a slower drive. Just to be
>>> complete here's the plan: http://explain.depesz.com/s/PYH Note that I
>>> created new table with order by random() and created indexes. Ran
>>> analyze on it, and the select plan looks similar now:
>>> http://explain.depesz.com/s/bsE
>>
>>> So maybe I can make a test case now. But to summarize, when it can use
>>> indexes this query gets REAL slow because it lacks a materialize step.
>>> That seem about right?
>>
>> Well, the plans shown here could *not* use a materialize step because the
>> inner scan makes use of a value from the current outer row. The
>> materialized plan has to omit one of the index conditions from the inner
>> scan and then apply it as a join condition.
>>
>> I suspect the real reason that the fast case is fast is that the inner
>> relation, even without the p.tree_sortkey >= pro_partners.tree_sortkey
>> condition, is empty, and thus the join runs very quickly. But the planner
>> doesn't know that. Its estimate of the row count isn't very large, but
>> it's definitely not zero, plus it thinks that adding the additional index
>> condition reduces the rowcount by a factor of 3 from there. So it comes
>> to the wrong conclusion about the value of materializing a fixed inner
>> relation as opposed to using a parameterized indexscan.
>>
>> Have you tried increasing the statistics targets for these join columns?
>> It's also possible that what you need to do is adjust the planner's
>> cost parameters ...
>
> I've tried changing random_page_cost, sequential_page_cost, the cpu*
> costs, and setting effective_cache_size all over the place and it
> stays just as slow.
>
> our default stats target is 100. Did a stats target = 1000 on the
> three cols we access. Same terrible performance. Plan here:
> http://explain.depesz.com/s/XVt
> stats target=10000, same bad performance, plan:
> http://explain.depesz.com/s/kJ54 pretty much the same. Setting
> effective_cache_size='1000GB' make no difference, still slow.
>
> If I set random_page_cost to 75 makes it work, i.e. a materialize
> shows up. Note that we run on FusionIO cards, and the whole db fits in
> memory, so a very large effective cache size and random page cost of
> 1.0 is actually accurate for our hardware.

OK, I've done some more research.
Here's the plan for it with a low random page cost:

Nested Loop Anti Join  (cost=0.00..1862107.69 rows=9274 width=994) (actual time=0.181..183413.479 rows=17391 loops=1)
  Join Filter: (p.tree_sortkey <= tree_right(pp_test_wide.tree_sortkey))
  ->  Index Scan using pp_test_wide_tree_sortkey_idx on pp_test_wide p  (cost=0.00..10108.15 rows=10433 width=983) (actual time=0.164..218.230 rows=17391 loops=1)
        Index Cond: ((tree_sortkey >= B'00000000000101010000010001010100'::bit varying) AND (tree_sortkey <= B'0000000000010101000001000101010011111111111111111111111111111111'::bit varying) AND (tree_sortkey >= B'00000000'::bit varying) AND (tree_sortkey <= B'0000000011111111111111111111111111111111'::bit varying))
        Filter: (deleted_at IS NULL)
  ->  Index Scan using pp_test_wide_tree_sortkey_idx on pp_test_wide  (cost=0.00..194.86 rows=14 width=11) (actual time=10.530..10.530 rows=0 loops=17391)
        Index Cond: ((pp_test_wide.tree_sortkey >= B'00000000000101010000010001010100'::bit varying) AND (pp_test_wide.tree_sortkey <= B'0000000000010101000001000101010011111111111111111111111111111111'::bit varying) AND (p.tree_sortkey >= pp_test_wide.tree_sortkey))
        Filter: ((pp_test_wide.product_name IS NOT NULL) AND (pp_test_wide.tree_sortkey <> B'00000000000101010000010001010100'::bit varying))
Total runtime: 183421.226 ms

Note that the bottom half gets 0 rows over 17391 loops, at about 10 ms each, which adds up to just under the total time of the nested loop anti-join, ~183,000 ms.

If I crank up random_page_cost to, say, 100 I get this plan:

Nested Loop Anti Join  (cost=15202.10..144269.58 rows=9292 width=997) (actual time=71.281..271.018 rows=17391 loops=1)
  Join Filter: ((p.tree_sortkey >= pp_test_wide.tree_sortkey) AND (p.tree_sortkey <= tree_right(pp_test_wide.tree_sortkey)))
  ->  Seq Scan on pp_test_wide p  (cost=0.00..16009.92 rows=10453 width=986) (actual time=0.024..183.341 rows=17391 loops=1)
        Filter: ((deleted_at IS NULL) AND (tree_sortkey >= B'00000000000101010000010001010100'::bit varying) AND (tree_sortkey <= B'0000000000010101000001000101010011111111111111111111111111111111'::bit varying) AND (tree_sortkey >= B'00000000'::bit varying) AND (tree_sortkey <= B'0000000011111111111111111111111111111111'::bit varying))
  ->  Materialize  (cost=15202.10..15202.54 rows=44 width=11) (actual time=0.004..0.004 rows=0 loops=17391)
        ->  Seq Scan on pp_test_wide  (cost=0.00..15202.06 rows=44 width=11) (actual time=71.245..71.245 rows=0 loops=1)
              Filter: ((product_name IS NOT NULL) AND (tree_sortkey >= B'00000000000101010000010001010100'::bit varying) AND (tree_sortkey <= B'0000000000010101000001000101010011111111111111111111111111111111'::bit varying) AND (tree_sortkey <> B'00000000000101010000010001010100'::bit varying))
Total runtime: 272.204 ms

Note that here we feed two seq scans into the anti-join, and both seq scans run quickly. I've played around with the selectivity to where the inner scan delivers 53 rows instead of none, and the query performance was about the same for both plans (i.e. random_page_cost = 1 is super slow with lots of 10 ms loops, random_page_cost = 100 is fast with two seq scans).

It seems to me that deciding to do the index scan is a mistake whether the lower portion estimates 14 rows or 0 rows, since it is going to run an index scan's worth of loops driven by the upper index scan. I.e. it should be guesstimating that it's going to do ~10,000 loops of the inner index scan. It's like the query planner somehow thinks that each loop is going to be MUCH faster than 10 ms.
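A session-local way to flip between the two plans above without touching the server configuration (a sketch; substitute the anti-join query posted earlier in the thread for the placeholder comments):

BEGIN;
SET LOCAL random_page_cost = 1.0;   -- the value that matches the FusionIO hardware
-- EXPLAIN ANALYZE <the anti-join query against dba.pp_test_wide>;
ROLLBACK;

BEGIN;
SET LOCAL random_page_cost = 100;   -- the artificially high value that produces the materialize plan
-- EXPLAIN ANALYZE <same query>;
ROLLBACK;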
I can create an extract of this table for testing if someone wants one to explore with. Let me know.
On 08/02/2013 09:37 AM, Vik Fearing wrote:
> EXPLAIN ANALYZE
> WITH RECURSIVE
> x (start_time) AS
> (
>     SELECT generate_series(1, 1000000)
> ),
> t (time, timeround) AS
> (
>     SELECT time, time - time % 900000 AS timeround
>     FROM (SELECT min(start_time) AS time FROM x) AS tmp
>     UNION ALL
>     SELECT time, time - time % 900000
>     FROM (SELECT (SELECT min(start_time) AS time
>                   FROM x
>                   WHERE start_time >= t.timeround + 900000)
>           FROM t
>           WHERE t.time IS NOT NULL OFFSET 0
>          ) tmp
> )
> SELECT count(*) FROM t WHERE time IS NOT NULL;
>
> If you remove the OFFSET 0, you'll see two more subplans (because "time"
> is referenced three times).

Is this not interesting to anyone?

Vik
OK, I'm bumping this one last time in the hopes that someone has an idea what to do to fix it.

Query plan: http://explain.depesz.com/s/kJ54

This query takes 180 seconds. It loops 17391 times across the lower index using entries from the upper index scan. That seems buggy to me. While the exact estimates are off, the fact that it estimates 10,433 rows in the upper index scan means it's expecting to do that many loops over the bottom index scan.

This is one of those queries where adding an OFFSET 0 to the inner select reduces run time from 180 s to 0.2 s. And I can't imagine a plan that thinks running 10k loops of an index scan is a good idea.
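For completeness, here is what the OFFSET 0 workaround looks like when applied to the query posted earlier in the thread. This is a reconstruction from the earlier messages, not a query the poster shared verbatim; the OFFSET 0 at the end of the derived table is the only change:

SELECT *
FROM dba.pp_test_wide p
LEFT JOIN (
    SELECT tree_sortkey
    FROM dba.pp_test_wide
    WHERE tree_sortkey BETWEEN '00000000000101010000010001010100'::VARBIT
                           AND public.tree_right('00000000000101010000010001010100'::VARBIT)
      AND product_name IS NOT NULL
      AND tree_sortkey <> '00000000000101010000010001010100'::VARBIT
    OFFSET 0  -- optimization fence: keeps the subquery from being flattened
) pp
  ON p.tree_sortkey BETWEEN pp.tree_sortkey AND public.tree_right(pp.tree_sortkey)
WHERE p.tree_sortkey BETWEEN '00000000000101010000010001010100'::VARBIT
                         AND public.tree_right('00000000000101010000010001010100'::VARBIT)
  AND p.tree_sortkey BETWEEN '00000000'::VARBIT AND public.tree_right('00000000'::VARBIT)
  AND p.deleted_at IS NULL
  AND pp.tree_sortkey IS NULL;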
Scott Marlowe <scott.marlowe@gmail.com> writes:
> OK I'm bumping this one last time in the hopes that someone has an
> idea what to do to fix it.

> Query plan: http://explain.depesz.com/s/kJ54

> This query takes 180 seconds. It loops 17391 times across the lower
> index using entries from the upper index. That seems buggy to me.

There isn't all that much that the planner can do with that query. There are no equality join clauses, so no possibility of a merge or hash join; the only way to implement the join is a nestloop.

Things would probably be better if it left out the one join clause it's putting into the inner indexscan condition, so it could materialize the result of the inner indexscan and then do a nestloop join against the Material node. I'd expect 9.0 and up to consider that a good idea ... but looking back, I see this is 8.4, which means you're probably out of luck on getting a better plan. 8.4's nearly out of warranty anyway --- consider upgrading.

			regards, tom lane
On Tue, Aug 13, 2013 at 4:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Scott Marlowe <scott.marlowe@gmail.com> writes:
>> OK I'm bumping this one last time in the hopes that someone has an
>> idea what to do to fix it.
>
>> Query plan: http://explain.depesz.com/s/kJ54
>
>> This query takes 180 seconds. It loops 17391 times across the lower
>> index using entries from the upper index. That seems buggy to me.
>
> There isn't all that much that the planner can do with that query. There
> are no equality join clauses, so no possibility of a merge or hash join;
> the only way to implement the join is a nestloop.
>
> Things would probably be better if it left out the one join clause it's
> putting into the inner indexscan condition, so it could materialize the
> result of the inner indexscan and then do a nestloop join against the
> Material node. I'd expect 9.0 and up to consider that a good idea ...
> but looking back, I see this is 8.4, which means you're probably out of
> luck on getting a better plan. 8.4's nearly out of warranty anyway ---
> consider upgrading.

Thanks for the hints, we'll try them. As for the upgrade, a move to 9.1, possibly 9.2, is already in the planning stages. But you know how production upgrades go, slowly...