Thread: Bad Plan for Questionnaire-Type Query
I have a query [1] for which Postgres insists on using a Nested Loop, even though a Hash Join is much faster. It seems like the row estimates are way off. I've set default_statistics_target to 250, 500, and 1000 and re-analyzed, but the estimates never seem to improve. If I disable nestloops, the query completes in around 3-5 seconds; with them enabled, it takes anywhere from 45 to 60 seconds. The explain plans are at [2] and [3], and the DDL for the tables and the month_last_day function is at [4]. Any help would be appreciated!

David Blewett

1. http://dpaste.com/hold/41842/
2. http://explain.depesz.com/s/Wg
3. http://explain.depesz.com/s/1s
4. http://dpaste.com/hold/41846/
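For readers wanting to reproduce the two knobs mentioned above, a minimal sketch (session-level settings only; the values are the ones mentioned in the message, nothing here comes from the actual configuration):

    -- Raise the statistics target for this session and re-gather stats
    SET default_statistics_target = 1000;
    ANALYZE;

    -- Workaround: steer the planner away from nested loops for this session
    SET enable_nestloop = off;
    -- ... run the problem query here ...
    RESET enable_nestloop;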
On Thu, May 7, 2009 at 12:53 PM, David Blewett <david@dawninglight.net> wrote:
> 1. http://dpaste.com/hold/41842/
> 2. http://explain.depesz.com/s/Wg
> 3. http://explain.depesz.com/s/1s
> 4. http://dpaste.com/hold/41846/

Forgot to mention that I'm using Postgres 8.3.6 on linux 2.6.24. Shared buffers are set to 1GB, effective_cache_size is set to 3GB. Server has 6GB RAM, running on a SCSI 4-disk RAID10.

David Blewett
David Blewett <david@dawninglight.net> writes:
> On Thu, May 7, 2009 at 12:53 PM, David Blewett <david@dawninglight.net> wrote:
>> 1. http://dpaste.com/hold/41842/
>> 2. http://explain.depesz.com/s/Wg
>> 3. http://explain.depesz.com/s/1s
>> 4. http://dpaste.com/hold/41846/
> Forgot to mention that I'm using Postgres 8.3.6 on linux 2.6.24.

Well, the reason it likes the nestloop plan is the estimate of just one row out of the lower joins --- that case is pretty much always going to favor a nestloop over other kinds of joins. If it were estimating even as few as ten rows out, it'd likely switch to a different plan. So the question to ask is why the rowcount estimates are so abysmally bad. You mentioned having tried to increase the stats targets, but without seeing the actual stats data it's hard to speculate about this.

regards, tom lane
On Thu, May 7, 2009 at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If it were estimating even as few as ten rows out, it'd likely switch to a
> different plan. So the question to ask is why the rowcount estimates are so
> abysmally bad. You mentioned having tried to increase the stats targets, but
> without seeing the actual stats data it's hard to speculate about this.

How do I get that data for you?

David
David Blewett <david@dawninglight.net> writes:
> On Thu, May 7, 2009 at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If it were estimating even as few as ten rows out, it'd likely switch to a
>> different plan. So the question to ask is why the rowcount estimates are so
>> abysmally bad. You mentioned having tried to increase the stats targets, but
>> without seeing the actual stats data it's hard to speculate about this.
> How do I get that data for you?

Look into pg_stats for the rows concerning the columns used in the query's WHERE and JOIN/ON clauses.

regards, tom lane
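In case it helps other readers, a sketch of the sort of pg_stats lookup being asked for here (the table and column names are taken from elsewhere in the thread; substitute the actual WHERE/JOIN columns):

    SELECT tablename, attname, null_frac, n_distinct,
           most_common_vals, most_common_freqs, histogram_bounds, correlation
    FROM pg_stats
    WHERE tablename IN ('canvas_foreignkeyresponse', 'canvas_textresponse',
                        'canvas_submission')
      AND attname IN ('submission_id', 'question_id', 'object_id');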
On Thu, May 7, 2009 at 6:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Look into pg_stats for the rows concerning the columns used in the
> query's WHERE and JOIN/ON clauses.

Okay, here you go: http://rafb.net/p/20y8Oh72.html

David
David Blewett <david@dawninglight.net> writes:
> On Thu, May 7, 2009 at 6:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Look into pg_stats for the rows concerning the columns used in the
>> query's WHERE and JOIN/ON clauses.
> Okay, here you go:
> http://rafb.net/p/20y8Oh72.html

I got some time to poke into this, but didn't get very far --- the joins that seem to be the main problem involve canvas_textresponse.submission_id, which you didn't include stats for.

regards, tom lane
David Blewett <david@dawninglight.net> writes:
> Apparently there was a typo in the query that I didn't notice that
> excluded that table's columns. Here is the new output including it:
> http://pastesite.com/7017

Thanks. Could I trouble you for one other data point --- about how many rows are in each of these tables?

regards, tom lane
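One quick way to get approximate row counts without full table scans is to read them from the catalog (a sketch only; the exact figures David reports below come from his own database):

    SELECT relname, reltuples::bigint AS approx_rows
    FROM pg_class
    WHERE relname IN ('canvas_dateresponse', 'canvas_foreignkeyresponse',
                      'canvas_integerresponse', 'canvas_submission',
                      'canvas_textresponse')
    ORDER BY relname;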
On Fri, May 8, 2009 at 10:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thanks. Could I trouble you for one other data point --- about how many
> rows are in each of these tables?
Not a problem:
canvas_dateresponse 263819
canvas_foreignkeyresponse 646484
canvas_integerresponse 875375
canvas_submission 135949
canvas_textresponse 142698
David
David Blewett <david@dawninglight.net> writes:
> On Fri, May 8, 2009 at 10:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thanks. Could I trouble you for one other data point --- about how many
>> rows are in each of these tables?
> Not a problem:

As best I can tell, the selectivity numbers are about what they should be --- for instance, using these stats I get a selectivity of 0.0000074 for the join clause fkr.submission_id = tr.submission_id. Over the entire relations (646484 and 142698 rows) that's predicting a join size of 683551, which seems to be in the right ballpark (it looks like actually it's one join row per canvas_foreignkeyresponse row, correct?).

The thing that is strange here is that the one-to-one ratio holds up despite strong and apparently uncorrelated restrictions on the relations:

    ->  Hash Join  (cost=1485.69..3109.78 rows=28 width=24) (actual time=5.576..22.737 rows=4035 loops=1)
          Hash Cond: (fkr.submission_id = tr.submission_id)
          ->  Bitmap Heap Scan on canvas_foreignkeyresponse fkr  (cost=14.52..1628.19 rows=580 width=4) (actual time=0.751..4.497 rows=4035 loops=1)
                Recheck Cond: ((question_id = ANY ('{79,1037}'::integer[])) AND (object_id < 3))
                ->  Bitmap Index Scan on canvas_foreignkeyresponse_qv2_idx  (cost=0.00..14.38 rows=580 width=0) (actual time=0.671..0.671 rows=4035 loops=1)
                      Index Cond: ((question_id = ANY ('{79,1037}'::integer[])) AND (object_id < 3))
          ->  Hash  (cost=1388.48..1388.48 rows=6615 width=20) (actual time=4.805..4.805 rows=6694 loops=1)
                ->  Bitmap Heap Scan on canvas_textresponse tr  (cost=131.79..1388.48 rows=6615 width=20) (actual time=0.954..2.938 rows=6694 loops=1)
                      Recheck Cond: (question_id = ANY ('{4,1044}'::integer[]))
                      ->  Bitmap Index Scan on canvas_textresponse_question_id  (cost=0.00..130.14 rows=6615 width=0) (actual time=0.920..0.920 rows=6694 loops=1)
                            Index Cond: (question_id = ANY ('{4,1044}'::integer[]))

How is it that each fkr row matching those question_ids has a join match in tr that has those other two question_ids? It seems like there must be a whole lot of hidden correlation here.

regards, tom lane
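For readers following the arithmetic, the join-size figure comes from the usual formula: estimated join rows ≈ selectivity × outer rows × inner rows. A quick sanity check of the numbers quoted above (illustrative only):

    -- 0.0000074 * 646484 * 142698 lands close to the 683551 the planner predicts
    SELECT round(0.0000074 * 646484 * 142698) AS estimated_join_rows;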
On Sat, May 9, 2009 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As best I can tell, the selectivity numbers are about what they should
> be --- for instance, using these stats I get a selectivity of 0.0000074
> for the join clause fkr.submission_id = tr.submission_id. Over the
> entire relations (646484 and 142698 rows) that's predicting a join size
> of 683551, which seems to be in the right ballpark (it looks like
> actually it's one join row per canvas_foreignkeyresponse row, correct?).
The design is that for each submission_id there are any number of responses of different types. This particular questionnaire has 78 questions, 2 of which are text responses and 28 of which are foreign-key responses. In this case, however, the restrictions on question_id limit the rows returned from those tables to 1 each. So yes, it's one-to-one in this case.
> How is it that each fkr row matching those question_ids has a join match
> in tr that has those other two question_ids? It seems like there must
> be a whole lot of hidden correlation here.
As I mentioned before, they are all linked by the submission_id, which indicates they are part of a single submission against a particular questionnaire (chart_id in the DDL). It is a design that I based on Elein Mustain's Question/Answer problem [1]. This particular query includes two chart_ids because they contain virtually the same data (sets of questions) but have different validation requirements. Does that shed any more light?
Thanks again for the help.
David
1. http://www.varlena.com/GeneralBits/110.php
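For readers without access to the dpaste link, a minimal sketch of the kind of layout being described (the real DDL is in [4] of the first message; the column names and types here are assumptions based on the discussion, not the actual schema):

    CREATE TABLE canvas_submission (
        id       serial PRIMARY KEY,
        chart_id integer NOT NULL   -- which questionnaire this submission answers
    );

    -- One table per response type; each row holds one answer to one question
    CREATE TABLE canvas_textresponse (
        id            serial PRIMARY KEY,
        submission_id integer NOT NULL REFERENCES canvas_submission (id),
        question_id   integer NOT NULL,
        response      text
    );

    CREATE TABLE canvas_foreignkeyresponse (
        id            serial PRIMARY KEY,
        submission_id integer NOT NULL REFERENCES canvas_submission (id),
        question_id   integer NOT NULL,
        object_id     integer NOT NULL   -- the selected foreign object
    );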
On Sat, May 9, 2009 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As best I can tell, the selectivity numbers are about what they should
> be --- for instance, using these stats I get a selectivity of 0.0000074
> for the join clause fkr.submission_id = tr.submission_id. Over the
> entire relations (646484 and 142698 rows) that's predicting a join size
> of 683551, which seems to be in the right ballpark (it looks like
> actually it's one join row per canvas_foreignkeyresponse row, correct?).

I took the time to load this data into an 8.4beta2 install, and the same query runs in a much more reasonable timeframe (~3s as opposed to ~50s). I set the statistics target to 500, and got this explain [1].

David

1. http://explain.depesz.com/s/pw
David Blewett <david@dawninglight.net> writes:
> I took the time to load this data into an 8.4beta2 install, and the same
> query runs in a much more reasonable timeframe (~3s as opposed to ~50s). I
> set the statistics target to 500, and got this explain [1].
> 1. http://explain.depesz.com/s/pw

Hmm... the join size estimates are no better than before, so I'm afraid that 8.4 is just as vulnerable to picking a bad plan as the previous versions were. I don't think you should assume anything's been fixed.

It still feels like this schema design is obscuring correlations that the planner needs to know about in order to make decent estimates. You mentioned earlier that the seemingly unrelated question_ids were linked via a common submission_id. I wonder whether it's possible to query using the submission_id instead?

regards, tom lane
On Sun, May 24, 2009 at 2:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It still feels like this schema design is obscuring correlations that
> the planner needs to know about in order to make decent estimates.
I'm not sure how to make the planner aware of these correlations. Is there something inherently flawed with this design? It seems pretty close to the one on the Varlena website [1].
> You mentioned earlier that the seemingly unrelated question_ids were
> linked via a common submission_id. I wonder whether it's possible to
> query using the submission_id instead?
Well, I do join the different response tables (text/date/etc.) together via the submission_id. However, in order to apply the WHERE clauses appropriately, I have to limit the responses to the appropriate question_ids. Would it matter to push that requirement down into the WHERE clause instead of making it part of the join clause?
David
1. http://www.varlena.com/GeneralBits/110.php
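For concreteness, a hedged sketch of the two formulations David is asking about (the question_id values are taken from the explain output earlier; everything else is illustrative, not the actual query):

    -- question_id filter as part of the join condition
    SELECT fkr.submission_id
    FROM canvas_foreignkeyresponse fkr
    JOIN canvas_textresponse tr
      ON tr.submission_id = fkr.submission_id
     AND tr.question_id IN (4, 1044)
    WHERE fkr.question_id IN (79, 1037);

    -- the same filter pushed down into the WHERE clause
    SELECT fkr.submission_id
    FROM canvas_foreignkeyresponse fkr
    JOIN canvas_textresponse tr
      ON tr.submission_id = fkr.submission_id
    WHERE fkr.question_id IN (79, 1037)
      AND tr.question_id IN (4, 1044);

For a plain inner join the planner treats quals in ON and WHERE equivalently, so moving the filter by itself would not be expected to change the estimates.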
On Mon, May 25, 2009 at 11:22 AM, David Blewett <david@dawninglight.net> wrote:
> On Sun, May 24, 2009 at 2:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It still feels like this schema design is obscuring correlations that
>> the planner needs to know about in order to make decent estimates.
>
> I'm not sure how to make the planner aware of these correlations. Is there
> something inherently flawed with this design? It seems pretty close to the
> one on the Varlena website [1].
>
>> You mentioned earlier that the seemingly unrelated question_ids were
>> linked via a common submission_id. I wonder whether it's possible to
>> query using the submission_id instead?
>
> Well, I do join the different response tables (text/date/etc.) together via
> the submission_id. However, in order to apply the WHERE clauses
> appropriately, I have to limit the responses to the appropriate
> question_ids. Would it matter to push that requirement down into the WHERE
> clause instead of making it part of the join clause?
>
> David
>
> 1. http://www.varlena.com/GeneralBits/110.php

Anyone have thoughts on this? Bueller?

David
David,

My first thought would be to increase statistics dramatically on the filtered columns in hopes of making PG realize there are a lot of rows there; it's off by 8x. Correlation stats are an ongoing issue in PostgreSQL.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com
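A hedged sketch of the per-column version of Josh's suggestion (table and column names are taken from the explain output earlier in the thread; the target value of 1000 is illustrative and is the maximum allowed on 8.3):

    ALTER TABLE canvas_foreignkeyresponse ALTER COLUMN question_id   SET STATISTICS 1000;
    ALTER TABLE canvas_foreignkeyresponse ALTER COLUMN submission_id SET STATISTICS 1000;
    ALTER TABLE canvas_textresponse       ALTER COLUMN question_id   SET STATISTICS 1000;
    ALTER TABLE canvas_textresponse       ALTER COLUMN submission_id SET STATISTICS 1000;
    ANALYZE canvas_foreignkeyresponse;
    ANALYZE canvas_textresponse;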
On Fri, Jun 5, 2009 at 7:32 PM, Josh Berkus <josh@agliodbs.com> wrote:
> My first thought would be to increase statistics dramatically on the
> filtered columns in hopes of making PG realize there are a lot of rows
> there; it's off by 8x. Correlation stats are an ongoing issue in PostgreSQL.

I started at a stats_target of 250, then tried 500; the plan I pasted before resorting to disabling nestloops was at 1000 (re-analyzing in between each change, of course). Will a CLUSTER or REINDEX help at all?

David
On Fri, Jun 5, 2009 at 8:29 PM, David Blewett <david@dawninglight.net> wrote:
> On Fri, Jun 5, 2009 at 7:32 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> My first thought would be to increase statistics dramatically on the
>> filtered columns in hopes of making PG realize there are a lot of rows
>> there; it's off by 8x. Correlation stats are an ongoing issue in PostgreSQL.
>
> I started at a stats_target of 250, then tried 500; the plan I pasted
> before resorting to disabling nestloops was at 1000 (re-analyzing in
> between each change, of course). Will a CLUSTER or REINDEX help at all?

Probably not. Your problem is similar to the one Anne Rosset was complaining about on -performance a couple of days ago, though your case appears to be more complex.

http://archives.postgresql.org/pgsql-performance/2009-06/msg00023.php

It's really not clear what to do about this problem. In Anne's case, it would probably be enough to gather MCVs over the product space of her folder_id and is_deleted columns, but I'm not certain that would help you. It almost seems like we need a way to say "for every distinct value that appears in column X, you need to gather separate statistics for the other columns of the table". But that could make statistics gathering and query planning very expensive.

Another angle of attack, which we've talked about before, is to teach the executor that when a nestloop with a hash-joinable condition executes too many times, it should hash the inner side on the next pass and then switch to a hash join. But none of this helps you very much right now...

...Robert