Re: BUG #5059: Planner ignores estimates when planning an IN () subquery - Mailing list pgsql-bugs

From Kenaniah Cerny
Subject Re: BUG #5059: Planner ignores estimates when planning an IN () subquery
Date
Msg-id f647f4600909162232h27d115f4of2e299177a1d3a67@mail.gmail.com
Whole thread Raw
In response to Re: BUG #5059: Planner ignores estimates when planning an IN () subquery  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-bugs
My apologies, I meant user_activity_log, and not user_anime_log in the
previous email.

This isn't a huge issue as the query could easily be rewritten using a more
pragmatic join structure, but I thought it would be worth mentioning as
occasionally bugs involve more than just what meets the eye. Thank you for
the help.

On Wed, Sep 16, 2009 at 7:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Wed, Sep 16, 2009 at 6:39 PM, Kenaniah Cerny <kenaniah@gmail.com>
> wrote:
> > I can provide the output of statistics queries if you would like. Just
> let
> > me know which statistics you want and I'll pastebin them.
> >
> > As far as selectivity goes, the selectivity estimate for the
> > user_anime_log.user_account_id was definitely miscalculated. The
> > user_anime_log contains up to 15 entries per user (70,000 users, but only
> > 475,811 rows). The default statistics target on that relation is set to
> > 1000. But even with poor statistics, guessing 62,000 rows when there's a
> > maximum of 15 per user account is still quite a miss.
>
> You've changed the table name on me, vs. what you pasted in the query,
> which had a user_activity_log but no user_anime_log...
>
> > Analysis of SELECT * FROM user_activity_log WHERE user_account_id =
> 17238;
> > estimates 13 rows and returns 15, which is quite reasonable considering
> the
> > statistics targets.
> >
> > Please forgive my ignorance, but in the case of my subquery, is the
> > estimated number of rows and cost being taken into account (or only the
> > selectivity)?
>
> Well, selectivity is just a term that refers to the fraction of rows
> that match some condition (rows themselves do not have selectivity).
> Usually the initial estimating is done in terms of selectivity, which
> is then multiplied by the total number of rows to find the number of
> rows that will remain after the condition is applied.
>
> So, yes, rows and cost are taken into account.  The problem here is
> that the planner is mis-estimating the selectivity, therefore it
> computes the wrong number of rows (way too high), therefore it makes
> the wrong decision.
>
> > Granted I don't understand much about the planner internals,
> > but it seems strange that a nested loop semi join would be chosen when
> the
> > inner table is estimated to return an extremely low number of rows with
> low
> > cost and low selectivity. Shouldn't the planner also estimate the cost of
> an
> > inner (er, left?) join in that scenario?
>
> Well... you can't replace a semi join with an inner or left join,
> because it doesn't do the same thing.   You could use a hash semi join
> or merge join semi join, but that doesn't make sense if, as you say,
> the inner table is estimated to return an extremely low number of
> rows.
>
> It might be a bit easier to analyze this if you stripped out all the
> joins that aren't necessary to reproduce the problem.  Also, I would
> prefer EXPLAIN ANALYZE output posted in-line to the mailing list
> rather than pasted to a separate web site - it screws up the
> formatting.
>
> But honestly I'm not sure how much time it's worth spending on this.
> You have a way to rewrite the query that works...  and fixing the
> estimator is going to be hard...  so I suggest doing it the way that
> works, and moving on!
>
> ...Robert
>

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: 답장: bug report
Next
From: Andre Rothe
Date:
Subject: Wrong default values of columns