Re: BUG #5059: Planner ignores estimates when planning an IN () subquery - Mailing list pgsql-bugs

From Robert Haas
Subject Re: BUG #5059: Planner ignores estimates when planning an IN () subquery
Date
Msg-id 603c8f070909161432k3fe8c46i7b36ec3655767c5c@mail.gmail.com
Whole thread Raw
In response to BUG #5059: Planner ignores estimates when planning an IN () subquery  ("Kenaniah Cerny" <kenaniah@gmail.com>)
Responses Re: BUG #5059: Planner ignores estimates when planning an IN () subquery  (Kenaniah Cerny <kenaniah@gmail.com>)
List pgsql-bugs
On Tue, Sep 15, 2009 at 11:35 PM, Kenaniah Cerny <kenaniah@gmail.com> wrote:
>
> The following bug has been logged online:
>
> Bug reference: =A0 =A0 =A05059
> Logged by: =A0 =A0 =A0 =A0 =A0Kenaniah Cerny
> Email address: =A0 =A0 =A0kenaniah@gmail.com
> PostgreSQL version: 8.4.1
> Operating system: =A0 Centos5.2
> Description: =A0 =A0 =A0 =A0Planner ignores estimates when planning an IN=
 ()
> subquery
> Details:
>
> Consider the following query:
>
> http://pgsql.privatepaste.com/aa5DAtiwws
>
> When planning the subquery of the IN () statement, the planner chose to s=
can
> the indexes of the outer and inner columns in parallel using a nested loop
> semi join.
>
> http://pgsql.privatepaste.com/4eXj3zRcy7
>
> By not enabling the planner to sort via the index of the outer column in =
the
> WHERE clause (query above), the a nested loop version of the plan executes
> in a fraction of the time.
>
> http://pgsql.privatepaste.com/5c0bOcL3t6
>
> As you can see from the above query, forcing the materialization of the
> subquery produces a much superior plan.
>
> http://pgsql.privatepaste.com/371nl6KFrI
>
> For comparison, this query replaces the subquery with hard-coded values.
>
> The planner appears to not be weighing the benefits of materializing the
> subquery of the IN () statement properly when ordering is involved, and
> still produces an inferior plan when ordering is not a factor.
>
> Please feel free to contact me for additional test cases if needed.

I've seen this kind of plan before.  The planner knows that a
backwards index scan over user_activity_log is slow, but it thinks
that it won't have to scan very much of it because it will accumulate
enough rows to satisfy the LIMIT before it goes very far.  When it
turns out that there are not as many matching rows in user_friends as
anticipated, things get very slow.

Unfortunately, there's not an easy solution to this problem.  If it
were really true that most rows in user_activity_log had matches in
user_friends, then planner would be correct to choose the plan it did
- and wrong to choose the plan you want it to use.  In this case, your
table is small enough that a bitmap heap scan on user_activity_log is
reasonable, but for a larger table in a situation where the join
selectivity is higher, it might really be necessary to use the index.
The problem is just that the planner is doing a poor job figuring out
where the breakpoint is between those two strategies.

I'm not 100% sure exactly which selectivity estimate is wrong.  It's
not clear to me whether things are already wrong here:

INDEX Scan USING user_friends_user_account_id_friend_id_key ON
user_friends (cost=3D0.00..0.27 rows=3D1 width=3D4) (actual
time=3D0.004..0.004 rows=3D0 loops=3D350713)

But they're definitely wrong here:

Nested Loop Semi JOIN  (cost=3D0.00..144564.33 rows=3D62298 width=3D52)
(actual time=3D5138.961..5405.075 rows=3D10 loops=3D1)

I *think* the root cause of the problem is that we don't have
multi-column statistics.  When you look at user_friends, we can tell
you the selectivity of user_friends.user_account_id =3D 74650 pretty
accurately, and we should also be able to spit out MCVs to compare
against the MCVs of user_activity_log, but if the selectivity of those
two expressions together doesn't resemble the product of their
individual selectivities, which is probably the case here, the planner
has no way to detect that.

...Robert

pgsql-bugs by date:

Previous
From: Merlin Moncure
Date:
Subject: 'missing parameter $1' for sql or pl/pgsql COPY
Next
From: Kenaniah Cerny
Date:
Subject: Re: BUG #5059: Planner ignores estimates when planning an IN () subquery