Re: Any idea on how to improve the statistics estimates for this plan? - Mailing list pgsql-performance

From Jeff Janes
Subject Re: Any idea on how to improve the statistics estimates for this plan?
Msg-id CAMkU=1x9GoZb_6HQf3sF=mN6NEJhwRFsJMO1AwJBh0nDk34sJQ@mail.gmail.com
In response to Any idea on how to improve the statistics estimates for this plan?  (Guillaume Smet <guillaume.smet@gmail.com>)
On Wed, Dec 5, 2012 at 11:39 AM, Guillaume Smet
<guillaume.smet@gmail.com> wrote:
> Hi,
>
> I'm struggling with a query for some time and the major problem of the
> query is that the statistics are way wrong on a particular operation:
>  ->  Nested Loop  (cost=3177.72..19172.84 rows=*2* width=112) (actual
> time=139.221..603.929 rows=*355331* loops=1)
>      Join Filter: (l.location_id = r.location_id)
>      ->  Hash Join  (cost=3177.71..7847.52 rows=*33914* width=108)
> (actual time=138.343..221.852 rows=*36664* loops=1)
>              Hash Cond: (el.location_id = l.location_id)
>              ...
>      ->  Index Scan using idx_test1 on representations r
> (cost=0.01..0.32 rows=*1* width=12) (actual time=0.002..0.008
> rows=*10* loops=36664)
>              ...
> (extracted from the original plan which is quite massive)

Could you reduce the plan size by removing joins that are extraneous
to this specific problem?

> I tried to improve the statistics of l.location_id, el.location_id,
> r.location_id and idx_test1.location_id (up to 5000) but it doesn't
> get better.

If there is a correlation that PostgreSQL is incapable of
understanding, then no amount of increase in the statistics target is
going to help.
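
For reference, the statistics target Guillaume raised is set per
column, roughly like this (the table name here is illustrative, since
we only see the aliases l, el, and r):

```sql
-- Raise the per-column statistics target (default is 100).
-- This only makes the per-column MCV list and histogram finer-grained;
-- it cannot teach the planner about cross-column or cross-table
-- correlation, which is what seems to be the problem here.
ALTER TABLE locations ALTER COLUMN location_id SET STATISTICS 5000;
ANALYZE locations;
```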

>
> Any idea on how I could get better statistics in this particular
> example and why the estimate of the nested loop is so wrong while the
> ones for each individual operations are quite good?

The trivial answer to "why" is that it thinks that the vast majority
of the 33914 rows from the hash join will find no partners in r, but
in fact each has about 10 partners in r.  Why does it think that?
Without seeing all the join conditions and filter conditions on those
tables, plus the size of each unfiltered pair-wise join, it is hard
to speculate.

If you remove all filters (all members of the "where" which are not
join criteria), then what does the plan look like?
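
That is, something along these lines (the query shape is a guess,
since we haven't seen the full statement):

```sql
-- Same joins as the original query, but with every WHERE clause that
-- is not a join condition removed, so we can see whether the join
-- estimates alone are sane.
EXPLAIN ANALYZE
SELECT ...
FROM el
JOIN l ON el.location_id = l.location_id
JOIN representations r ON l.location_id = r.location_id;
```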

If those estimates are better, it probably means that your filter
condition is picking out a part of the "el JOIN l" result whose
selectivity against r is very different from that of the full set,
and PostgreSQL has no way of knowing that.

Cheers,

Jeff

