Re: Incorrect estimates on columns - Mailing list pgsql-performance

From Chris Kratz
Subject Re: Incorrect estimates on columns
Date
Msg-id 200710181603.51151.chris.kratz@vistashare.com
Whole thread Raw
In response to Re: Incorrect estimates on columns  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
On Wednesday 17 October 2007 20:23, Tom Lane wrote:
>  Chris Kratz <chris.kratz@vistashare.com> writes:
> > On Wednesday 17 October 2007 14:49, Tom Lane wrote:
> >> Evidently it's not realizing that every row of par will have a join
> >> partner, but why not?  I suppose a.activityid is unique, and in most
> >> cases that I've seen the code seems to get that case right.
> >>
> >> Would you show us the pg_stats rows for par.activity and a.activityid?
> >
> > Here are the pg_stats rows for par.activity and a.activityid.
>
> Hmm, nothing out of the ordinary there.
>
> I poked at this a bit and realized that what seems to be happening is
> that the a.programid = 171 condition is reducing the selectivity
> estimate --- that is, it knows that that will filter out X percent of
> the activity rows, and it assumes that *the size of the join result will
> be reduced by that same percentage*, since join partners would then be
> missing for some of the par rows.  The fact that the join result doesn't
> actually decrease in size at all suggests that there's some hidden
> correlation between the programid condition and the condition on
> par.provider_lfm.  Is that true?  Maybe you could eliminate one of the
> two conditions from the query?
>
> Since PG doesn't have any cross-table (or even cross-column) statistics
> it's not currently possible for the optimizer to deal very well with
> hidden correlations like this ...
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

Yes, you are correct.  Programid is a "guard" condition to make sure a user
cannot look at rows outside of their program.  In this particular case the
par table only has rows for this agency (at one point in time, all rows were
in one table), so I was able to remove  the check on programid on "a". This
causes my example query to run in 200ms.   That's wonderful.

So, to recap.  We had a filter on the join clause which really didn't in this
case affect the selectivity of the join table.  But the optimizer assumed
that the selectivity would be affected causing it to think the join would
generate only a few rows.  Since it thought that there would be relatively
few rows returned, it used a nestloop instead of another type of join that
would have been faster with larger data sets.

Thanks for all your help.

-Chris

pgsql-performance by date:

Previous
From: John Major
Date:
Subject: Re: How to improve speed of 3 table join &group (HUGE tables)
Next
From: ismo.tuononen@solenovo.fi
Date:
Subject: Re: How to improve speed of 3 table join &group (HUGE tables)