Re: Strange query optimization in 7.3.2 - Mailing list pgsql-general

From Tom Lane
Subject Re: Strange query optimization in 7.3.2
Date
Msg-id 5249.1050385185@sss.pgh.pa.us
Whole thread Raw
In response to Strange query optimization in 7.3.2  (Alec Mitchell <apm13@columbia.edu>)
Responses Re: Strange query optimization in 7.3.2  (Alec Mitchell <apm13@columbia.edu>)
List pgsql-general
Alec Mitchell <apm13@columbia.edu> writes:
> On Monday 14 April 2003 04:08 pm, Tom Lane wrote:
>> Could you see your way to sending me a dump of these tables for testing
>> purposes?  You could exclude the columns not used in the FROM/WHERE
>> clauses, if that is needed to satisfy privacy worries.

> I've attached dumps of the relevant tables (not including the stops
> table (s), it is huge, and doesn't come into play until after the bad
> planner estimate).

Indeed, there's a problem here with nulls --- there is a case in
eqjoinsel() that didn't account for nulls, and should've.  I've applied
the attached patch to fix it.  This reduces the estimate of the tr/t/m
join size from 1119 to 332, which is still large compared to the true
size of 52, but it's a lot better.  Without the stops table, I can't
really tell if this changes the complete plan or not --- could you apply
the patch and find out?

BTW, the remaining error seems to be because the tr.group_num = 1
constraint skews the fraction of matching "trailer" values.  I fear
there's little chance of persuading the system to figure that out in the
near future --- it can't be done without cross-column correlation
statistics, which we do not keep.

            regards, tom lane

*** src/backend/utils/adt/selfuncs.c.orig    Sat Mar 22 20:49:13 2003
--- src/backend/utils/adt/selfuncs.c    Tue Apr 15 01:13:01 2003
***************
*** 1552,1578 ****
          {
              /*
               * We do not have MCV lists for both sides.  Estimate the join
!              * selectivity as MIN(1/nd1, 1/nd2).  This is plausible if we
!              * assume that the values are about equally distributed: a
!              * given tuple of rel1 will join to either 0 or N2/nd2 rows of
!              * rel2, so total join rows are at most N1*N2/nd2 giving a
!              * join selectivity of not more than 1/nd2.  By the same logic
!              * it is not more than 1/nd1, so MIN(1/nd1, 1/nd2) is an upper
!              * bound.  Using the MIN() means we estimate from the point of
!              * view of the relation with smaller nd (since the larger nd
!              * is determining the MIN).  It is reasonable to assume that
!              * most tuples in this rel will have join partners, so the
!              * bound is probably reasonably tight and should be taken
!              * as-is.
               *
               * XXX Can we be smarter if we have an MCV list for just one
               * side? It seems that if we assume equal distribution for the
               * other side, we end up with the same answer anyway.
               */
              if (nd1 > nd2)
!                 selec = 1.0 / nd1;
              else
!                 selec = 1.0 / nd2;
          }

          if (have_mcvs1)
--- 1552,1584 ----
          {
              /*
               * We do not have MCV lists for both sides.  Estimate the join
!              * selectivity as MIN(1/nd1,1/nd2)*(1-nullfrac1)*(1-nullfrac2).
!              * This is plausible if we assume that the join operator is
!              * strict and the non-null values are about equally distributed:
!              * a given non-null tuple of rel1 will join to either zero or
!              * N2*(1-nullfrac2)/nd2 rows of rel2, so total join rows are at
!              * most N1*(1-nullfrac1)*N2*(1-nullfrac2)/nd2 giving a join
!              * selectivity of not more than (1-nullfrac1)*(1-nullfrac2)/nd2.
!              * By the same logic it is not more than
!              * (1-nullfrac1)*(1-nullfrac2)/nd1, so the expression with MIN()
!              * is an upper bound.  Using the MIN() means we estimate from the
!              * point of view of the relation with smaller nd (since the larger
!              * nd is determining the MIN).  It is reasonable to assume that
!              * most tuples in this rel will have join partners, so the bound
!              * is probably reasonably tight and should be taken as-is.
               *
               * XXX Can we be smarter if we have an MCV list for just one
               * side? It seems that if we assume equal distribution for the
               * other side, we end up with the same answer anyway.
               */
+             double        nullfrac1 = stats1->stanullfrac;
+             double        nullfrac2 = stats2->stanullfrac;
+
+             selec = (1.0 - nullfrac1) * (1.0 - nullfrac2);
              if (nd1 > nd2)
!                 selec /= nd1;
              else
!                 selec /= nd2;
          }

          if (have_mcvs1)


pgsql-general by date:

Previous
From: Neil Conway
Date:
Subject: Re: CVS doesn't compile initdb and other binaries
Next
From: Tony Grant
Date:
Subject: Re: Are we losing momentum?