Thread: Re: [HACKERS] Enhanced containment selectivity function

Re: [HACKERS] Enhanced containment selectivity function

From
Bruce Momjian
Date:
Cleaned-up patch attached and applied.   Catalog version updated in
separate patch.

---------------------------------------------------------------------------

Matteo Beccati wrote:
> Hi,
>
> >>Moving it in contrib/ltree would be more difficult to me because it
> >>depends on other functions declared in selfuncs.c
> >>(get_restriction_variable, etc).
> >
> > I'd be willing to consider exporting those functions from selfuncs.c.
>
> In the meanwhile here is the latest patch which uses both mcv and
> histogram values.
>
>
> BTW, when restoring my test database I've found out that there were many
> errors on ALTER INDEX "something" OWNER TO ... :
>
> ERROR:  "something" is not a table, view, or sequence
>
> This using 8.1devel pg_restore and a 8.0.3 compressed dump. I could be
> wrong, but I didn't get those errors a few days ago (some cvs updates ago).
>
>
> Best regards
> --
> Matteo Beccati
> http://phpadsnew.com/
> http://phppgads.com/

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Index: contrib/ltree/ltree.sql.in
===================================================================
RCS file: /cvsroot/pgsql/contrib/ltree/ltree.sql.in,v
retrieving revision 1.10
diff -c -c -r1.10 ltree.sql.in
*** contrib/ltree/ltree.sql.in    27 Feb 2006 16:09:48 -0000    1.10
--- contrib/ltree/ltree.sql.in    26 Apr 2006 18:25:16 -0000
***************
*** 230,236 ****
      RIGHTARG = ltree,
      PROCEDURE = ltree_isparent,
          COMMUTATOR = '<@',
!         RESTRICT = contsel,
      JOIN = contjoinsel
  );

--- 230,236 ----
      RIGHTARG = ltree,
      PROCEDURE = ltree_isparent,
          COMMUTATOR = '<@',
!         RESTRICT = parentsel,
      JOIN = contjoinsel
  );

***************
*** 248,254 ****
      RIGHTARG = ltree,
      PROCEDURE = ltree_risparent,
          COMMUTATOR = '@>',
!         RESTRICT = contsel,
      JOIN = contjoinsel
  );

--- 248,254 ----
      RIGHTARG = ltree,
      PROCEDURE = ltree_risparent,
          COMMUTATOR = '@>',
!         RESTRICT = parentsel,
      JOIN = contjoinsel
  );

Index: src/backend/utils/adt/geo_selfuncs.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/geo_selfuncs.c,v
retrieving revision 1.27
diff -c -c -r1.27 geo_selfuncs.c
*** src/backend/utils/adt/geo_selfuncs.c    5 Mar 2006 15:58:42 -0000    1.27
--- src/backend/utils/adt/geo_selfuncs.c    26 Apr 2006 18:25:21 -0000
***************
*** 20,26 ****

  #include "utils/geo_decls.h"

-
  /*
   *    Selectivity functions for geometric operators.    These are bogus -- unless
   *    we know the actual key distribution in the index, we can't make a good
--- 20,25 ----
***************
*** 93,95 ****
--- 92,95 ----
  {
      PG_RETURN_FLOAT8(0.001);
  }
+
Index: src/backend/utils/adt/selfuncs.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/selfuncs.c,v
retrieving revision 1.199
diff -c -c -r1.199 selfuncs.c
*** src/backend/utils/adt/selfuncs.c    20 Apr 2006 17:50:18 -0000    1.199
--- src/backend/utils/adt/selfuncs.c    26 Apr 2006 18:25:28 -0000
***************
*** 4852,4854 ****
--- 4852,5033 ----

      PG_RETURN_VOID();
  }
+
+
+ #define DEFAULT_PARENT_SEL 0.001
+
+ /*
+  *    parentsel - Selectivity of parent relationship for ltree data types.
+  */
+ Datum
+ parentsel(PG_FUNCTION_ARGS)
+ {
+     PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+     Oid            operator = PG_GETARG_OID(1);
+     List       *args = (List *) PG_GETARG_POINTER(2);
+     int            varRelid = PG_GETARG_INT32(3);
+     VariableStatData vardata;
+     Node       *other;
+     bool        varonleft;
+     Datum       *values;
+     int            nvalues;
+     float4       *numbers;
+     int            nnumbers;
+     double        selec = 0.0;
+
+     /*
+      * If expression is not variable <@ something or something <@ variable,
+      * then punt and return a default estimate.
+      */
+     if (!get_restriction_variable(root, args, varRelid,
+                                   &vardata, &other, &varonleft))
+         PG_RETURN_FLOAT8(DEFAULT_PARENT_SEL);
+
+     /*
+      * If the something is a NULL constant, assume operator is strict and
+      * return zero, ie, operator will never return TRUE.
+      */
+     if (IsA(other, Const) &&
+         ((Const *) other)->constisnull)
+     {
+         ReleaseVariableStats(vardata);
+         PG_RETURN_FLOAT8(0.0);
+     }
+
+     if (HeapTupleIsValid(vardata.statsTuple))
+     {
+         Form_pg_statistic stats;
+         double        mcvsum = 0.0;
+         double        mcvsel = 0.0;
+         double        hissel = 0.0;
+
+         stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
+
+         if (IsA(other, Const))
+         {
+             /* Variable is being compared to a known non-null constant */
+             Datum        constval = ((Const *) other)->constvalue;
+             bool        match = false;
+             int            i;
+
+             /*
+              * Is the constant "<@" to any of the column's most common values?
+              */
+             if (get_attstatsslot(vardata.statsTuple,
+                                  vardata.atttype, vardata.atttypmod,
+                                  STATISTIC_KIND_MCV, InvalidOid,
+                                  &values, &nvalues,
+                                  &numbers, &nnumbers))
+             {
+                 FmgrInfo    contproc;
+
+                 fmgr_info(get_opcode(operator), &contproc);
+
+                 for (i = 0; i < nvalues; i++)
+                 {
+                     /* be careful to apply operator right way 'round */
+                     if (varonleft)
+                         match = DatumGetBool(FunctionCall2(&contproc,
+                                                            values[i],
+                                                            constval));
+                     else
+                         match = DatumGetBool(FunctionCall2(&contproc,
+                                                            constval,
+                                                            values[i]));
+
+                     /* calculate total selectivity of all most-common-values */
+                     mcvsum += numbers[i];
+
+                     /* calculate selectivity of matching most-common-values */
+                     if (match)
+                         mcvsel += numbers[i];
+                 }
+             }
+             else
+             {
+                 /* no most-common-values info available */
+                 values = NULL;
+                 numbers = NULL;
+                 i = nvalues = nnumbers = 0;
+             }
+
+             free_attstatsslot(vardata.atttype, values, nvalues, NULL, 0);
+
+             /*
+              * Is the constant "<@" to any of the column's histogram values?
+              */
+             if (get_attstatsslot(vardata.statsTuple,
+                                  vardata.atttype, vardata.atttypmod,
+                                  STATISTIC_KIND_HISTOGRAM, InvalidOid,
+                                  &values, &nvalues,
+                                  NULL, NULL))
+             {
+                 FmgrInfo    contproc;
+
+                 fmgr_info(get_opcode(operator), &contproc);
+
+                 for (i = 0; i < nvalues; i++)
+                 {
+                     /* be careful to apply operator right way 'round */
+                     if (varonleft)
+                         match = DatumGetBool(FunctionCall2(&contproc,
+                                                            values[i],
+                                                            constval));
+                     else
+                         match = DatumGetBool(FunctionCall2(&contproc,
+                                                            constval,
+                                                            values[i]));
+                     /* count matching histogram values */
+                     if (match)
+                         hissel++;
+                 }
+
+                 if (hissel > 0.0)
+                 {
+                     /*
+                      * some matching values found inside histogram, divide
+                      * matching entries number by total histogram entries to
+                      * get the histogram related selectivity
+                      */
+                     hissel /= nvalues;
+                 }
+             }
+             else
+             {
+                 /* no histogram info available */
+                 values = NULL;
+                 i = nvalues = 0;
+             }
+
+             free_attstatsslot(vardata.atttype, values, nvalues,
+                               NULL, 0);
+
+
+             /*
+              * calculate selectivity based on MCV and histogram result
+              * histogram selectivity needs to be scaled down if there are any
+              * most-common-values
+              */
+             selec = mcvsel + hissel * (1.0 - mcvsum);
+
+             /*
+              * don't return 0.0 selectivity unless all table values are inside
+              * mcv
+              */
+             if (selec == 0.0 && mcvsum != 1.0)
+                 selec = DEFAULT_PARENT_SEL;
+         }
+         else
+             selec = DEFAULT_PARENT_SEL;
+     }
+     else
+         selec = DEFAULT_PARENT_SEL;
+
+     ReleaseVariableStats(vardata);
+
+     /* result should be in range, but make sure... */
+     CLAMP_PROBABILITY(selec);
+
+     PG_RETURN_FLOAT8((float8) selec);
+ }
+
Index: src/include/catalog/pg_proc.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/catalog/pg_proc.h,v
retrieving revision 1.406
diff -c -c -r1.406 pg_proc.h
*** src/include/catalog/pg_proc.h    25 Apr 2006 00:25:20 -0000    1.406
--- src/include/catalog/pg_proc.h    26 Apr 2006 18:25:35 -0000
***************
*** 3812,3817 ****
--- 3812,3819 ----
  DESCR("GiST support");
  DATA(insert OID = 2592 (  gist_circle_compress    PGNSP PGUID 12 f f t f i 1 2281 "2281" _null_ _null_ _null_
gist_circle_compress- _null_ )); 
  DESCR("GiST support");
+ DATA(insert OID = 2599 (  parentsel              PGNSP PGUID 12 f f t f s 4 701 "2281 26 2281 23" _null_ _null_
_null_parentsel - _null_ )); 
+ DESCR("enhanced restriction selectivity for ltree isparent comparison operators");


  /*
Index: src/include/utils/selfuncs.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/utils/selfuncs.h,v
retrieving revision 1.28
diff -c -c -r1.28 selfuncs.h
*** src/include/utils/selfuncs.h    5 Mar 2006 15:59:07 -0000    1.28
--- src/include/utils/selfuncs.h    26 Apr 2006 18:25:35 -0000
***************
*** 134,137 ****
--- 134,139 ----
  extern Datum hashcostestimate(PG_FUNCTION_ARGS);
  extern Datum gistcostestimate(PG_FUNCTION_ARGS);

+ extern Datum parentsel(PG_FUNCTION_ARGS);
+
  #endif   /* SELFUNCS_H */

Re: [HACKERS] Enhanced containment selectivity function

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Cleaned-up patch attached and applied.

Please revert this patch.  This has not been updated to satisfy the
previous agreement about how it should work.  It is completely
inappropriate to be dropping code that's specific to one contrib module
into the core selfuncs.c file.  What we had agreed to do was look at
exporting some of the currently-static functions in selfuncs.c so that
contrib modules could make use of them from outside the core --- but
this patch doesn't do that.

            regards, tom lane

Re: [HACKERS] Enhanced containment selectivity function

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Cleaned-up patch attached and applied.
>
> Please revert this patch.  This has not been updated to satisfy the
> previous agreement about how it should work.  It is completely
> inappropriate to be dropping code that's specific to one contrib module
> into the core selfuncs.c file.  What we had agreed to do was look at
> exporting some of the currently-static functions in selfuncs.c so that
> contrib modules could make use of them from outside the core --- but
> this patch doesn't do that.

OK, reverted, but I saw it using contsel() so I figured we were allowing
it, but I see contsel() is used by our "box", so ltree was just using
something that was already there.  Let me see if I can break out the new
selectivity function into /contrib.

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [HACKERS] Enhanced containment selectivity function

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> OK, reverted, but I saw it using contsel() so I figured we were allowing
> it, but I see contsel() is used by our "box", so ltree was just using
> something that was already there.  Let me see if I can break out the new
> selectivity function into /contrib.

What really needs to happen next is to think about which bits of
selfuncs.c should be exposed --- what's generally useful, and do we
think that it has an API clean/stable enough to expose?  The reason I
kept all that stuff static so far was because it got whacked around
every release or two, and I didn't want to be constrained by worries
about breaking outside modules.  I'd still prefer to minimize the number
of routines exposed, so some thought is needed.

            regards, tom lane

Re: [HACKERS] Enhanced containment selectivity function

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > OK, reverted, but I saw it using contsel() so I figured we were allowing
> > it, but I see contsel() is used by our "box", so ltree was just using
> > something that was already there.  Let me see if I can break out the new
> > selectivity function into /contrib.
>
> What really needs to happen next is to think about which bits of
> selfuncs.c should be exposed --- what's generally useful, and do we
> think that it has an API clean/stable enough to expose?  The reason I
> kept all that stuff static so far was because it got whacked around
> every release or two, and I didn't want to be constrained by worries
> about breaking outside modules.  I'd still prefer to minimize the number
> of routines exposed, so some thought is needed.

Well, the ltree routine just needs struct VariableStatData and
macro ReleaseVariableStats(), and we need to change one function from
static to global.  That and some #includes, and it works.  Do we want to
be more formal about it?

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [HACKERS] Enhanced containment selectivity function

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> What really needs to happen next is to think about which bits of
>> selfuncs.c should be exposed --- what's generally useful, and do we
>> think that it has an API clean/stable enough to expose?

> Well, the ltree routine just needs struct VariableStatData and
> macro ReleaseVariableStats(), and we need to change one function from
> static to global.  That and some #includes, and it works.  Do we want to
> be more formal about it?

If we're going to expose VariableStatData then I think the minimum
reasonable set of supporting functions would be examine_variable(),
get_restriction_variable(), get_join_variable().  Matteo's patch only
uses the second one but that's because he's only implementing one type
of estimator (no joinsel function for instance).

I'd be inclined to create a new header file, too, instead of cluttering
builtins.h with this stuff.  utils/selfuncs.h probably.

I'm willing to work on this, but it doesn't look like you reverted the
prior patch yet?

            regards, tom lane

Re: [HACKERS] Enhanced containment selectivity function

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> What really needs to happen next is to think about which bits of
> >> selfuncs.c should be exposed --- what's generally useful, and do we
> >> think that it has an API clean/stable enough to expose?
>
> > Well, the ltree routine just needs struct VariableStatData and
> > macro ReleaseVariableStats(), and we need to change one function from
> > static to global.  That and some #includes, and it works.  Do we want to
> > be more formal about it?
>
> If we're going to expose VariableStatData then I think the minimum
> reasonable set of supporting functions would be examine_variable(),
> get_restriction_variable(), get_join_variable().  Matteo's patch only
> uses the second one but that's because he's only implementing one type
> of estimator (no joinsel function for instance).
>
> I'd be inclined to create a new header file, too, instead of cluttering
> builtins.h with this stuff.  utils/selfuncs.h probably.
>
> I'm willing to work on this, but it doesn't look like you reverted the
> prior patch yet?

Uh, I just moved the selectivity function over to /contrib/ltree, and
moved what I needed, so it now works.  You can continue with the plan
above, or I can.

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [HACKERS] Enhanced containment selectivity function

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> I'm willing to work on this, but it doesn't look like you reverted the
>> prior patch yet?

> Uh, I just moved the selectivity function over to /contrib/ltree, and
> moved what I needed, so it now works.  You can continue with the plan
> above, or I can.

I'll see what I can do with it.  I've got an evening of watching my
laptop reinstall anyway (just replaced a failing hard drive :-()

            regards, tom lane

Re: [HACKERS] Enhanced containment selectivity function

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> I'm willing to work on this, but it doesn't look like you reverted the
> >> prior patch yet?
>
> > Uh, I just moved the selectivity function over to /contrib/ltree, and
> > moved what I needed, so it now works.  You can continue with the plan
> > above, or I can.
>
> I'll see what I can do with it.  I've got an evening of watching my
> laptop reinstall anyway (just replaced a failing hard drive :-()

Interesting. I took one of my laptops into service today and will get it
back tomorrow with a new empty harddrive.  One big thing I have found to
help is to document all laptop changes in a backed-up server file, so
when I get a new laptop, I can just go through the list and make the
changes in 1/2 hour.  I also keep installed program tarballs on the
server so those can be installed with the same versions as the other
laptops.

--
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [HACKERS] Enhanced containment selectivity function

From
Matteo Beccati
Date:
Hi,

Bruce Momjian ha scritto:
> Uh, I just moved the selectivity function over to /contrib/ltree, and
> moved what I needed, so it now works.  You can continue with the plan
> above, or I can.

I've just been able to install a 8.2-devel to test the ltree selectivity
  improvements I suggested.

It was a big surprise having no improvements at all in the query I used
for all my previous tests, until I found out that the test against
histogram values was removed by Tom. I strongly think it should be
reintroduced: ltree columns are likely to have a unique constraint and
the current ltreeparentsel function will behave just like contsel in
these cases.

Here is the commit:

> Log Message:
> -----------
> Fix ltreeparentsel so it actually works ...
>
> Modified Files:
> --------------
>     pgsql/contrib/ltree:
>         ltree_op.c (r1.10 -> r1.11)
>         (http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/ltree/ltree_op.c.diff?r1=1.10&r2=1.11)
>


Best regards
--
Matteo Beccati
http://phpadsnew.com
http://phppgads.com

Re: [HACKERS] Enhanced containment selectivity function

From
Tom Lane
Date:
Matteo Beccati <php@beccati.com> writes:
> It was a big surprise having no improvements at all in the query I used
> for all my previous tests, until I found out that the test against
> histogram values was removed by Tom. I strongly think it should be
> reintroduced: ltree columns are likely to have a unique constraint and
> the current ltreeparentsel function will behave just like contsel in
> these cases.

The histogram values seem completely meaningless in this context ---
for containment purposes, they are just ten or so randomly chosen
values.  I don't believe that the estimator works better with them.
Certainly, whether the column is unique or not is totally irrelevant
to whether they are representative.

What would seem saner to me is to add a datatype-specific analyze
function that collects some statistics that are actually relevant
to containment, and then make use of those in the estimator.

            regards, tom lane

Re: [HACKERS] Enhanced containment selectivity function

From
Matteo Beccati
Date:
Hi,

> The histogram values seem completely meaningless in this context ---
> for containment purposes, they are just ten or so randomly chosen
> values.  I don't believe that the estimator works better with them.
> Certainly, whether the column is unique or not is totally irrelevant
> to whether they are representative.

Right, but if the column has a high number of stats, I think that the
samples found in the histogram could put the estimator on the right way:
i.e. in my case 80% of the values have '1041' as their root leaf and
most of the values in the histogram reflect this.

You're right saying that the column uniqueness isn't relevant to the
histogram, but if the column is unique, there won't be any mcv, and the
patch becomes useless.

> What would seem saner to me is to add a datatype-specific analyze
> function that collects some statistics that are actually relevant
> to containment, and then make use of those in the estimator.

Perhaps you're right, but unfortunately it's not a thing I can do
myself, because of lack of knowledge about both pg and ltree internals :(


Best regards
--
Matteo Beccati
http://phpadsnew.com
http://phppgads.com