Thread: Improve selectivity estimate for range queries

Improve selectivity estimate for range queries

From
"Yuzuko Hosoya"
Date:
Hi,

I found a problem with the selectivity estimate for range queries
on master, as follows.  This is my test case:

postgres=# CREATE TABLE test(id int, val text);
CREATE TABLE
postgres=# INSERT INTO test SELECT i, 'testval' FROM generate_series(0,29999)i;
INSERT 0 30000
postgres=# VACUUM analyze test;
VACUUM
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE id > 0 and id < 10000;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..613.00 rows=150 width=12) (actual time=0.044..21.371 rows=9999 loops=1)
   Filter: ((id > 0) AND (id < 10000))
   Rows Removed by Filter: 20001
 Planning Time: 0.099 ms
 Execution Time: 37.142 ms
(5 rows)

Although the actual number of rows was 9999, the estimated number
of rows was 150.

So I dug into it to see what happened, and I think that the following part
of clauselist_selectivity causes this problem.

------------
/*
  * Now scan the rangequery pair list.
  */
 while (rqlist != NULL)
 {
     RangeQueryClause *rqnext;

     if (rqlist->have_lobound && rqlist->have_hibound)
     {
         /* Successfully matched a pair of range clauses */
         Selectivity s2;

         /*
          * Exact equality to the default value probably means the
          * selectivity function punted.  This is not airtight but should
          * be good enough.
          */
         if (rqlist->hibound == DEFAULT_INEQ_SEL ||
             rqlist->lobound == DEFAULT_INEQ_SEL)
         {
             s2 = DEFAULT_RANGE_INEQ_SEL;
         }
         else
         {
             s2 = rqlist->hibound + rqlist->lobound - 1.0;
------------

In my environment, the selectivity for id > 0 was 0.99990000000000001,
and the selectivity for id < 10000 was 0.33333333333333331.  Then, the
values of rqlist->hibound and rqlist->lobound were set accordingly.
On the other hand, DEFAULT_INEQ_SEL is 0.3333333333333333.  As a result,
the final selectivity was computed as DEFAULT_RANGE_INEQ_SEL,
since the condition of the second if statement was satisfied.  However,
these selectivities were computed from the statistics, so the
final selectivity should have been calculated as rqlist->hibound + rqlist->lobound - 1.0.
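
For illustration, the accidental equality is easy to reproduce outside the
planner.  The following standalone program only mimics the arithmetic with a
simplified stand-in for the histogram computation (it is not the real planner
code), but it shows how a statistics-derived fraction can compare equal to
the DEFAULT_INEQ_SEL literal:

------------
#include <stdio.h>

/* same literal as in src/include/utils/selfuncs.h */
#define DEFAULT_INEQ_SEL  0.3333333333333333

int main(void)
{
    /*
     * Simplified stand-in for the histogram arithmetic: "id < 10000" out of
     * 30000 evenly distributed rows is roughly 10000/30000, and on my
     * machine that division yields exactly the same double as the
     * DEFAULT_INEQ_SEL literal.
     */
    double hibound = 10000.0 / 30000.0;     /* estimate for "id < 10000" */
    double lobound = 0.9999;                /* estimate for "id > 0" */

    if (hibound == DEFAULT_INEQ_SEL || lobound == DEFAULT_INEQ_SEL)
        printf("falls back to DEFAULT_RANGE_INEQ_SEL (0.005)\n");
    else
        printf("s2 = %g\n", hibound + lobound - 1.0);

    return 0;
}
------------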

My test case might be uncommon, but I think it could cause some problems.
To handle such cases, I came up with an idea based on a previous commit [1]
which solved a similar problem related to DEFAULT_NUM_DISTINCT.  Just like
that modification, I added a flag to clauselist_selectivity which indicates
whether the selectivity was calculated from the statistics or not, and used
it as the condition of the if statement instead of
rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
I am afraid the changes were larger than I had expected, but this way we can
estimate the selectivity correctly.
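
Roughly speaking, the effect on the check quoted above is as follows (a
simplified sketch; the actual patch has to touch more places to pass the
flag down):

------------
 /* current code: a default estimate is detected by its exact value */
 if (rqlist->hibound == DEFAULT_INEQ_SEL ||
     rqlist->lobound == DEFAULT_INEQ_SEL)
     s2 = DEFAULT_RANGE_INEQ_SEL;
 else
     s2 = rqlist->hibound + rqlist->lobound - 1.0;

 /* with the patch: rely on flags filled in by the operator's selectivity
  * function instead of comparing against the constant */
 if (rqlist->hibound_isdefault || rqlist->lobound_isdefault)
     s2 = DEFAULT_RANGE_INEQ_SEL;
 else
     s2 = rqlist->hibound + rqlist->lobound - 1.0;
------------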

Patch attached.  Do you have any thoughts?

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce

Best regards,
Yuzuko Hosoya
NTT Open Source Software Center


Re: Improve selectivity estimate for range queries

From
Kyotaro HORIGUCHI
Date:
Hello.

At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
<008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
> In my environment, the selectivity for id > 0 was 0.99990000000000001,
> and the selectivity for id < 10000 was 0.33333333333333331.  Then, the
> values of rqlist->hibound and rqlist->lobound were set accordingly.
> On the other hand, DEFAULT_INEQ_SEL is 0.3333333333333333.  As a result,
> the final selectivity was computed as DEFAULT_RANGE_INEQ_SEL,
> since the condition of the second if statement was satisfied.  However,
> these selectivities were computed from the statistics, so the
> final selectivity should have been calculated as rqlist->hibound + rqlist->lobound - 1.0.
> My test case might be uncommon, but I think it could cause some problems.

Yeah, it's known behavior, as the comment just above says.  If, in your
example query, the table were very large and had an index, it could
wrongly run an index scan over a fair amount of tuples.  But I'm not
sure how much it matters, and your example doesn't show that.

The behavior discussed here was introduced by this commit.

| commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
| Author: Tom Lane <tgl@sss.pgh.pa.us>
| Date:   Tue Nov 9 00:34:46 2004 +0000
| 
|     Use a hopefully-more-reliable method of detecting default selectivity
|     estimates when combining the estimates for a range query.  As pointed out
|     by Miquel van Smoorenburg, the existing check for an impossible combined
|     result would quite possibly fail to detect one default and one non-default
|     input.  It seems better to use the default range query estimate in such
|     cases.  To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
|     This is a bit ugly because it introduces additional coupling between
|     clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
|     there wasn't plenty already...

> To handle such cases, I came up with an idea based on a previous commit [1]
> which solved a similar problem related to DEFAULT_NUM_DISTINCT.  Just like
> that modification, I added a flag to clauselist_selectivity which indicates
> whether the selectivity

The commit [1] added almost no complexity, but this one does.  It affects
many functions and introduces tighter coupling between
operator-selectivity functions and clause selectivity functions.
isdefault travels alone through too many layers just to connect distant
functions.  We might need a more principled way to handle that.

> was calculated from the statistics or not, and used it as the condition
> of the if statement instead of
> rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
> I am afraid the changes were larger than I had expected, but this way we can
> estimate the selectivity correctly.
> 
> Patch attached.  Do you have any thoughts?

I've just skimmed the patch, but I have some comments.

+            if (isdefault)
+                rqelem->lobound_isdefault = true;

This takes the disjunction of every lobound ever added; I suppose
it should be the last lobound's isdefault.

As you can see, four other instances of "selec ==/!= DEFAULT_*" exist
in mergejoinsel.  Don't they lead to similar kinds of problems?


     Selectivity lobound;        /* Selectivity of a var > something clause */
     Selectivity hibound;        /* Selectivity of a var < something clause */
+    bool        lobound_isdefault;    /* lobound is a default selectivity? */
+    bool        hibound_isdefault;    /* hibound is a default selectivity? */
 } RangeQueryClause;

Around the time of [1], there was a discussion about introducing the
notion of uncertainty to selectivity.  The isdefault flag is a kind of
uncertainty value indicating '0% or 100% uncertain'.  So my feeling is
that it would be better for Selectivity to have a certainty component,
and then to build selectivity arithmetic that involves uncertainty,
though I don't have a clear picture.
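
Just to show the direction I have in mind, a toy sketch (only my rough
picture put together for this mail, not a proposal) might look like this:

====
#include <stdio.h>

/* DEFAULT_RANGE_INEQ_SEL, used when we know (almost) nothing */
#define RANGE_FALLBACK 0.005

typedef struct SelectivityEx
{
    double      sel;        /* estimated fraction of rows, 0..1 */
    double      certainty;  /* 0 = pure default guess, 1 = from statistics */
} SelectivityEx;

/*
 * Combine "a > x AND a < y": weight the computed estimate by how much we
 * actually know, instead of testing for one magic default value.
 */
static SelectivityEx
range_and(SelectivityEx lo, SelectivityEx hi)
{
    SelectivityEx r;
    double      c = lo.certainty < hi.certainty ? lo.certainty : hi.certainty;
    double      s = hi.sel + lo.sel - 1.0;

    if (s < 0.0)
        s = 0.0;
    r.sel = c * s + (1.0 - c) * RANGE_FALLBACK;
    r.certainty = c;
    return r;
}

int main(void)
{
    SelectivityEx lo = {0.9999, 1.0};       /* "id > 0", from statistics */
    SelectivityEx hi = {1.0 / 3.0, 1.0};    /* "id < 10000", from statistics */

    printf("combined sel = %g\n", range_and(lo, hi).sel);
    return 0;
}
====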


> [1]
> https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: Improve selectivity estimate for range queries

From
"Yuzuko Hosoya"
Date:
Hi,

Thanks for the comments.
I attach the v2 patch.

> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> Sent: Friday, December 21, 2018 12:25 PM
> 
> Hello.
> 
> At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
> <008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
> > In my environment, the selectivity for id > 0 was 0.99990000000000001,
> > and the selectivity for id < 10000 was 0.33333333333333331.  Then, the
> > values of rqlist->hibound and rqlist->lobound were set accordingly.
> > On the other hand, DEFAULT_INEQ_SEL is 0.3333333333333333.  As a
> > result, the final selectivity was computed as DEFAULT_RANGE_INEQ_SEL,
> > since the condition of the second if statement was satisfied.  However,
> > these selectivities were computed from the statistics, so the final
> > selectivity should have been calculated as
> > rqlist->hibound + rqlist->lobound - 1.0.
> > My test case might be uncommon, but I think it could cause some problems.
> 
> Yeah, it's known behavior, as the comment just above says.  If, in your example query, the table
> were very large and had an index, it could wrongly run an index scan over a fair amount of tuples.
> But I'm not sure how much it matters, and your example doesn't show that.
> 
> The behavior discussed here was introduced by this commit.
> 
> | commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
> | Author: Tom Lane <tgl@sss.pgh.pa.us>
> | Date:   Tue Nov 9 00:34:46 2004 +0000
> |
> |     Use a hopefully-more-reliable method of detecting default selectivity
> |     estimates when combining the estimates for a range query.  As pointed out
> |     by Miquel van Smoorenburg, the existing check for an impossible combined
> |     result would quite possibly fail to detect one default and one non-default
> |     input.  It seems better to use the default range query estimate in such
> |     cases.  To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
> |     This is a bit ugly because it introduces additional coupling between
> |     clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
> |     there wasn't plenty already...
> 
> > To handle such cases, I came up with an idea based on a previous
> > commit [1] which solved a similar problem related to
> > DEFAULT_NUM_DISTINCT.  Just like that modification, I added a flag to
> > clauselist_selectivity which indicates whether the selectivity
> 
> The commit [1] added almost no complexity, but this one does.  It affects many functions and
> introduces tighter coupling between operator-selectivity functions and clause selectivity functions.
> isdefault travels alone through too many layers just to connect distant functions.  We might need
> a more principled way to handle that.
>
Yes, indeed.  But I have not found a better way than this approach yet.
 
> > was calculated from the statistics or not, and used it as the condition
> > of the if statement instead of
> > rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
> > I am afraid the changes were larger than I had expected, but this way
> > we can estimate the selectivity correctly.
> >
> > Patch attached.  Do you have any thoughts?
> 
> I've just skimmed the patch, but I have some comments.
> 
> +            if (isdefault)
> +                rqelem->lobound_isdefault = true;
> 
> This takes the disjunction of every lobound ever added; I suppose it should be the
> last lobound's isdefault.
> 
Indeed.  I fixed it.

> As you can see, four other instances of "selec ==/!= DEFAULT_*" exist in mergejoinsel.  Don't they
> lead to similar kinds of problems?
>
I didn't encounter similar problems, but as you said, such problems could occur due to the
condition selec != DEFAULT_INEQ_SEL.  I changed these conditions to "!isdefault".

> 
> 
>      Selectivity lobound;        /* Selectivity of a var > something clause */
>      Selectivity hibound;        /* Selectivity of a var < something clause */
> +    bool        lobound_isdefault;    /* lobound is a default selectivity? */
> +    bool        hibound_isdefault;    /* hibound is a default selectivity? */
>  } RangeQueryClause;
> 
> Around the time of [1], there was a discussion about introducing the notion of uncertainty to
> selectivity.  The isdefault flag is a kind of uncertainty value indicating '0% or 100% uncertain'.
> So my feeling is that it would be better for Selectivity to have a certainty component, and then to
> build selectivity arithmetic that involves uncertainty, though I don't have a clear picture.
>
I see.  Indeed, if we changed Selectivity so that it includes
information about default/non-default, the problems I mentioned
would be solved.  But I'm afraid that such a modification would have
a big impact, since there are lots of functions that calculate
selectivity.  I also don't have a concrete idea yet, so I will think
about it.  Besides that, I'd like to ask the community whether such
a modification would be acceptable or not.

Best regards,

Yuzuko Hosoya
NTT Open Source Software Center

> 
> > [1]
> > https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce
> 
> regards.
> 
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center



Re: Improve selectivity estimate for range queries

From
Tom Lane
Date:
"Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> writes:
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
>> At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
>> <008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
>>> To handle such cases, I came up with an idea based on a previous
>>> commit [1] which solved a similar problem related to
>>> DEFAULT_NUM_DISTINCT.  Just like that modification, I added a flag to
>>> clauselist_selectivity which indicates whether the selectivity

>> The commit [1] added almost no complexity, but this one does.  It affects
>> many functions and introduces tighter coupling between operator-selectivity
>> functions and clause selectivity functions.  isdefault travels alone
>> through too many layers just to connect distant functions.  We might need
>> a more principled way to handle that.

Yeah, I don't think that this approach is a reasonable tradeoff to
eliminate a narrow case.  In particular, what's been done here to
clause_selectivity is totally unprincipled and poorly thought out:
you can't add an output parameter that's set in only some cases.
Yet, there's no good way to decide how to set it in many cases.
For instance, what would you do for an AND or OR where some of
the branches are default estimates and some not?

>> Around the time of [1], there was a discussion about introducing the
>> notion of uncertainty to selectivity.  The isdefault flag is a kind of
>> uncertainty value indicating '0% or 100% uncertain'.  So my feeling is
>> that it would be better for Selectivity to have a certainty component,
>> and then to build selectivity arithmetic that involves uncertainty,
>> though I don't have a clear picture.

> I see.  Indeed, if we changed Selectivity so that it includes
> information about default/non-default, the problems I mentioned
> would be solved.  But I'm afraid that such a modification would have
> a big impact, since there are lots of functions that calculate
> selectivity.  I also don't have a concrete idea yet, so I will think
> about it.  Besides that, I'd like to ask the community whether such
> a modification would be acceptable or not.

The planner definitely has a lot of problems that could be addressed
with some sort of uncertainty framework ... but it'd be a huge
undertaking, which is why nobody's tried yet.

A smaller-footprint way to fix the immediate problem might be to
change the values of DEFAULT_INEQ_SEL and friends so that they're
even less likely to be matched by accident.  That is, instead of
0.3333333333333333, use 0.333333333333342 or some such.  It might
seem that that's just moving the problem around, but I think it
might be possible to show that such a value couldn't be computed
by scalarltsel given a histogram with no more than 10000 members.
(I haven't tried to actually prove that, but it seems intuitive
that the set of possible results would be quantized with no more
than about 5 digits precision.)

            regards, tom lane


Re: Improve selectivity estimate for range queries

From
Kyotaro HORIGUCHI
Date:
Hello.

At Fri, 21 Dec 2018 11:50:28 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <28533.1545411028@sss.pgh.pa.us>
> "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> writes:
> > From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> >> At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
> >> <008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
> >>> To handle such cases, I came up with an idea based on a previous
> >>> commit [1] which solved a similar problem related to
> >>> DEFAULT_NUM_DISTINCT.  Just like that modification, I added a flag to
> >>> clauselist_selectivity which indicates whether the selectivity
> 
> >> The commit [1] added almost no complexity, but this one does.  It affects
> >> many functions and introduces tighter coupling between operator-selectivity
> >> functions and clause selectivity functions.  isdefault travels alone
> >> through too many layers just to connect distant functions.  We might need
> >> a more principled way to handle that.
> 
> Yeah, I don't think that this approach is a reasonable tradeoff to
> eliminate a narrow case.  In particular, what's been done here to
> clause_selectivity is totally unprincipled and poorly thought out:
> you can't add an output parameter that's set in only some cases.
> Yet, there's no good way to decide how to set it in many cases.
> For instance, what would you do for an AND or OR where some of
> the branches are default estimates and some not?
> 
> >> Around the time of [1], there was a discussion about introducing the
> >> notion of uncertainty to selectivity.  The isdefault flag is a kind of
> >> uncertainty value indicating '0% or 100% uncertain'.  So my feeling is
> >> that it would be better for Selectivity to have a certainty component,
> >> and then to build selectivity arithmetic that involves uncertainty,
> >> though I don't have a clear picture.
> 
> > I see.  Indeed, if we changed Selectivity so that it includes
> > information about default/non-default, the problems I mentioned
> > would be solved.  But I'm afraid that such a modification would have
> > a big impact, since there are lots of functions that calculate
> > selectivity.  I also don't have a concrete idea yet, so I will think
> > about it.  Besides that, I'd like to ask the community whether such
> > a modification would be acceptable or not.
> 
> The planner definitely has a lot of problems that could be addressed
> with some sort of uncertainty framework ... but it'd be a huge
> undertaking, which is why nobody's tried yet.

Sure.

> A smaller-footprint way to fix the immediate problem might be to
> change the values of DEFAULT_INEQ_SEL and friends so that they're
> even less likely to be matched by accident.  That is, instead of
> 0.3333333333333333, use 0.333333333333342 or some such.  It might

Sounds reasonable.

> seem that that's just moving the problem around, but I think it
> might be possible to show that such a value couldn't be computed
> by scalarltsel given a histogram with no more than 10000 members.
> (I haven't tried to actually prove that, but it seems intuitive
> that the set of possible results would be quantized with no more
> than about 5 digits precision.)

FWIW, I got the following result in my environment.  It seems
different enough, if this holds on all supported platforms, though
there is still a case where the result of a sequence of arithmetic
operations makes a false match.

x = 0.33333333333333331
   d = 1.000000e+30 match
   d = 1.000000e+31: 0.000000 match
x = 0.33333333333334197
   d = 0.000000e+00 match
   d = 1.000000e+00: 0.000000 match

==== t.c
#include <stdio.h>
#include <math.h>

int test(double x)
{
  double d = 1.0;
  double d0 = 0;

  fprintf(stderr, "x = %19.17f\n", x);
  while (d != d0)
  {
    double z = floor(d * 3);
    double z1 = z + 1.0;
    double y = d / z;
    double y1 = d / z1;

    /* Check whether neither side of d * 3 makes a match */
    if (y != x && y1 != x)
    {
      fprintf(stderr, "   d = %e match\n", d0);
      fprintf(stderr, "   d = %e: %f match\n", d);
      return 0;
    }

    d0 = d;
    d = d * 10;
  }
}

int main(void)
{
  test(0.3333333333333333);
  test(0.333333333333342);

  return 0;
}
====
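
A more brute-force way to check Tom's claim for plain histogram fractions
is to enumerate every i/n with n <= 10000 and compare against the candidate
constants; something along these lines (again only a quick sketch):

====
#include <stdio.h>

/* candidate replacement values mentioned in the thread */
static const double candidates[] = {0.333333333333342, 0.33333333333333342};

int main(void)
{
  int n, i, c;
  long matches = 0;

  /* does any plain fraction i/n with n <= 10000 round to a candidate? */
  for (n = 1; n <= 10000; n++)
    for (i = 0; i <= n; i++)
    {
      double frac = (double) i / (double) n;

      for (c = 0; c < 2; c++)
        if (frac == candidates[c])
        {
          printf("match: %d/%d == %.17g\n", i, n, candidates[c]);
          matches++;
        }
    }

  printf("%ld matches found\n", matches);
  return 0;
}
====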

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Improve selectivity estimate for range queries

From
Kyotaro HORIGUCHI
Date:
Mmm.

At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp>
> FWIW, I got the following result in my environment.  It seems
> different enough, if this holds on all supported platforms, though
> there is still a case where the result of a sequence of arithmetic
> operations makes a false match.
> 
> x = 0.33333333333333331
>    d = 1.000000e+30 match
>    d = 1.000000e+31: 0.000000 match

Of course the "match" in the last line above is a mistake of "NOT
match".


-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Improve selectivity estimate for range queries

From
Kyotaro HORIGUCHI
Date:
Sigh... In the first place, the loop should not stop until it reaches the maximum exponent.
Sorry, but I don't have time right now.

On Tue, Jan 8, 2019 at 16:43, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Mmm.

At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp>
> FWIW, I got the following result in my environment.  It seems
> different enough, if this holds on all supported platforms, though
> there is still a case where the result of a sequence of arithmetic
> operations makes a false match.
>
> x = 0.33333333333333331
>    d = 1.000000e+30 match
>    d = 1.000000e+31: 0.000000 match

Of course the "match" in the last line above is a mistake of "NOT
match".
--
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Improve selectivity estimate for range queries

From
Kyotaro HORIGUCHI
Date:
At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello.
> 
> At Fri, 21 Dec 2018 11:50:28 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <28533.1545411028@sss.pgh.pa.us>
> > seem that that's just moving the problem around, but I think it
> > might be possible to show that such a value couldn't be computed
> > by scalarltsel given a histogram with no more than 10000 members.
> > (I haven't tried to actually prove that, but it seems intuitive
> > that the set of possible results would be quantized with no more
> > than about 5 digits precision.)

I don't think we need a perfect proof of that.  The fact that
exactly 1/3 is quite natural and common, while 1/3 + ε is not, should
be enough.

> FWIW, I got the following result in my environment.  It seems
> different enough, if this holds on all supported platforms, though
> there is still a case where the result of a sequence of arithmetic
> operations makes a false match.

A simple selectivity of a relation theoretically cannot match the
value with the epsilon added.  (Of course, on *my* environment.)

(0.333..)
binary format: 3f d5 55 55 55 55 55 55
x = 0.333333333333333315
   231 matches, 79 no_matches

(0.3{13}42..)
binary format: 3f d5 55 55 55 55 55 f1
x = 0.333333333333341975
   0 matches, 310 no_matches

(0.3{15}42..)
binary format: 3f d5 55 55 55 55 55 57
x = 0.333333333333333426
   0 matches, 310 no_matches


It seems that 0.3{13}42 was really meant to be 0.3{15}42, which
differs from 1/3 by just two LSBs.  I believe C is well standardized
on this literal-to-double conversion.  The other DEFAULT_*_SELs were
not compared in this way.

The attached small patch fixes the case mentioned in this thread,
but I'm not sure where to put a test.  A static assertion is not
usable.  An assertion somewhere, perhaps in clauselist_selectivity,
seems somewhat overdone.  I can't find a suitable place in the
regression tests.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#include <stdio.h>
#include <math.h>

int test(double x)
{
  double d = 1.0;
  double d0 = 0;
  unsigned char *c_x = (unsigned char *) &x;
  int nmatches = 0;
  int nnomatches = 0;
  int i;
  
  fprintf(stderr, "binary format: ");
  for (i = 7 ; i >= 0 ; i--)
    fprintf(stderr, "%s%02x", i < 7 ? " " : "", c_x[i]);
  fprintf(stderr, "\n");

  fprintf(stderr, "x = %20.18f\n", x);
  while (d != d0)
  {
    double z = floor(d * 3);
    double z1 = z + 1.0;
    double y = d / z;
    double y1 = d / z1;

    /* Check whether either side of d * 3 makes a match */
    if (y == x || y1 == x)
      nmatches++;
    else
      nnomatches++;

    d0 = d;
    d = d * 10;
  }

  fprintf(stderr, "   %d matches, %d no_matches\n", nmatches, nnomatches);

}

int main(void)
{
  test(0.3333333333333333);
  test(0.333333333333342);
  test(0.33333333333333342);

  return 0;
}
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 3739b9817a..cdeaac22c8 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -109,7 +109,7 @@ clauselist_selectivity(PlannerInfo *root,
     ListCell   *l;
     int            listidx;
 
-    /*
+     /*
      * If there's exactly one clause, just go directly to
      * clause_selectivity(). None of what we might do below is relevant.
      */
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 5cc4cf15e2..15a8d2402a 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -33,8 +33,12 @@
 /* default selectivity estimate for equalities such as "A = b" */
 #define DEFAULT_EQ_SEL    0.005
 
-/* default selectivity estimate for inequalities such as "A < b" */
-#define DEFAULT_INEQ_SEL  0.3333333333333333
+/*
+ * default selectivity estimate for inequalities such as "A < b"
+ * The last two digits prevent it from making a false match with 1/3 computed
+ * from histogram and/or MCV.
+ */
+#define DEFAULT_INEQ_SEL  0.33333333333333342
 
 /* default selectivity estimate for range inequalities "A > b AND A < c" */
 #define DEFAULT_RANGE_INEQ_SEL    0.005

Re: Improve selectivity estimate for range queries

From
Robert Haas
Date:
On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> A smaller-footprint way to fix the immediate problem might be to
> change the values of DEFAULT_INEQ_SEL and friends so that they're
> even less likely to be matched by accident.  That is, instead of
> 0.3333333333333333, use 0.333333333333342 or some such.  It might
> seem that that's just moving the problem around, but I think it
> might be possible to show that such a value couldn't be computed
> by scalarltsel given a histogram with no more than 10000 members.
> (I haven't tried to actually prove that, but it seems intuitive
> that the set of possible results would be quantized with no more
> than about 5 digits precision.)

That's not a dumb idea, but it seems pretty unprincipled to me, and to
be honest I'm kind of surprised that you're not proposing something
cleaner.  If you admit the need to distinguish between real estimates
and last-ditch ones, why shouldn't we have an explicit representation
of that rather than trying to pick a last-ditch value that is unlikely
to be confused with a real one?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Improve selectivity estimate for range queries

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A smaller-footprint way to fix the immediate problem might be to
>> change the values of DEFAULT_INEQ_SEL and friends so that they're
>> even less likely to be matched by accident.  That is, instead of
>> 0.3333333333333333, use 0.333333333333342 or some such.

> That's not a dumb idea, but it seems pretty unprincipled to me, and to
> be honest I'm kind of surprised that you're not proposing something
> cleaner.

The problem is the invasiveness of such a change (large) vs the benefit
(not so large).  The upthread patch attempted to add a separate signaling
path, but it was very incomplete --- and yet both I and Horiguchi-san
thought it was already too messy.

Maybe at some point we'll go over to something reasonably principled,
like adding confidence intervals to all selectivity estimates.  That
would be *really* invasive but perhaps would bring enough benefit to
justify itself.  But the current patch is just attempting to fix one
extremely narrow pain point, and if that is all it's doing it should
have a commensurately small footprint.  So I don't think the submitted
patch looks good from a cost/benefit standpoint.

            regards, tom lane


RE: Improve selectivity estimate for range queries

From
"Yuzuko Hosoya"
Date:
Hi,

Thanks for the comments, and I'm sorry for the late reply.

> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Friday, January 11, 2019 7:04 AM
> > Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> A smaller-footprint way to fix the immediate problem might be to
> >> change the values of DEFAULT_INEQ_SEL and friends so that they're
> >> even less likely to be matched by accident.  That is, instead of
> >> 0.3333333333333333, use 0.333333333333342 or some such.
> 
> > That's not a dumb idea, but it seems pretty unprincipled to me, and to
> > be honest I'm kind of surprised that you're not proposing something
> > cleaner.
> 
> The problem is the invasiveness of such a change (large) vs the benefit (not so large).  The
> upthread patch attempted to add a separate signaling path, but it was very incomplete --- and yet
> both I and Horiguchi-san thought it was already too messy.
> 
> Maybe at some point we'll go over to something reasonably principled, like adding confidence
> intervals to all selectivity estimates.  That would be *really* invasive but perhaps would bring
> enough benefit to justify itself.  But the current patch is just attempting to fix one extremely
> narrow pain point, and if that is all it's doing it should have a commensurately small footprint.
> So I don't think the submitted patch looks good from a cost/benefit standpoint.
> 
Yes, I agree with you.  Indeed, the patch I attached is insufficient in terms of cost-effectiveness.
However, I want to solve the problem of real estimates that happen to equal the default
values, as in this case, even though it's a narrow pain point.  So I tried to distinguish
explicitly between real estimates and default ones, as Robert said.

The idea Tom proposed and Horiguchi-san tried seems reasonable, but I'm concerned about whether
range queries really cannot match 0.333333333333342 (or some such) by accident in any
environment.  Is the check Horiguchi-san did enough to prove that?


Best regards,
Yuzuko Hosoya
NTT Open Source Software Center




Re: Improve selectivity estimate for range queries

From
Kyotaro HORIGUCHI
Date:
Hi.

At Fri, 11 Jan 2019 11:36:47 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
<000001d4a956$806a2ab0$813e8010$@lab.ntt.co.jp>
> Hi,
> 
> Thanks for the comments, and I'm sorry for the late reply.
> 
> > From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> > Sent: Friday, January 11, 2019 7:04 AM
> > > Robert Haas <robertmhaas@gmail.com> writes:
> > > On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> A smaller-footprint way to fix the immediate problem might be to
> > >> change the values of DEFAULT_INEQ_SEL and friends so that they're
> > >> even less likely to be matched by accident.  That is, instead of
> > >> 0.3333333333333333, use 0.333333333333342 or some such.
> > 
> > > That's not a dumb idea, but it seems pretty unprincipled to me, and to
> > > be honest I'm kind of surprised that you're not proposing something
> > > cleaner.
> > 
> > The problem is the invasiveness of such a change (large) vs the benefit (not so large).  The
> > upthread patch attempted to add a separate signaling path, but it was very incomplete --- and yet
> > both I and Horiguchi-san thought it was already too messy.
> > 
> > Maybe at some point we'll go over to something reasonably principled, like adding confidence
> > intervals to all selectivity estimates.  That would be *really* invasive but perhaps would bring
> > enough benefit to justify itself.  But the current patch is just attempting to fix one extremely
> > narrow pain point, and if that is all it's doing it should have a commensurately small footprint.
> > So I don't think the submitted patch looks good from a cost/benefit standpoint.
> > 
> Yes, I agree with you.  Indeed, the patch I attached is insufficient in terms of cost-effectiveness.
> However, I want to solve the problem of real estimates that happen to equal the default
> values, as in this case, even though it's a narrow pain point.  So I tried to distinguish
> explicitly between real estimates and default ones, as Robert said.
> 
> The idea Tom proposed and Horiguchi-san tried seems reasonable, but I'm concerned about whether
> range queries really cannot match 0.333333333333342 (or some such) by accident in any
> environment.  Is the check Horiguchi-san did enough to prove that?

It cannot be perfect in principle, but I think it is reliable
enough.  This is not principled at all, but it is very effective
relative to its complexity, I think.

Actually, the values of i/(i*3+1) for i = 10^0 .. 10^15 are:

                 1/                 4:binary format: 3f d0 00 00 00 00 00 00
                10/                31:binary format: 3f d4 a5 29 4a 52 94 a5
               100/               301:binary format: 3f d5 43 30 7a 78 c5 51
              1000/              3001:binary format: 3f d5 53 83 74 70 f1 95
             10000/             30001:binary format: 3f d5 55 26 bb 44 2b a8
            100000/            300001:binary format: 3f d5 55 50 ac 4a 74 6d
           1000000/           3000001:binary format: 3f d5 55 54 de 07 5a 96
          10000000/          30000001:binary format: 3f d5 55 55 49 67 22 6d
         100000000/         300000001:binary format: 3f d5 55 55 54 23 e9 d7
        1000000000/        3000000001:binary format: 3f d5 55 55 55 36 ca 95
       10000000000/       30000000001:binary format: 3f d5 55 55 55 52 47 75
      100000000000/      300000000001:binary format: 3f d5 55 55 55 55 07 25
     1000000000000/     3000000000001:binary format: 3f d5 55 55 55 55 4d 84
    10000000000000/    30000000000001:binary format: 3f d5 55 55 55 55 54 8d
   100000000000000/   300000000000001:binary format: 3f d5 55 55 55 55 55 41
  1000000000000000/  3000000000000001:binary format: 3f d5 55 55 55 55 55 53

So (as Tom said upthread), intuitively we can expect that we are
safe up to roughly 10^10 or so for the simple cases.

# Of course, involving histograms makes things more complex, though.
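
(For reference, a loop like the following, a rough reconstruction that
assumes a little-endian double layout as in the earlier test program,
dumps such a table.)

====
#include <stdio.h>

int main(void)
{
  double i = 1.0;
  int k;

  /* dump i / (i*3 + 1) for i = 10^0 .. 10^15 with its raw byte pattern */
  for (k = 0; k <= 15; k++, i *= 10.0)
  {
    double x = i / (i * 3.0 + 1.0);
    unsigned char *c = (unsigned char *) &x;
    int b;

    printf("%18.0f/%18.0f:binary format:", i, i * 3.0 + 1.0);
    for (b = 7; b >= 0; b--)
      printf(" %02x", c[b]);
    printf("\n");
  }

  return 0;
}
====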

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center