
From: Kyotaro HORIGUCHI
Subject: Re: Improve selectivity estimate for range queries
Msg-id: 20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp
In response to: Re: Improve selectivity estimate for range queries (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Hello.

At Fri, 21 Dec 2018 11:50:28 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <28533.1545411028@sss.pgh.pa.us>
> "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> writes:
> > From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> >> At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
> >> <008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
> >>> To handle such cases, I've come up with an idea based on a previous
> >>> commit [1], which solved a similar problem related to
> >>> DEFAULT_NUM_DISTINCT.  Just like that modification, I added a flag
> >>> which shows whether the selectivity
> 
> >> The commit [1] added almost no complexity, but this one does.  It
> >> touches many functions and introduces tighter coupling between
> >> operator-selectivity functions and clause-selectivity functions.
> >> isdefault travels alone too far just to tie together distant
> >> functions.  We might need a more principled way to handle that.
> 
> Yeah, I don't think that this approach is a reasonable tradeoff to
> eliminate a narrow case.  In particular, what's been done here to
> clause_selectivity is totally unprincipled and poorly thought out:
> you can't add an output parameter that's set in only some cases.
> Yet, there's no good way to decide how to set it in many cases.
> For instance, what would you do for an AND or OR where some of
> the branches are default estimates and some not?
> 
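
Right.  Just to illustrate that ambiguity with a hypothetical sketch
(none of this is from the patch; the combiners below just use the
usual independence assumptions, s1 * s2 for AND and s1 + s2 - s1 * s2
for OR): whatever single rule we pick for the combined flag throws
away information about which branches were guesses.

====
#include <stdbool.h>

typedef double Selectivity;

/* Hypothetical combiners, only to show the problem. */
Selectivity
and_sel(Selectivity s1, bool def1, Selectivity s2, bool def2,
        bool *isdefault)
{
  /*
   * If only one branch used a default guess, neither true nor false is
   * a faithful value for the combined flag; "any input was a default"
   * is one arbitrary choice among several.
   */
  *isdefault = def1 || def2;
  return s1 * s2;
}

Selectivity
or_sel(Selectivity s1, bool def1, Selectivity s2, bool def2,
       bool *isdefault)
{
  *isdefault = def1 || def2;    /* equally arbitrary */
  return s1 + s2 - s1 * s2;
}
====
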
> >> Around [1], there was a discussion about introducing the notion of
> >> uncertainty to selectivity.  isdefault is a kind of uncertainty
> >> value indicating '0% or 100% uncertain'.  So my feeling is that it
> >> would be better for Selectivity to carry a certainty component, and
> >> then to build selectivity arithmetic involving uncertainty, though
> >> I don't have a clear picture.
> 
> > I see.  Indeed, if we changed Selectivity so that it includes
> > information about default/non-default, the problems I mentioned
> > would be solved.  But I'm afraid such a modification would have a
> > big impact, since there are many functions that calculate
> > selectivity.  I don't have a concrete idea either, so I will think
> > about it.  Besides that, I'd like to ask the community whether such
> > a modification would be acceptable.
> 
> The planner definitely has a lot of problems that could be addressed
> with some sort of uncertainty framework ... but it'd be a huge
> undertaking, which is why nobody's tried yet.

Sure.
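
For what it's worth, the vague shape I had in mind for the "certainty
component" is no more than something like the following sketch (all
names and the combination rule here are made up, and I haven't thought
about how it would propagate through the planner):

====
typedef struct SelectivityEx
{
  double  sel;        /* estimated fraction of rows, 0.0 .. 1.0 */
  double  certainty;  /* 0.0 = pure default guess, 1.0 = from statistics */
} SelectivityEx;

/*
 * AND of independent clauses: the selectivities multiply; one possible
 * rule is that the result is no more certain than its least certain
 * input.
 */
SelectivityEx
sel_and(SelectivityEx a, SelectivityEx b)
{
  SelectivityEx r;

  r.sel = a.sel * b.sel;
  r.certainty = (a.certainty < b.certainty) ? a.certainty : b.certainty;
  return r;
}
====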

> A smaller-footprint way to fix the immediate problem might be to
> change the values of DEFAULT_INEQ_SEL and friends so that they're
> even less likely to be matched by accident.  That is, instead of
> 0.3333333333333333, use 0.333333333333342 or some such.  It might

Sounds reasonable.

> seem that that's just moving the problem around, but I think it
> might be possible to show that such a value couldn't be computed
> by scalarltsel given a histogram with no more than 10000 members.
> (I haven't tried to actually prove that, but it seems intuitive
> that the set of possible results would be quantized with no more
> than about 5 digits precision.)
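
If I understand correctly, the change itself would then be as small as
perturbing the constants in src/include/utils/selfuncs.h, roughly like
this (only DEFAULT_INEQ_SEL shown; the digits are the ones from your
mail, and the other DEFAULT_*_SEL values would get the same treatment
as needed):

====
/*
 * Default selectivity estimate for inequalities such as "A < b".
 * Deliberately not a value that scalarltsel() should be able to
 * compute from a realistic histogram, instead of the current
 * 0.3333333333333333.
 */
#define DEFAULT_INEQ_SEL  0.333333333333342
====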

FWIW, I got the following result on my environment.  The new value
seems different enough, if this holds on all supported platforms,
though there could still be cases where a longer sequence of
arithmetic happens to produce a false match.

x = 0.33333333333333331
   last matching d:      1.000000e+30
   first non-matching d: 1.000000e+31
x = 0.33333333333334197
   last matching d:      0.000000e+00
   first non-matching d: 1.000000e+00

==== t.c
#include <stdio.h>
#include <math.h>

/*
 * Check whether x can be produced as d / floor(d * 3) or
 * d / (floor(d * 3) + 1) for d = 1, 10, 100, ...
 */
void test(double x)
{
  double d = 1.0;
  double d0 = 0;

  fprintf(stderr, "x = %19.17f\n", x);
  while (d != d0)
  {
    double z = floor(d * 3);
    double z1 = z + 1.0;
    double y = d / z;
    double y1 = d / z1;

    /* Stop once neither side of d * 3 yields a match. */
    if (y != x && y1 != x)
    {
      fprintf(stderr, "   last matching d:      %e\n", d0);
      fprintf(stderr, "   first non-matching d: %e\n", d);
      return;
    }

    d0 = d;
    d = d * 10;
  }
}

int main(void)
{
  test(0.3333333333333333);
  test(0.333333333333342);

  return 0;
}
====

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


