Re: Risk Estimation WAS: Planner hints in Postgresql - Mailing list pgsql-hackers

From Atri Sharma
Subject Re: Risk Estimation WAS: Planner hints in Postgresql
Date
Msg-id CAOeZVifb5TEKiKjxJDsb_LZTk=ufwjuVK99x4qetx5XyLfSVcg@mail.gmail.com
In response to Re: Risk Estimation WAS: Planner hints in Postgresql  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Risk Estimation WAS: Planner hints in Postgresql
List pgsql-hackers



On Thu, Mar 20, 2014 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Mar 18, 2014 at 2:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Atri Sharma <atri.jiit@gmail.com> writes:
>> One of the factors that leads to bad estimates is that the histogram of the
>> values of a column maintained by the planner gets old by time and the data
>> in the column changes. So, the histogram is no longer a quite accurate view
>> of the data and it leads to bad selectivity.
>
> TBH, this is so far down the list of problems that it'll be a long time
> before we need to worry about it.  It's certainly not the number one
> priority for any project to model risk in the planner.
>
> The thing that I think is probably the number one problem is estimates
> that depend on an assumption of uniform distribution of sought-after rows
> among those encountered by a scan.  This is usually where bad plans for
> LIMIT queries are coming from.  We could certainly add some sort of fudge
> factor to those costs, but I'd like to have a more-or-less principled
> framework for doing so.

I think the problem is, in some sense, more basic than that.  I think
the kind of query we're talking about here is:

SELECT * FROM foo WHERE unlikely ORDER BY indexed_column LIMIT 1

Assume for the sake of argument that there are 100 rows that would be
returned in the absence of the limit.  Let SC and TC be the startup
cost and total cost of the index scan.  As a matter of general policy,
we're going to say that the cost of this is SC + 0.01 * (TC - SC).
What makes this path look appealing to the planner is that SC is small
relative to TC.  If we knew, for example, that we weren't going to
find the first match until 90% of the way through the index scan, then
we could set SC = 90% * TC and, all else being equal, the planner
would make the right decision.
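
To make the arithmetic concrete, here's a toy standalone calculation of
that costing (invented numbers, not the planner's actual code):

/* Toy calculation: why a small startup cost makes the LIMIT path look
 * cheap relative to the full scan. */
#include <stdio.h>

int main(void)
{
    double sc = 5.0;        /* startup cost of the index scan */
    double tc = 100000.0;   /* total cost of the full index scan */
    double rows = 100.0;    /* rows the scan would return without LIMIT */
    double limit = 1.0;     /* LIMIT 1 */

    /* SC + 0.01 * (TC - SC): pay startup, then only the fraction of the
     * run cost corresponding to the rows we actually fetch */
    double cost = sc + (limit / rows) * (tc - sc);

    printf("LIMIT path cost = %.2f vs. full scan cost = %.2f\n", cost, tc);
    return 0;
}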

So you might think that the problem here is that we're assuming
uniform density.  Let's say there are a million rows in the table, and
there are 100 that match our criteria, so the first one is going to
happen 1/10,000'th of the way through the table.  Thus we set SC =
0.0001 * TC, and that turns out to be an underestimate if the
distribution isn't as favorable as we're hoping.  However, that is NOT
what we are doing.  What we are doing is setting SC = 0.  I mean, not
quite 0, but yeah, effectively 0.  Essentially, no matter how selective
the filter condition may be, we assume that it will match *the very
first row*.
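
Reusing the same formula, a toy comparison (again invented numbers, not
planner code) of what the estimate looks like under different
assumptions about where the first matching row sits:

#include <stdio.h>

static double limit_cost(double sc, double tc, double fraction_fetched)
{
    /* SC + fraction * (TC - SC), as above */
    return sc + fraction_fetched * (tc - sc);
}

int main(void)
{
    double tc = 100000.0;      /* total cost of the full index scan */
    double frac = 1.0 / 100.0; /* LIMIT 1 out of ~100 matching rows */

    /* what we effectively do today: SC ~ 0, i.e. the very first row is
     * assumed to match */
    printf("SC ~ 0:         %10.2f\n", limit_cost(0.0, tc, frac));

    /* uniform-density assumption: 100 matches among 1,000,000 rows, so
     * the first match is ~1/10,000th of the way in */
    printf("SC = TC/10000:  %10.2f\n", limit_cost(tc / 10000.0, tc, frac));

    /* unfavourable distribution: first match 90% of the way through */
    printf("SC = 0.9 * TC:  %10.2f\n", limit_cost(0.9 * tc, tc, frac));

    return 0;
}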

 

Can't we reuse the same histogram the planner already maintains for this? I mean, AFAIK, the heuristic is that the histogram divides the column's values into (roughly) equal-frequency buckets; we find the bucket in which our predicate value lies, then take some fraction of that bucket plus all of the buckets before it, right?

So, for a query like SELECT * FROM table WHERE a > 10, we would find the bucket that 10 lies in, right?

Now, why can't we sum up the estimates for all the buckets before the bucket in which our value is present? Wouldn't that estimate give us the fraction of tuples expected to come before the first matching row?

It's pretty wild, but I wanted to know whether my understanding of this scenario is correct.
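
Roughly, here is a toy sketch of what I have in mind (invented names and
numbers, nothing like the actual selectivity code), reading
equi-frequency histogram bounds the way pg_stats.histogram_bounds
exposes them:

#include <stdio.h>

/* Estimate the fraction of rows that fall below constval, i.e. for
 * "WHERE a > constval" scanned in index order on a, roughly how far
 * into the scan the first match should appear. */
static double frac_below(const double *bounds, int nbounds, double constval)
{
    int nbuckets = nbounds - 1;     /* N+1 bounds delimit N buckets */
    int i;

    if (constval <= bounds[0])
        return 0.0;
    if (constval >= bounds[nbounds - 1])
        return 1.0;

    for (i = 0; i < nbuckets; i++)
    {
        if (constval < bounds[i + 1])
        {
            /* whole buckets below, plus a linear share of this bucket */
            double partial = (constval - bounds[i]) /
                             (bounds[i + 1] - bounds[i]);

            return ((double) i + partial) / nbuckets;
        }
    }
    return 1.0;                     /* not reached */
}

int main(void)
{
    /* made-up bounds: 4 equi-frequency buckets over values 0..1000 */
    double bounds[] = {0, 9, 40, 300, 1000};

    /* WHERE a > 10: estimated fraction of tuples before the first match */
    printf("fraction before first match: %.3f\n",
           frac_below(bounds, 5, 10.0));
    return 0;
}

With those made-up bounds, a bit over a quarter of the table would be
estimated to come before the first match, and that fraction could feed
into SC instead of the effectively-zero startup assumption described
above.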

Regards,

Atri

--
Regards,
 
Atri
l'apprenant
