Thread: Introducing floating point cast into filter drastically changes row estimate

Introducing floating point cast into filter drastically changes row estimate

From: Merlin Moncure

I was chasing down a query that ran fine in 8.1 but had a
near-infinite runtime in 9.2.  It turned out to be from a bad filter
estimate that is surprisingly simple to reproduce:

postgres=# create table foo(i int);
CREATE TABLE
postgres=# insert into foo select 1000 + (v/200) from generate_series(1,5000) v;
INSERT 0 5000
postgres=# ANALYZE foo;
ANALYZE

The following query runs fine: it estimates the returned rows pretty well:
postgres=# explain analyze  select * from foo where i > 100 and i < 10000;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..98.00 rows=4999 width=4) (actual time=0.018..1.071 rows=5000 loops=1)
   Filter: ((i > 100) AND (i < 10000))
 Total runtime: 1.425 ms

...but if you introduce a floating point cast, it drastically changes
the row estimate (why?):

postgres=# explain analyze  select * from foo where i::float8 > 100
and i::float8 < 10000;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..123.00 rows=25 width=4) (actual time=0.022..1.566 rows=5000 loops=1)
   Filter: (((i)::double precision > 100::double precision) AND ((i)::double precision < 10000::double precision))

merlin

Re: Introducing floating point cast into filter drastically changes row estimate

From: Merlin Moncure

On Wed, Oct 24, 2012 at 10:06 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I was chasing down a query that ran fine in 8.1 but had a
> near-infinite runtime in 9.2.  It turned out to be from a bad filter
> estimate that is surprisingly simple to reproduce:

Testing some more, it turns out that this isn't really a bug -- it's
just a general property of expression-based filters: the planner
assumes they are very selective, when in this case they are not.

merlin

Re: Introducing floating point cast into filter drastically changes row estimate

From: Tom Lane

Merlin Moncure <mmoncure@gmail.com> writes:
> The following query runs fine: it estimates the returned rows pretty well:
> postgres=# explain analyze  select * from foo where i > 100 and i < 10000;

> ...but if you introduce a floating point cast, it drastically changes
> the row estimate (why?):

> postgres=# explain analyze  select * from foo where i::float8 > 100
> and i::float8 < 10000;

The planner has stats about "i", but none about "i::float8", so you're
getting a default estimate in the second case.  It does, however,
realize that you're applying a range restriction condition to
"i::float8", and the default selectivity estimate for that is
intentionally pretty small.
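
The default in question here is DEFAULT_RANGE_INEQ_SEL in selfuncs.h
(0.005, unless I'm misremembering), which is exactly where the rows=25
in your second plan comes from:

postgres=# select 5000 * 0.005;  -- rows in foo times the default range selectivity = 25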

            regards, tom lane

Re: Introducing floating point cast into filter drastically changes row estimate

From: Merlin Moncure

On Wed, Oct 24, 2012 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> The following query runs fine: it estimates the returned rows pretty well:
>> postgres=# explain analyze  select * from foo where i > 100 and i < 10000;
>
>> ...but if you introduce a floating point cast, it drastically changes
>> the row estimate (why?):
>
>> postgres=# explain analyze  select * from foo where i::float8 > 100
>> and i::float8 < 10000;
>
> The planner has stats about "i", but none about "i::float8", so you're
> getting a default estimate in the second case.  It does, however,
> realize that you're applying a range restriction condition to
> "i::float8", and the default selectivity estimate for that is
> intentionally pretty small.

Yeah -- I have a case where a large number of joins are happening that
have a lot of filtering based on expressions and things like that.  I
didn't write the SQL, but the characteristics are pretty typical for
code in this particular branch of the application.  Unfortunately, the
effects multiply (both in the where clause and in various joins)
and the row count estimates quickly degrade to 1, which is off by
orders of magnitude.  The planner then favors materialization and
nestloops, which leads to basically unbounded query times given that
the 'inner' scan is a seqscan.  Disabling nestloops fixes the issue,
but now the server favors hash joins (in this particular case it's ok,
but the hash memory usage is quite high).

(See here: http://explain.depesz.com/s/gmL for an example of a
real-world explain ... the "Seq Scan on yankee_bravo
(cost=0.000..727319.040 rows=14 width=71) (actual time=.. rows=
loops=)" returns thousands of rows, not 14.)
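
To be clear, by "disabling nestloops" I just mean the usual planner
override, scoped to the transaction so it doesn't leak into the rest
of the session -- roughly:

begin;
set local enable_nestloop = off;
-- problem query goes here
commit;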

I've been thinking about this all morning and I think there's a
fundamental problem here: the planner is using low-confidence
estimates to pick plans that should really only be used when the
estimates are relatively precise.  In particular, I think that, as a
broad assumption, rows pruned via default selectivity should be
capped -- say, to the lesser of 1000 or the greatest known value if
otherwise constrained.

merlin

Re: Introducing floating point cast into filter drastically changes row estimate

From: Tom Lane

Merlin Moncure <mmoncure@gmail.com> writes:
> Yeah -- I have a case where a large number of joins are happening that
> have a lot of filtering based on expressions and things like that.

Might be worth your while to install some indexes on those expressions,
if only to trigger collection of stats about them.
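
For the toy case upthread that would be something along these lines
(the index name is arbitrary); once ANALYZE has seen the indexed
expression, the estimate should land near the real 5000 rows:

postgres=# create index foo_i_float8_idx on foo ((i::float8));
CREATE INDEX
postgres=# analyze foo;
ANALYZE
postgres=# explain select * from foo where i::float8 > 100 and i::float8 < 10000;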

> I've been thinking about this all morning and I think there's a
> fundamental problem here: the planner is using low-confidence
> estimates to pick plans that should really only be used when the
> estimates are relatively precise.  In particular, I think that, as a
> broad assumption, rows pruned via default selectivity should be
> capped -- say, to the lesser of 1000 or the greatest known value if
> otherwise constrained.

I think that any such thing would probably just move the pain around.
As a recent example, just the other day somebody was bleating about
a poor rowcount estimate for a pattern match expression, which I suspect
was due to the arbitrary limit in patternsel() on how small a
selectivity it will believe.  I'd rather look for excuses to remove
those sorts of things than add more.

            regards, tom lane

Re: Introducing floating point cast into filter drastically changes row estimate

From: Merlin Moncure

On Wed, Oct 24, 2012 at 3:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> Yeah -- I have a case where a large number of joins are happening that
>> have a lot of filtering based on expressions and things like that.
>
> Might be worth your while to install some indexes on those expressions,
> if only to trigger collection of stats about them.

Not practical -- these expressions are all about 'outlier culling'.
It's just wasteful to materialize indexes for statistical purposes only.
Anyways, in this case, I just refactored the query into a CTE.
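
Roughly speaking, the idea was just to push the expression-heavy
culling into the CTE -- here's a toy version of the shape against the
foo table from upthread (not the real query, obviously).  The CTE is
an optimization fence (planned separately and materialized), so the
filter the planner can't estimate is evaluated once up front rather
than being folded into the larger join problem:

with culled as (
    -- outlier culling on an expression the planner has no stats for
    select i from foo
    where i::float8 > 100 and i::float8 < 10000
)
select count(*)
from culled c
join foo f on f.i = c.i;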

Hm -- what if you could flag a table-dependent expression as
interesting for statistics?  Or what about planner hints for boolean
expressions (ducks) ... 'likely(boolexpr)'?

merlin

Re: Introducing floating point cast into filter drastically changes row estimate

From: Merlin Moncure

On Wed, Oct 24, 2012 at 3:51 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Oct 24, 2012 at 3:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Merlin Moncure <mmoncure@gmail.com> writes:
>>> Yeah -- I have a case where a large number of joins are happening that
>>> have a lot of filtering based on expressions and things like that.
>>
>> Might be worth your while to install some indexes on those expressions,
>> if only to trigger collection of stats about them.
>
> Not practical -- these expressions are all about 'outlier culling'.
> It's just wasteful to materialize indexes for statistical purposes only.
>  Anyways, in this case, I just refactored the query into a CTE.
>
> Hm -- what if you could flag a table-dependent expression as
> interesting for statistics?  Or what about planner hints for boolean
> expressions (ducks) ... 'likely(boolexpr)'?

By the way, just in case it wasn't obvious, that was very much a
tongue-in-cheek suggestion.  I was just hoping that the final
disposition of this problem isn't: 'there are some queries that are
impossible to plan correctly'.  Anyways, thanks for the help.

merlin

Re: Introducing floating point cast into filter drastically changes row estimate

From: Merlin Moncure

On Wed, Oct 24, 2012 at 5:40 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Oct 24, 2012 at 3:51 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Wed, Oct 24, 2012 at 3:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Merlin Moncure <mmoncure@gmail.com> writes:
>>>> Yeah -- I have a case where a large number of joins are happening that
>>>> have a lot of filtering based on expressions and things like that.
>>>
>>> Might be worth your while to install some indexes on those expressions,
>>> if only to trigger collection of stats about them.
>>
>> Not practical -- these expressions are all about 'outlier culling'.
>> It's just wasteful to materialize indexes for statistical purposes only.
>>  Anyways, in this case, I just refactored the query into a CTE.

Apologies for blabbing, but I was wondering if a solution to this
problem might be to have the planner identify low-cost/high-impact
scenarios in which it could simply run some of the stored statistical
values through the qualifying stable expressions, particularly when
the input variables are constant or sourced from a single table.
Over the years the planner has been getting very precise in terms of
algorithm choice, and this is making the cost of a statistics miss
increasingly dangerous -- a trend which I think has been reflected in
regression reports on -performance.

merlin