Thread: row estimation off the mark when generate_series calls are involved
Another email which went into the wilderness when I sent it to pgsql-patches.

Regards,
Nikhils

---------- Forwarded message ----------
From: Nikhil Sontakke <nikhil.sontakke@enterprisedb.com>
Date: Fri, Apr 16, 2010 at 6:50 PM
Subject: row estimation off the mark when generate_series calls are involved
To: pgsql-patches@postgresql.org

Hi,

I observed the following behavior on PG head:

postgres=# create table x(x int);
CREATE TABLE

postgres=# explain verbose insert into public.x values (generate_series(1,10));
------------------------------------------------
 Insert  (cost=0.00..0.01 rows=1 width=0)

postgres=# explain verbose insert into public.x values (generate_series(1,1000));
------------------------------------------------
 Insert  (cost=0.00..0.01 rows=1 width=0)

So even though generate_series has a prorows value of 1000 (why did we pick this value -- just a guesstimate, I suppose?), its effects are not reflected in the plan at all.

I think that at the place where we set the targetlist of the result_plan to sub_tlist, immediately afterwards we should update the plan_rows estimate by walking this latest targetlist. I did that, and now we seem to get proper row estimates.

Comments?

Regards,
Nikhils
--
http://www.enterprisedb.com
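For readers outside the planner code, the proposed fix can be sketched abstractly. The following is a minimal Python model of the idea -- walk the just-installed targetlist and scale plan_rows by the row estimates of any set-returning functions (SRFs) it contains. All names here are hypothetical illustrations, not the actual C code; the only facts assumed from the thread are that SRF estimates come from pg_proc's prorows (1000 for generate_series) and that the unpatched plan reports rows=1.

```python
# Hypothetical model of the proposed fix: after the planner assigns
# the result plan's targetlist, walk it and scale plan_rows by the
# estimated output of any set-returning functions (SRFs).
# A targetlist entry is modeled as a dict; "prorows" stands in for
# the pg_proc.prorows estimate (1000 for generate_series).

def srf_rows(tle):
    """Estimated rows this targetlist entry emits per input row."""
    if tle.get("is_srf"):
        return tle.get("prorows", 1000.0)
    return 1.0

def adjust_plan_rows(plan_rows, targetlist):
    # When several entries are SRFs, one plausible combination rule
    # (an assumption of this sketch) is to take the maximum across
    # entries and multiply it into the existing estimate -- note
    # multiply, not add, so a single base row stays a no-op.
    tlist_rows = max((srf_rows(tle) for tle in targetlist), default=1.0)
    return plan_rows * tlist_rows

# INSERT ... VALUES (generate_series(1,1000)): one base row, one SRF.
tlist = [{"is_srf": True, "prorows": 1000.0}]
print(adjust_plan_rows(1.0, tlist))  # 1000.0, not the unadjusted 1
```

With no SRF in the targetlist the estimate is left untouched, which is why only the generate_series plans change.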
Nikhil Sontakke <nikhil.sontakke@enterprisedb.com> wrote:

> I observed the following behavior on PG head:
> postgres=# explain verbose insert into public.x values (generate_series(1,10));
> ------------------------------------------------
>  Insert  (cost=0.00..0.01 rows=1 width=0)

Hmmm, there are differences between "SELECT SRF()" and "SELECT * FROM SRF()":

postgres=# EXPLAIN INSERT INTO public.x SELECT generate_series(1,10);
                    QUERY PLAN
------------------------------------------------
 Insert  (cost=0.00..0.02 rows=1 width=4)
   ->  Result  (cost=0.00..0.01 rows=1 width=0)

postgres=# EXPLAIN INSERT INTO public.x SELECT * FROM generate_series(1,10);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Insert  (cost=0.00..10.00 rows=1000 width=4)
   ->  Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=4)

> I think the place where we set the
> targetlist of the result_plan to sub_tlist, immediately after that we
> should update the plan_rows estimate by walking this latest
> targetlist. I did that and now we seem to get proper row estimates.

I agree the estimation should be improved, but your patch *adds* the estimated number of rows to the result:

postgres=# EXPLAIN INSERT INTO public.x SELECT generate_series(1,10);
                     QUERY PLAN
---------------------------------------------------
 Insert  (cost=0.00..12.52 rows=1002 width=4)
   ->  Result  (cost=0.00..2.51 rows=1001 width=0)

Should it be 1000 rather than 1001?

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
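The 1001/1002 figures in that plan point at the bug: the patch appears to add the SRF's prorows estimate to the existing row count instead of multiplying by it. A tiny illustration of the arithmetic (the numbers below are the ones from the quoted EXPLAIN output; nothing else is assumed):

```python
# The sum-vs-product thinko: the Result node produces 1 base row,
# and generate_series is estimated at prorows = 1000 rows per input
# row (its pg_proc default, regardless of the actual arguments).
base_rows = 1
srf_prorows = 1000

buggy = base_rows + srf_prorows      # what the patch computes
intended = base_rows * srf_prorows   # what a per-input-row multiplier gives

print(buggy, intended)  # 1001 1000
```

The "rows=1001" in the Result node matches the additive form, hence the question of whether it should be 1000 instead.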
Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> writes:
> Nikhil Sontakke <nikhil.sontakke@enterprisedb.com> wrote:
>> I think the place where we set the
>> targetlist of the result_plan to sub_tlist, immediately after that we
>> should update the plan_rows estimate by walking this latest
>> targetlist. I did that and now we seem to get proper row estimates.

> I agree the estimation should be improved, but your patch *adds*
> the estimated number of rows to the result:

I'm not very impressed with that patch, even discounting the sum-vs-product thinko. Trawling the tlist for SRFs will add a significant number of cycles, to modify the rowcount in a way that is practically always wrong (since the estimates for SRF output rowcounts are so bad). What's more, most of the time we don't really care, because the top-level rowcount estimate is of no interest for future planning purposes. It might be worth doing something about this inside sub-selects, but not till we have less-bogus SRF rowcount estimates.

BTW, another reason for not being excited about this is that someday we ought to disallow SRFs in the tlist altogether --- once we have LATERAL, which might happen in 9.1, that will be a much more semantically consistent way of getting the desired behavior.

			regards, tom lane
Hi,

> I'm not very impressed with that patch, even discounting the
> sum-vs-product thinko. Trawling the tlist for SRFs will add a significant
> number of cycles, to modify the rowcount in a way that is practically
> always wrong (since the estimates for SRF output rowcounts are so bad).

It's trawling *just* the tlist and not the entire query. Given that rowcount estimates are way off the mark for SRF targets, it seems worth the cycles. It's not as if sane queries will have so many tlist entries as to make this a costly affair, IMO.

> What's more, most of the time we don't really care, because the
> top-level rowcount estimate is of no interest for future planning
> purposes. It might be worth doing something about this inside
> sub-selects, but not till we have less-bogus SRF rowcount estimates.

But SRF rowcount estimates will generally be bogus, no? Unless we come up with a mechanism to gather argument-based statistics at plan time, that will still be the case.

To mention Itagaki-san's example again:

INSERT INTO public.x SELECT generate_series(1,1000);

If we have valid row estimates, we might choose a better plan -- for example, materializing the SELECT portion into a temp table and inserting from that into the target. The performance might be much better in that case.

Regards,
Nikhils
--
http://www.enterprisedb.com