Thread: row estimation off the mark when generate_series calls are involved
Another email which went into the wilderness when I sent it to pgsql-patches.

Regards,
Nikhils

---------- Forwarded message ----------
From: Nikhil Sontakke <nikhil.sontakke@enterprisedb.com>
Date: Fri, Apr 16, 2010 at 6:50 PM
Subject: row estimation off the mark when generate_series calls are involved
To: pgsql-patches@postgresql.org

Hi,

I observed the following behavior on PG head:

postgres=# create table x(x int);
CREATE TABLE

postgres=# explain verbose insert into public.x values (generate_series(1,10));
------------------------------------------------
 Insert  (cost=0.00..0.01 rows=1 width=0)

postgres=# explain verbose insert into public.x values (generate_series(1,1000));
------------------------------------------------
 Insert  (cost=0.00..0.01 rows=1 width=0)

So even though generate_series has a prorows value of 1000 (why did we pick this value -- just a guesstimate, I suppose?), its effects are not reflected in the plan at all.

I think that at the place where we set the targetlist of the result_plan to sub_tlist, immediately afterwards we should update the plan_rows estimate by walking this latest targetlist. I did that, and now we seem to get proper row estimates.

Comments?

Regards,
Nikhils
--
http://www.enterprisedb.com
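For readers outside the planner code, the proposed fix can be sketched abstractly. The following is a minimal Python model of the idea -- walk the just-installed targetlist and scale plan_rows by the row estimates of any set-returning functions (SRFs) it contains. All names here are hypothetical illustrations, not the actual C code; the only facts assumed from the thread are that SRF estimates come from pg_proc's prorows (1000 for generate_series) and that the unpatched plan reports rows=1.

```python
# Hypothetical model of the proposed fix: after the planner assigns
# the result plan's targetlist, walk it and scale plan_rows by the
# estimated output of any set-returning functions (SRFs).
# A targetlist entry is modeled as a dict; "prorows" stands in for
# the pg_proc.prorows estimate (1000 for generate_series).

def srf_rows(tle):
    """Estimated rows this targetlist entry emits per input row."""
    if tle.get("is_srf"):
        return tle.get("prorows", 1000.0)
    return 1.0

def adjust_plan_rows(plan_rows, targetlist):
    # When several entries are SRFs, one plausible combination rule
    # (an assumption of this sketch) is to take the maximum across
    # entries and multiply it into the existing estimate -- note
    # multiply, not add, so a single base row stays a no-op.
    tlist_rows = max((srf_rows(tle) for tle in targetlist), default=1.0)
    return plan_rows * tlist_rows

# INSERT ... VALUES (generate_series(1,1000)): one base row, one SRF.
tlist = [{"is_srf": True, "prorows": 1000.0}]
print(adjust_plan_rows(1.0, tlist))  # 1000.0, not the unadjusted 1
```

With no SRF in the targetlist the estimate is left untouched, which is why only the generate_series plans change.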
Nikhil Sontakke <nikhil.sontakke@enterprisedb.com> wrote:

> I observed the following behavior on PG head:
> postgres=# explain verbose insert into public.x values (generate_series(1,10));
> ------------------------------------------------
>  Insert  (cost=0.00..0.01 rows=1 width=0)

Hmmm, there are differences between "SELECT SRF()" and "SELECT * FROM SRF()":

postgres=# EXPLAIN INSERT INTO public.x SELECT generate_series(1,10);
                    QUERY PLAN
------------------------------------------------
 Insert  (cost=0.00..0.02 rows=1 width=4)
   ->  Result  (cost=0.00..0.01 rows=1 width=0)

postgres=# EXPLAIN INSERT INTO public.x SELECT * FROM generate_series(1,10);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Insert  (cost=0.00..10.00 rows=1000 width=4)
   ->  Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=4)

> I think the place where we set the
> targetlist of the result_plan to sub_tlist, immediately after that we
> should update the plan_rows estimate by walking this latest
> targetlist. I did that and now we seem to get proper row estimates.

I agree the estimation should be improved, but your patch *adds* the estimated number of rows to the result:

postgres=# EXPLAIN INSERT INTO public.x SELECT generate_series(1,10);
                     QUERY PLAN
---------------------------------------------------
 Insert  (cost=0.00..12.52 rows=1002 width=4)
   ->  Result  (cost=0.00..2.51 rows=1001 width=0)

Should it be 1000 rather than 1001?

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
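The 1001/1002 figures in that plan point at the bug: the patch appears to add the SRF's prorows estimate to the existing row count instead of multiplying by it. A tiny illustration of the arithmetic (the numbers below are the ones from the quoted EXPLAIN output; nothing else is assumed):

```python
# The sum-vs-product thinko: the Result node produces 1 base row,
# and generate_series is estimated at prorows = 1000 rows per input
# row (its pg_proc default, regardless of the actual arguments).
base_rows = 1
srf_prorows = 1000

buggy = base_rows + srf_prorows      # what the patch computes
intended = base_rows * srf_prorows   # what a per-input-row multiplier gives

print(buggy, intended)  # 1001 1000
```

The "rows=1001" in the Result node matches the additive form, hence the question of whether it should be 1000 instead.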
Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> writes:
> Nikhil Sontakke <nikhil.sontakke@enterprisedb.com> wrote:
>> I think the place where we set the
>> targetlist of the result_plan to sub_tlist, immediately after that we
>> should update the plan_rows estimate by walking this latest
>> targetlist. I did that and now we seem to get proper row estimates.

> I agree the estimation should be improved, but your patch *adds*
> the estimated number of rows to the result:

I'm not very impressed with that patch, even discounting the sum-vs-product thinko. Trawling the tlist for SRFs will add a significant number of cycles, to modify the rowcount in a way that is practically always wrong (since the estimates for SRF output rowcounts are so bad). What's more, most of the time we don't really care, because the top-level rowcount estimate is of no interest for future planning purposes. It might be worth doing something about this inside sub-selects, but not till we have less-bogus SRF rowcount estimates.

BTW, another reason for not being excited about this is that someday we ought to disallow SRFs in the tlist altogether --- once we have LATERAL, which might happen in 9.1, that will be a much more semantically consistent way of getting the desired behavior.

			regards, tom lane
Hi,

> I'm not very impressed with that patch, even discounting the
> sum-vs-product thinko. Trawling the tlist for SRFs will add a significant
> number of cycles, to modify the rowcount in a way that is practically
> always wrong (since the estimates for SRF output rowcounts are so bad).

It's trawling *just* the tlist and not the entire query. Given that rowcount estimates are way off the mark for SRF targets, it seems worth the cycles. It's not as if sane queries will have so many tlist entries as to make this a costly affair, IMO.

> What's more, most of the time we don't really care, because the
> top-level rowcount estimate is of no interest for future planning
> purposes. It might be worth doing something about this inside
> sub-selects, but not till we have less-bogus SRF rowcount estimates.

But SRF rowcount estimates will generally be bogus, no? Unless we come up with a mechanism to gather argument-based statistics at plan time, that will still be the case.

To mention Itagaki-san's example again:

INSERT INTO public.x SELECT generate_series(1,1000);

If we have valid row estimates, we might choose a better plan -- for example, materializing the SELECT portion into a temp table and inserting from that into the target. The performance might be much better in that case.

Regards,
Nikhils
--
http://www.enterprisedb.com