Thread: Why does row estimation on nested loop make no sense to me

Why does row estimation on nested loop make no sense to me

From
Jeff Amiel
Date:
On most nested loops that I do explain/explain analyze on, the row estimation for the nested-loop itself is a product
ofthe inner nodes of the nested loop. 
However in this case, I am stumped!

explain
select  era.child_entity  from entity_rel era  join user_entity ue on ue.entity_id = era.parent_entity and
ue.user_id=12345

Nested Loop  (cost=0.00..2903.37 rows=29107 width=4)
  ->  Index Only Scan using entity_pk on user_entity ue  (cost=0.00..62.68 rows=2 width=4)
        Index Cond: (user_id = 10954)
  ->  Index Scan using rel_parent on entity_rel era  (cost=0.00..1261.85 rows=317 width=8)
        Index Cond: (parent_entity = ue.entity_id)


How can the estimated number of rows for the nested loop node EXCEED the product of the 2 row estimates of the tables
beingjoined? 
Not only does it exceed it - but it is orders of magnitude greater. 

Am I missing something obvious here?  I an see the nested loop row estimate being LESS but certainly not more.



PostgreSQL 9.2.4 on x86_64-pc-solaris2.10, compiled by gcc (GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 64-bit



Re: Why does row estimation on nested loop make no sense to me

From
Tom Lane
Date:
Jeff Amiel <becauseimjeff@yahoo.com> writes:
> How can the estimated number of rows for the nested loop node EXCEED the product of the 2 row estimates of the tables
beingjoined? 
> Not only does it exceed it - but it is orders of magnitude greater.

Can you provide a self-contained test case that does this?

            regards, tom lane


Re: Why does row estimation on nested loop make no sense to me

From
Amit Langote
Date:
On Sat, May 18, 2013 at 1:25 AM, Jeff Amiel <becauseimjeff@yahoo.com> wrote:
> On most nested loops that I do explain/explain analyze on, the row estimation for the nested-loop itself is a product
ofthe inner nodes of the nested loop. 
> However in this case, I am stumped!
>
> explain
> select  era.child_entity  from entity_rel era  join user_entity ue on ue.entity_id = era.parent_entity and
ue.user_id=12345
>
> Nested Loop  (cost=0.00..2903.37 rows=29107 width=4)
>   ->  Index Only Scan using entity_pk on user_entity ue  (cost=0.00..62.68 rows=2 width=4)
>         Index Cond: (user_id = 10954)
>   ->  Index Scan using rel_parent on entity_rel era  (cost=0.00..1261.85 rows=317 width=8)
>         Index Cond: (parent_entity = ue.entity_id)
>
>
> How can the estimated number of rows for the nested loop node EXCEED the product of the 2 row estimates of the tables
beingjoined? 
> Not only does it exceed it - but it is orders of magnitude greater.
>
> Am I missing something obvious here?  I an see the nested loop row estimate being LESS but certainly not more.
>

Can you also post the output of explain analyze <your-query>?


Re: Why does row estimation on nested loop make no sense to me

From
Jeff Amiel
Date:



> Can you provide a self-contained test case that does this?

That response scares me.
:)
I can try - Every other table set (small, easy to experiment with)  returns results as expected -
Is the implication that this looks 'unusual'?


Re: Why does row estimation on nested loop make no sense to me

From
Jeff Amiel
Date:



----- Original Message -----
From: Amit Langote <amitlangote09@gmail.com>
To: Jeff Amiel <becauseimjeff@yahoo.com>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Sent: Friday, May 17, 2013 11:37 AM
Subject: Re: [GENERAL] Why does row estimation on nested loop make no sense to me

On Sat, May 18, 2013 at 1:25 AM, Jeff Amiel <becauseimjeff@yahoo.com> wrote:
> On most nested loops that I do explain/explain analyze on, the row estimation for the nested-loop itself is a product
ofthe inner nodes of the nested loop. 
> However in this case, I am stumped!
>
> explain
> select  era.child_entity  from entity_rel era  join user_entity ue on ue.entity_id = era.parent_entity and
ue.user_id=12345
>
> Nested Loop  (cost=0.00..2903.37 rows=29107 width=4)
>   ->  Index Only Scan using entity_pk on user_entity ue  (cost=0.00..62.68 rows=2 width=4)
>         Index Cond: (user_id = 10954)
>   ->  Index Scan using rel_parent on entity_rel era  (cost=0.00..1261.85 rows=317 width=8)
>         Index Cond: (parent_entity = ue.entity_id)
>
>
> How can the estimated number of rows for the nested loop node EXCEED the product of the 2 row estimates of the tables
beingjoined? 
> Not only does it exceed it - but it is orders of magnitude greater.
>
> Am I missing something obvious here?  I an see the nested loop row estimate being LESS but certainly not more.
>

> Can you also post the output of explain analyze <your-query>?

I'm not worried about performance (per se) but the row estimation issue which propagates up as part of a bigger query. 
Buthere ya go: 

explain analyze
select  era.child_entity  from entity_rel era  join user_entity ue on ue.entity_id = era.parent_entity and
ue.user_id=12345

Nested Loop  (cost=0.00..2903.37 rows=29107 width=4) (actual time=0.028..0.274 rows=201 loops=1)
  ->  Index Only Scan using entity_pk on user_entity ue  (cost=0.00..62.68 rows=2 width=4) (actual time=0.011..0.012
rows=1loops=1) 
        Index Cond: (user_id = 12345)
        Heap Fetches: 1
  ->  Index Scan using rel_parent on entity_rel era  (cost=0.00..1261.85 rows=317 width=8) (actual time=0.013..0.164
rows=201loops=1) 
        Index Cond: (parent_entity = ue.entity_id)
Total runtime: 0.361 ms


Re: Why does row estimation on nested loop make no sense to me

From
Amit Langote
Date:
On Sat, May 18, 2013 at 1:47 AM, Jeff Amiel <becauseimjeff@yahoo.com> wrote:
>
>
>
>
> ----- Original Message -----
> From: Amit Langote <amitlangote09@gmail.com>
> To: Jeff Amiel <becauseimjeff@yahoo.com>
> Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
> Sent: Friday, May 17, 2013 11:37 AM
> Subject: Re: [GENERAL] Why does row estimation on nested loop make no sense to me
>
> On Sat, May 18, 2013 at 1:25 AM, Jeff Amiel <becauseimjeff@yahoo.com> wrote:
>> On most nested loops that I do explain/explain analyze on, the row estimation for the nested-loop itself is a
productof the inner nodes of the nested loop. 
>> However in this case, I am stumped!
>>
>> explain
>> select  era.child_entity  from entity_rel era  join user_entity ue on ue.entity_id = era.parent_entity and
ue.user_id=12345
>>
>> Nested Loop  (cost=0.00..2903.37 rows=29107 width=4)
>>   ->  Index Only Scan using entity_pk on user_entity ue  (cost=0.00..62.68 rows=2 width=4)
>>         Index Cond: (user_id = 10954)
>>   ->  Index Scan using rel_parent on entity_rel era  (cost=0.00..1261.85 rows=317 width=8)
>>         Index Cond: (parent_entity = ue.entity_id)
>>
>>
>> How can the estimated number of rows for the nested loop node EXCEED the product of the 2 row estimates of the
tablesbeing joined? 
>> Not only does it exceed it - but it is orders of magnitude greater.
>>
>> Am I missing something obvious here?  I an see the nested loop row estimate being LESS but certainly not more.
>>
>
>> Can you also post the output of explain analyze <your-query>?
>
> I'm not worried about performance (per se) but the row estimation issue which propagates up as part of a bigger
query. But here ya go: 
>
> explain analyze
> select  era.child_entity  from entity_rel era  join user_entity ue on ue.entity_id = era.parent_entity and
ue.user_id=12345
>
> Nested Loop  (cost=0.00..2903.37 rows=29107 width=4) (actual time=0.028..0.274 rows=201 loops=1)
>   ->  Index Only Scan using entity_pk on user_entity ue  (cost=0.00..62.68 rows=2 width=4) (actual time=0.011..0.012
rows=1loops=1) 
>         Index Cond: (user_id = 12345)
>         Heap Fetches: 1
>   ->  Index Scan using rel_parent on entity_rel era  (cost=0.00..1261.85 rows=317 width=8) (actual time=0.013..0.164
rows=201loops=1) 
>         Index Cond: (parent_entity = ue.entity_id)
> Total runtime: 0.361 ms

Have you tried analyze (it's probably a case of insufficient/outdated
statistics to planner's disposal) or probably consider changing
default_statistics_target?


--
Amit Langote


Re: Why does row estimation on nested loop make no sense to me

From
Jeff Amiel
Date:



> explain analyze
> select  era.child_entity  from entity_rel era  join user_entity ue on ue.entity_id = era.parent_entity and
ue.user_id=12345
>
> Nested Loop  (cost=0.00..2903.37 rows=29107 width=4) (actual time=0.028..0.274 rows=201 loops=1)
>   ->  Index Only Scan using entity_pk on user_entity ue  (cost=0.00..62.68 rows=2 width=4) (actual time=0.011..0.012
rows=1loops=1) 
>         Index Cond: (user_id = 12345)
>         Heap Fetches: 1
>   ->  Index Scan using rel_parent on entity_rel era  (cost=0.00..1261.85 rows=317 width=8) (actual time=0.013..0.164
rows=201loops=1) 
>         Index Cond: (parent_entity = ue.entity_id)
> Total runtime: 0.361 ms

>Have you tried analyze (it's probably a case of insufficient/outdated
>statistics to planner's disposal) or probably consider changing
>default_statistics_target?


Again - my question revolves not around the whether or not I am getting good or bad estimates - my question is related
tothe fact that the nested-loop row estimation does not appear to be derived from the nodes below it - it is off by
ordersof magnitude.  I've never seen this before. 
That aside, yes - I did analyze and tweak stats target during experimentation - no change.


Re: Why does row estimation on nested loop make no sense to me

From
Amit Langote
Date:
>> explain analyze
>> select  era.child_entity  from entity_rel era  join user_entity ue on ue.entity_id = era.parent_entity and
ue.user_id=12345
>>
>> Nested Loop  (cost=0.00..2903.37 rows=29107 width=4) (actual time=0.028..0.274 rows=201 loops=1)
>>   ->  Index Only Scan using entity_pk on user_entity ue  (cost=0.00..62.68 rows=2 width=4) (actual time=0.011..0.012
rows=1loops=1) 
>>         Index Cond: (user_id = 12345)
>>         Heap Fetches: 1
>>   ->  Index Scan using rel_parent on entity_rel era  (cost=0.00..1261.85 rows=317 width=8) (actual time=0.013..0.164
rows=201loops=1) 
>>         Index Cond: (parent_entity = ue.entity_id)
>> Total runtime: 0.361 ms


I noticed when the explain output in your first mail shows Index Cond:
(user_id = 10954) whereas your query says: ue.user_id=12345. Something
with that? Although, your explain analyze does show the same values at
both places with the row estimate being 29107 in both cases, which,
well, looks awful and quite unexpected though there seem to have been
similar observations before

>
>>Have you tried analyze (it's probably a case of insufficient/outdated
>>statistics to planner's disposal) or probably consider changing
>>default_statistics_target?
>
>
> Again - my question revolves not around the whether or not I am getting good or bad estimates - my question is
relatedto the fact that the nested-loop row estimation does not appear to be derived from the nodes below it - it is
offby orders of magnitude.  I've never seen this before. 
> That aside, yes - I did analyze and tweak stats target during experimentation - no change.

Did you also check select count(*) on both the relations and found
related numbers?

--
Amit Langote


Re: Why does row estimation on nested loop make no sense to me

From
Jeff Amiel
Date:



----- Original Message -----
From: Amit Langote <amitlangote09@gmail.com>
To: Jeff Amiel <becauseimjeff@yahoo.com>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Sent: Friday, May 17, 2013 2:21 PM
Subject: Re: [GENERAL] Why does row estimation on nested loop make no sense to me

>> explain analyze
>> select  era.child_entity  from entity_rel era  join user_entity ue on ue.entity_id = era.parent_entity and
ue.user_id=12345
>>
>> Nested Loop  (cost=0.00..2903.37 rows=29107 width=4) (actual time=0.028..0.274 rows=201 loops=1)
>>   ->  Index Only Scan using entity_pk on user_entity ue  (cost=0.00..62.68 rows=2 width=4) (actual time=0.011..0.012
rows=1loops=1) 
>>         Index Cond: (user_id = 12345)
>>         Heap Fetches: 1
>>   ->  Index Scan using rel_parent on entity_rel era  (cost=0.00..1261.85 rows=317 width=8) (actual time=0.013..0.164
rows=201loops=1) 
>>         Index Cond: (parent_entity = ue.entity_id)
>> Total runtime: 0.361 ms


>I noticed when the explain output in your first mail shows Index Cond:
>(user_id = 10954) whereas your query says: ue.user_id=12345. Something
>with that? Although, your explain analyze does show the same values at
>both places with the row estimate being 29107 in both cases, which,
>well, looks awful and quite unexpected though there seem to have been
>similar observations before

That was a weak attempt at hiding 'real' data - intended to change them all to 12345.
:)

>Did you also check select count(*) on both the relations and found
>related numbers?

Nothing related (that I could find)  on the rowcounts - one table has 20 million rows or so ad the other 65K.


Re: Why does row estimation on nested loop make no sense to me

From
Jeff Amiel
Date:
Ok - I agree -

Can somebody help me understand where the row estimates come from on a nested-loop operation in postgres then?



----- Original Message -----
From: hubert depesz lubaczewski <depesz@depesz.com>
To: Jeff Amiel <becauseimjeff@yahoo.com>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Sent: Saturday, May 18, 2013 3:39 AM
Subject: Re: [GENERAL] Why does row estimation on nested loop make no sense
 to me

Not sure if it helps, but it's apparently not a very rare thing.
Quick analysis on data from explain.depesz.com showed that > 12% of
plans with nested loop have such estimate.

Couple of examples:

http://explain.depesz.com/s/Qm4
http://explain.depesz.com/s/qmW
http://explain.depesz.com/s/qnG
http://explain.depesz.com/s/QO
http://explain.depesz.com/s/qov

...



Re: Why does row estimation on nested loop make no sense to me

From
Amit Langote
Date:
On Mon, May 20, 2013 at 11:01 PM, Jeff Amiel <becauseimjeff@yahoo.com> wrote:
> Ok - I agree -
>
> Can somebody help me understand where the row estimates come from on a nested-loop operation in postgres then?
>

In case you haven't noticed already in the documentation, there are
following lines:

"... It might appear from inspection of the EXPLAIN output that the
estimate of join rows comes from 50 * 1, that is, the number of outer
rows times the estimated number of rows obtained by each inner index
scan on tenk2. But this is not the case: *the join relation size is
estimated before any particular join plan has been considered*. If
everything is working well then the two ways of estimating the join
size will produce about the same answer, but due to roundoff error and
other factors they sometimes diverge significantly."

Read more at: http://www.postgresql.org/docs/9.2/static/row-estimation-examples.html

It also refers where in source code these table size estimations are done.

Hope this helps.


--
Amit Langote


Re: Why does row estimation on nested loop make no sense to me

From
Amit Langote
Date:
I also found one other discussion which has similar issues addressed:

http://postgresql.1045698.n5.nabble.com/Bogus-nestloop-rows-estimate-in-8-4-7-td5710254.html

--
Amit Langote


Re: Why does row estimation on nested loop make no sense to me

From
Jeff Amiel
Date:
Thanks much!
(sorry for top-posting, yahoo email sucks)




----- Original Message -----
From: Amit Langote <amitlangote09@gmail.com>
To: Jeff Amiel <becauseimjeff@yahoo.com>
Cc: "depesz@depesz.com" <depesz@depesz.com>; "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Sent: Monday, May 20, 2013 9:51 AM
Subject: Re: [GENERAL] Why does row estimation on nested loop make no sense to me

I also found one other discussion which has similar issues addressed:

http://postgresql.1045698.n5.nabble.com/Bogus-nestloop-rows-estimate-in-8-4-7-td5710254.html

--
Amit Langote



Re: Why does row estimation on nested loop make no sense to me

From
Amit Langote
Date:
On Tue, May 21, 2013 at 12:43 AM, Jeff Amiel <becauseimjeff@yahoo.com> wrote:
> Thanks much!
> (sorry for top-posting, yahoo email sucks)
>

I wonder if you could arrive at some conclusions with the statistics
(pg_stats) you have and the join selectivity formulas described in the
referred documentation link. I would like to know if you still get the
same row  estimates (after explain) and also if possible, the value
that is computed from that formula. Do they resemble each other?


--
Amit Langote