Thread: BUG #5885: Strange rows estimation for left join
The following bug has been logged online:

Bug reference:      5885
Logged by:          Maxim Boguk
Email address:      Maxim.Boguk@gmail.com
PostgreSQL version: 8.4.4
Operating system:   Linux
Description:        Strange rows estimation for left join
Details:

I found this strange effect while helping with a slow query on a Russian
PostgreSQL online forum. In short, the results look like:

explain analyze select * from references rs left join references vm
on vm.reference = rs.reference and vm.attr_id = 20084
where rs.object_id = 9129863193713091717;

"Nested Loop Left Join  (cost=0.00..1567.71 rows=2129718 width=112) (actual time=14.654..14.672 rows=2 loops=1)"
"  ->  Index Scan using xif02references on references rs  (cost=0.00..19.85 rows=11 width=56) (actual time=0.019..0.020 rows=2 loops=1)"
"        Index Cond: (object_id = 9129863193713091717::numeric)"
"  ->  Index Scan using xif01references on references vm  (cost=0.00..140.03 rows=55 width=56) (actual time=7.321..7.321 rows=0 loops=2)"
"        Index Cond: ((vm.reference = rs.reference) AND (vm.attr_id = 20084::numeric))"
"Total runtime: 14.758 ms"

That is a six-orders-of-magnitude error in the estimated result rows
(2129718 vs 2).

Table description:

CREATE TABLE references (
    attr_id   numeric(20,0) NOT NULL,
    reference numeric(20,0) NOT NULL,
    object_id numeric(20,0) NOT NULL
);

CREATE INDEX xif01references ON references USING btree (reference, attr_id);
CREATE INDEX xif02references ON references USING btree (object_id, attr_id, reference);

Yes, unfortunately, an EAV model again.
Related data from pg_stats:

schemaname:        public
tablename:         references

attname:           attr_id
null_frac:         0
avg_width:         10
n_distinct:        23
most_common_vals:  {20084;20085;9127479646713033746;11;14;15;9127479093313033036;457955;9127306545213993525}
most_common_freqs: {0.234067;0.233433;0.227433;0.0998333;0.0476667;0.0449;0.0361;0.0273333;0.0247}

attname:           reference
null_frac:         0
avg_width:         14
n_distinct:        56252
most_common_vals:  {9129830887313872119;9129830887313872121;9129787365613945251;9129676282313943149;24332313945759;...}
most_common_freqs: {0.2497;0.0138333;0.0002;0.000166667;0.000166667;...}

attname:           object_id
null_frac:         0
avg_width:         15
n_distinct:        1.23744e+06
most_common_vals:  {9129846527513534962;9129846051413838763;9129846154413978095;9129846403613302858;9129846025513792413;...}
most_common_freqs: {0.0003;0.000266667;0.000266667;0.000233333;0.0002;0.0002;...}

What seems very strange to me is the final estimate of 2M rows, when the
planner estimated 11 rows on one side of the join and 55 rows on the other
(so the final row count should be at most 55*11 ~ 600 rows, even in the
worst case of left-join selectivity = 1).

Maybe that effect is already fixed in 8.4.5+ (but I have read the change
lists since 8.4.4 and did not find any related changes).

PS: Sorry if I have completely misunderstood the planner's mechanics for
left-join row estimation.
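[Editorial aside: the gap between 55*11 and 2.1M can be roughly reconstructed from the pg_stats shown above. The sketch below is my own arithmetic in Python, not the planner's actual eqjoinsel code; it only illustrates how a skewed MCV list inflates equi-join selectivity far beyond the 1/n_distinct a uniform distribution would give.]

```python
# Approximate selectivity of "reference = reference" in a self-join,
# using the MCV frequencies reported by pg_stats above. With a heavily
# skewed distribution, the chance that two random rows share a value is
# dominated by the squared frequencies of the most common values.
mcv_freqs = [0.2497, 0.0138333, 0.0002, 0.000166667, 0.000166667]
n_distinct = 56252

skewed_sel = sum(f * f for f in mcv_freqs)   # ~0.0625, led by 0.2497^2
uniform_sel = 1 / n_distinct                 # ~0.0000178

ratio = skewed_sel / uniform_sel             # ~3500x
print(round(skewed_sel, 4), round(ratio))
```

Multiplying the naive 11 * 55 ~ 600-row product by that ~3500x skew factor lands near the reported 2129718-row estimate, which hints that the join-level estimate is driven by the column's overall value distribution rather than by the per-scan row counts.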
"Maxim Boguk" <Maxim.Boguk@gmail.com> writes: > I found that strange effect while helping with slow query on russian > postgresql online forum. Please try to put together a self-contained test case for this. I could not reproduce such a weird result here, but that probably just means there's something strange about your data distribution. regards, tom lane
Hi.

The test case looks like:

create table "references" (
    attr_id   integer,
    reference integer,
    object_id integer
);

insert into "references"
    select 100*(random()), 100000*(random()^10), 1000000*(random())
    from generate_series(1,10000000);

create index xif01references on "references" ( reference, attr_id );
create index xif02references on "references" ( object_id, attr_id, reference );

analyze "references";

explain select * from "references" rs left join "references" vm
on vm.reference = rs.reference and vm.attr_id = 10
where rs.object_id = 1000;

explain analyze select * from "references" rs left join "references" vm
on vm.reference = rs.reference and vm.attr_id = 10
where rs.object_id = 1000;

On my system (8.4.4) it produces these results:

postgres=# explain select * from "references" rs left join "references" vm
on vm.reference = rs.reference and vm.attr_id = 10 where rs.object_id = 1000;
                                           QUERY PLAN
-----------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..7.53 rows=107283 width=24)
   ->  Index Scan using xif02references on "references" rs  (cost=0.00..0.58 rows=11 width=12)
         Index Cond: (object_id = 1000)
   ->  Index Scan using xif01references on "references" vm  (cost=0.00..0.53 rows=8 width=12)
         Index Cond: ((vm.reference = rs.reference) AND (vm.attr_id = 10))

(again 11 rows * 8 rows <<< 107283 rows)

postgres=# explain analyze select * from "references" rs left join "references" vm
on vm.reference = rs.reference and vm.attr_id = 10 where rs.object_id = 1000;
                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..7.53 rows=107283 width=24) (actual time=0.077..733.810 rows=117011 loops=1)
   ->  Index Scan using xif02references on "references" rs  (cost=0.00..0.58 rows=11 width=12) (actual time=0.036..0.079 rows=10 loops=1)
         Index Cond: (object_id = 1000)
   ->  Index Scan using xif01references on "references" vm  (cost=0.00..0.53 rows=8 width=12) (actual time=0.028..37.242 rows=11701 loops=10)
         Index Cond: ((vm.reference = rs.reference) AND (vm.attr_id = 10))

On Tue, Feb 15, 2011 at 4:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Maxim Boguk" <Maxim.Boguk@gmail.com> writes:
> > I found that strange effect while helping with slow query on russian
> > postgresql online forum.
>
> Please try to put together a self-contained test case for this.
> I could not reproduce such a weird result here, but that probably
> just means there's something strange about your data distribution.
>
>                        regards, tom lane

--
Maxim Boguk
Senior Postgresql DBA.

Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com
LinkedIn profile: http://nz.linkedin.com/in/maximboguk
If they can send one man to the moon... why can't they send them all?

МойКруг: http://mboguk.moikrug.ru/
"Strength breaks straw, but not everything in our life is straw, and strength is far from everything." (Russian proverb)
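[Editorial aside: the skew this synthetic test case produces can be checked in isolation. The Python sketch below is my own illustration, using a 100k-row sample rather than the 10M rows above, of how the expression 100000*(random()^10) concentrates reference values onto a handful of integers.]

```python
import random

random.seed(42)
N = 100_000
# Mimic the test case's column expression 100000*(random()^10),
# truncated to an integer as PostgreSQL's cast to integer would round it.
refs = [int(100_000 * random.random() ** 10) for _ in range(N)]

# P(100000 * u^10 < 1) = P(u < 10**-0.5) ~ 0.316, so the single value 0
# holds roughly a third of the whole table.
top_freq = refs.count(0) / N
distinct = len(set(refs))
print(round(top_freq, 3), distinct)
```

So a "random" rs.reference is a poor model for this data: any given object_id row is far more likely to carry one of the very common reference values, which is why the inner index scan's rows=8 badly undershoots the actual rows=11701 per loop.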
Maxim Boguk <maxim.boguk@gmail.com> writes:
> Test case looks like:

> create table "references" ( attr_id integer, reference integer,
> object_id integer );

> insert into "references" select 100*(random()),
> 100000*(random()^10), 1000000*(random()) from
> generate_series(1,10000000);

> create index xif01references on "references" ( reference, attr_id );
> create index xif02references on "references" ( object_id, attr_id, reference );

> analyze "references";

> explain select * from "references" rs left join "references" vm on
> vm.reference = rs.reference and vm.attr_id = 10 where rs.object_id =
> 1000;

I don't believe there's actually anything very wrong here. The
large-looking estimate for the join size is not out of line: if you try
different values for object_id you will find that some produce more rows
than that and some produce less. If we had cross-column stats we could
maybe derive a better estimate, but as-is you're getting an estimate
that is probably about right on the average, depending on whether the
particular object_id matches to more common or less common reference
values.

The thing that looks funny is that the inner indexscan rowcount estimate
is so small, which is because that's being done on the assumption that
the passed-in rs.reference value is random. It's not really --- it's
more likely to be one of the more common reference values --- which is
something that's correctly accounted for in the join size estimate but
not in the inner indexscan estimate.

			regards, tom lane
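[Editorial aside: to make the two estimates Tom contrasts concrete, here is a back-of-envelope Python sketch. The round numbers are my assumptions chosen to roughly match the test case's plan, and the formulas are a simplification, not the planner's actual code.]

```python
# Two different questions, two different estimates.
total_rows = 10_000_000
attr_sel = 1 / 100                       # attr_id = 10 is one of ~100 values
inner_rel_rows = total_rows * attr_sel   # ~100k rows with attr_id = 10
outer_rows = 11                          # planner's estimate for rs

# Per-scan estimate: treats the passed-in rs.reference as a *random*
# value, so it divides by n_distinct (hypothetical value chosen here to
# reproduce the plan's "rows=8").
n_distinct_ref = 12_500
inner_scan_rows = inner_rel_rows / n_distinct_ref     # = 8

# Join-size estimate: uses the full frequency distribution. With one
# value holding ~31.6% of the rows, selectivity ~ sum(f^2) ~ 0.316^2:
join_sel = 0.316 ** 2
join_rows = outer_rows * inner_rel_rows * join_sel    # ~110k, near 107283
print(int(inner_scan_rows), int(join_rows))
```

The two numbers disagree by design: the per-scan figure averages over all possible parameter values, while the join figure weights common values by how often they are actually joined against.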
On Wed, Feb 16, 2011 at 6:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maxim Boguk <maxim.boguk@gmail.com> writes:
> > Test case looks like:
>
> > create table "references" ( attr_id integer, reference integer,
> > object_id integer );
>
> > insert into "references" select 100*(random()),
> > 100000*(random()^10), 1000000*(random()) from
> > generate_series(1,10000000);
>
> > create index xif01references on "references" ( reference, attr_id );
> > create index xif02references on "references" ( object_id, attr_id,
> > reference );
>
> > analyze "references";
>
> > explain select * from "references" rs left join "references" vm on
> > vm.reference = rs.reference and vm.attr_id = 10 where rs.object_id =
> > 1000;
>
> I don't believe there's actually anything very wrong here. The
> large-looking estimate for the join size is not out of line: if you try
> different values for object_id you will find that some produce more rows
> than that and some produce less. If we had cross-column stats we could
> maybe derive a better estimate, but as-is you're getting an estimate
> that is probably about right on the average, depending on whether the
> particular object_id matches to more common or less common reference
> values.
>
> The thing that looks funny is that the inner indexscan rowcount estimate
> is so small, which is because that's being done on the assumption that
> the passed-in rs.reference value is random. It's not really --- it's
> more likely to be one of the more common reference values --- which is
> something that's correctly accounted for in the join size estimate but
> not in the inner indexscan estimate.
>
>                        regards, tom lane

Thank you very much for the answer.

Am I correct in my assumption that the estimated row counts on the two
sides of the join are unrelated to the estimated resulting row count of
the join, because they are calculated independently?
If that assumption is correct, then which values are used to choose
between a nested loop and a merge/hash join: the estimated resulting join
rows, or the estimated row counts on each side of the join? I'm asking
because in some cases these two values can lead to different plans.

PS: I just counted how many questions I have asked on the mailing lists
about the postgresql planner. Looks like it's time for me to dust off my
memory of C and start reading the planner code myself. :)

Kind Regards,
Maxim
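[Editorial aside: on the closing question, the usual answer is that join methods are compared on total estimated *cost*, which is built from the per-scan estimates, while the join's output row estimate is consumed by the plan nodes above it. The Python sketch below is a toy illustration with made-up per-tuple cost constants, not PostgreSQL's actual cost model.]

```python
# Per-scan estimates from the test case; cost constants are hypothetical.
outer_rows, inner_scan_rows, inner_rel_rows = 11, 8, 100_000

# Nested loop: repeats the parameterized inner index scan per outer row,
# so its cost scales with outer_rows times the (small) per-scan estimate.
nestloop_cost = outer_rows * (inner_scan_rows * 0.05)

# Hash join: must read the whole inner relation once to build the table,
# so its cost scales with the full inner relation size.
hashjoin_cost = (inner_rel_rows + outer_rows) * 0.01

cheaper = "nestloop" if nestloop_cost < hashjoin_cost else "hashjoin"

# Whichever method wins on cost, the join's *output* row estimate
# (107283 in the test case) is computed separately and is what any node
# above the join plans against.
print(cheaper)
```

This is also why the two numbers diverging matters in practice: an overly cheap per-scan estimate can make the nested loop look better than it really is, even when the join-level row estimate is sensible.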