Thread: planner chooses suboptimal plan on joins with complex key

planner chooses suboptimal plan on joins with complex key

From
"Dmitry Potapov"
Date:
I've got two huge tables with a one-to-many relationship on a complex
key. There's also a view which JOINs the two tables, and the planner
chooses a suboptimal plan for SELECTs from this view.

The db schema is declared as follows (from now on, I skip the
insignificant columns for the sake of simplicity):

CREATE TABLE t1 (
   id integer NOT NULL,
   m1 integer NOT NULL DEFAULT 0,
   m2 bigint NOT NULL DEFAULT 0,
   m3 bigint NOT NULL DEFAULT 0,
   time_stamp timestamp without time zone DEFAULT now() NOT NULL,
[...skipped...]
);

CREATE TABLE t2 (
   id integer NOT NULL,
   m1 integer NOT NULL DEFAULT 0,
   m2 bigint NOT NULL DEFAULT 0,
   m3 bigint NOT NULL DEFAULT 0,
   time_stamp timestamp without time zone DEFAULT now() NOT NULL,
[...skipped...]
);

CREATE VIEW t1t2_view AS SELECT ..., t1.m1, t1.m2, t1.m3, t1.time_stamp
    FROM t1 JOIN t2 ON ((t1.m1 = t2.m1) AND (t1.m2 = t2.m2) AND (t1.m3 = t2.m3));

CREATE UNIQUE INDEX i_t1_ms ON t1(m1,m2,m3);
CREATE INDEX i_t1_ts ON t1(time_stamp);
CREATE INDEX i_t2_ms ON t2(m1,m2,m3);

Table t1 contains ~20M rows, t2 contains ~30M rows. The complex key
that ties one table to the other is implied, i.e. (m1,m2,m3) isn't
declared as a foreign key. There's a reason for that: the application
needs to push lots of INSERTs into these tables very quickly, and the
additional foreign key constraint check would kill performance.

So, here's the query in question:

SELECT * FROM t1t2_view ORDER BY time_stamp ASC LIMIT 100;

EXPLAIN ANALYZE SELECT * FROM t1t2_view ORDER BY time_stamp ASC LIMIT 100:

Limit  (cost=13403340.40..13403340.40 rows=1 width=152)
  ->  Sort  (cost=13403340.40..13403340.40 rows=1 width=152)
        Sort Key: t1.time_stamp
        ->  Merge Join  (cost=6663466.28..13403340.39 rows=1 width=152)
              Merge Cond: ((t1.m1 = t2.m1) AND (t1.m2 = t2.m2) AND (t1.m3 = t2.m3))
              ->  Index Scan using i_t1_ms on t1  (cost=0.00..6272009.52 rows=21639880 width=121)
              ->  Sort  (cost=6663466.28..6739884.33 rows=30567222 width=51)
                    Sort Key: t2.m1, t2.m2, t2.m3
                    ->  Seq Scan on t2  (cost=0.00..922814.22 rows=30567222 width=51)

When I set enable_sort and enable_mergejoin to off, the planner
chooses a better plan:

EXPLAIN ANALYZE SELECT * FROM t1t2_view ORDER BY time_stamp ASC LIMIT 100

 Limit  (cost=0.00..175299576.86 rows=1 width=152)
  ->  Nested Loop  (cost=0.00..175299576.86 rows=1 width=152)
        ->  Index Scan using i_t1_ts on t1  (cost=0.00..1106505.70 rows=21642342 width=121)
        ->  Index Scan using i_t2_ms on t2  (cost=0.00..8.03 rows=1 width=51)
              Index Cond: ((t1.m1 = t2.m1) AND (t1.m2 = t2.m2) AND (t1.m3 = t2.m3))

The problem here, as far as I understand, is a wrong estimate of the
row count in the join result.

PostgreSQL version is 8.2.5. The tables are ANALYZEd. Changing
default_statistics_target from 10 to 100, or even 300, doesn't affect
the planner's behaviour.

Is there any way to make the planner choose an optimal plan without
turning off enable_sort and enable_mergejoin?
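[Editor's note: one common workaround sketch, not from the thread itself, is to scope the plan-type switches to a single transaction with SET LOCAL, so only this one query is affected and other sessions keep the default planner behaviour:]

```sql
BEGIN;
-- Disable the misestimated merge-join/sort plan for this query only;
-- SET LOCAL reverts automatically at COMMIT or ROLLBACK.
SET LOCAL enable_mergejoin = off;
SET LOCAL enable_sort = off;
SELECT * FROM t1t2_view ORDER BY time_stamp ASC LIMIT 100;
COMMIT;
```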

Thanks in advance.


--
Regards,
            Dmitry

Re: planner chooses suboptimal plan on joins with complex key

From
"Guillaume Smet"
Date:
Dmitry,

On Jan 23, 2008 2:48 PM, Dmitry Potapov <fortune.fish@gmail.com> wrote:
> EXPLAIN ANALYZE SELECT * FROM t1t2_view ORDER BY time_stamp ASC LIMIT 100:
>
> Limit  (cost=13403340.40..13403340.40 rows=1 width=152)

It doesn't look like an EXPLAIN ANALYZE output. Can you provide a real
one (you should have a second set of numbers with EXPLAIN ANALYZE)?

Thanks,

--
Guillaume

Re: planner chooses suboptimal plan on joins with complex key

From
Tom Lane
Date:
"Guillaume Smet" <guillaume.smet@gmail.com> writes:
> It doesn't look like an EXPLAIN ANALYZE output. Can you provide a real
> one (you should have a second set of numbers with EXPLAIN ANALYZE)?

Also, could we see the pg_stats rows for the columns being joined?

            regards, tom lane

Re: planner chooses suboptimal plan on joins with complex key

From
Tom Lane
Date:
"Dmitry Potapov" <fortune.fish@gmail.com> writes:
> Sorry, it was just EXPLAIN. I can't run EXPLAIN ANALYZE on that
> (production) server, so I uploaded 3days old backup to a spare box and
> here's what I've got:

>          ->  Merge Join  (cost=0.00..4955790.28 rows=1 width=59) (actual time=0.048..4575782.472 rows=30805113 loops=1)
>                Merge Cond: ((t1.m1 = t2.m1) AND (t1.m2 = t2.m2) AND (t1.m3 = t2.m3))

Well, there's our problem: an estimate of 1 row for a join that's
actually 30805113 rows is uncool :-(.

It's hard to tell whether the planner is just being overoptimistic
about the results of ANDing the three join conditions, or if one or
more of the basic condition selectivities were misestimated.  Could
you try

    explain analyze select 1 from t1, t2 where t1.m1 = t2.m1;
    explain analyze select 1 from t1, t2 where t1.m2 = t2.m2;
    explain analyze select 1 from t1, t2 where t1.m3 = t2.m3;

and show the results?  This will probably be slow too, but we don't
care --- we just need to see the estimated and actual rowcounts.

            regards, tom lane

Re: planner chooses suboptimal plan on joins with complex key

From
"Dmitry Potapov"
Date:
Hello,

(Tom, sorry if you receive this letter twice. The first copy was
unintentionally sent with 'reply to sender only', so I'm resending it
to the list; please reply to this one to keep the thread.)

2008/1/25, Tom Lane <tgl@sss.pgh.pa.us>:
> Well, there's our problem: an estimate of 1 row for a join that's
> actually 30805113 rows is uncool :-(.
>
> It's hard to tell whether the planner is just being overoptimistic
> about the results of ANDing the three join conditions, or if one or
> more of the basic condition selectivities were misestimated.  Could
> you try
>
>         explain analyze select 1 from t1, t2 where t1.m1 = t2.m1;
>         explain analyze select 1 from t1, t2 where t1.m2 = t2.m2;
>         explain analyze select 1 from t1, t2 where t1.m3 = t2.m3;
>
> and show the results?  This will probably be slow too, but we don't
> care --- we just need to see the estimated and actual rowcounts.

I've indexed m1, m2, m3 in each table individually, to speed things up.

The first query is too slow. In fact, it's still running, for 4 days now:
=# select procpid, current_query, now()-query_start from pg_stat_activity;
 procpid |                       current_query                        |       ?column?
---------+------------------------------------------------------------+-----------------------
   11403 | explain analyze select 1 from t1, t2 where t1.m1 = t2.m1; | 4 days 06:11:52.18082

I wonder if it will ever finish the work :( So, for now, the only
thing I can show is:

=# explain select 1 from t1, t2 where t1.m1=t2.m1 ;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..162742993234.25 rows=57462242912003 width=0)
   ->  Seq Scan on t2  (cost=0.00..690784.54 rows=30820054 width=4)
   ->  Index Scan using i_t1_m1 on t1  (cost=0.00..3080.74 rows=175973 width=4)
         Index Cond: (t1.m1 = t2.m1)
(4 rows)

I'll post the EXPLAIN ANALYZE result as soon as it finishes. The
second and third queries are less work:

Result for m2 join:

=# explain analyze select 1 from t1, t2 where t1.m2=t2.m2 ;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=0.00..2390772.31 rows=32274433 width=0) (actual time=44.460..376892.633 rows=32466668 loops=1)
   Merge Cond: (t1.m2 = t2.m2)
   ->  Index Scan using i_t1_m2 on t1  (cost=0.00..861938.04 rows=21292688 width=8) (actual time=22.178..54862.030 rows=21292689 loops=1)
   ->  Index Scan using i_t2_m2 on t2  (cost=0.00..1023944.35 rows=30820054 width=8) (actual time=22.263..245649.669 rows=32481348 loops=1)
 Total runtime: 389871.753 ms
(5 rows)

Results for m3 join:

=# explain analyze select 1 from t1, t2 where t1.m3=t2.m3 ;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=3460701.40..6971662.01 rows=148454021 width=0) (actual time=292433.263..1269127.008 rows=170051982 loops=1)
   Merge Cond: (t2.m3 = t1.m3)
   ->  Index Scan using i_t2_m3 on t2  (cost=0.00..1207227.84 rows=30820054 width=8) (actual time=28.996..622876.390 rows=30820054 loops=1)
   ->  Sort  (cost=3460701.40..3513933.12 rows=21292688 width=8) (actual time=292404.240..426620.070 rows=170051989 loops=1)
         Sort Key: t1.m3
         ->  Seq Scan on t1  (cost=0.00..635040.88 rows=21292688 width=8) (actual time=0.031..65919.482 rows=21292689 loops=1)
 Total runtime: 1333669.966 ms
(7 rows)


--
Regards,
            Dmitry

Re: planner chooses suboptimal plan on joins with complex key

From
Tom Lane
Date:
"Dmitry Potapov" <fortune.fish@gmail.com> writes:
> 2008/1/25, Tom Lane <tgl@sss.pgh.pa.us>:
>> It's hard to tell whether the planner is just being overoptimistic
>> about the results of ANDing the three join conditions, or if one or
>> more of the basic condition selectivities were misestimated.

> I wonder if it will ever finish the work :( So, for now, the only
> thing I can show is:

> =# explain select 1 from t1, t2 where t1.m1=t2.m1 ;
>                                    QUERY PLAN
> --------------------------------------------------------------------------------
>  Nested Loop  (cost=0.00..162742993234.25 rows=57462242912003 width=0)
>    ->  Seq Scan on t2  (cost=0.00..690784.54 rows=30820054 width=4)
>    ->  Index Scan using i_t1_m1 on t1  (cost=0.00..3080.74 rows=175973 width=4)
>          Index Cond: (t1.m1 = t2.m1)
> (4 rows)

So this is estimating the join selectivity for m1 alone as

regression=# select 57462242912003 / (30820054.0 * 21292688);
        ?column?
------------------------
 0.08756260792962293676
(1 row)

which would suggest that there are only 11 or so distinct values for m1;
is that about right?

There's something a bit funny here --- looking at the plan components,
the output row estimate should be about 30820054 * 175973, but it's ten
times that.  I'm not sure if this is just a presentation artifact or
if it points to a real bug.

Could you show us the pg_stats rows for t1.m1 and t2.m1?

The m2 estimate seems nearly dead on, and the m3 estimate is within 10%
or so which is certainly close enough.  Barring the possibility that
there's something wrong underneath that m1 estimate, the products of
the selectivities do indeed work out to give us about 1 row out of the
join with all three conditions.  So my conclusion is that these three
values are not independent, but are highly correlated (in fact, it seems
that m2 alone provides most of the join selectivity).  Do you actually
need to compare all three to make the join, or is one field perhaps
dependent on the other two?  Can you refactor the data to make it so?
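[Editor's note: the "about 1 row" conclusion can be checked numerically; this is an illustrative sketch using the estimated rowcounts quoted earlier in the thread, with the planner's clamping simplified to "anything under one row shows as rows=1":]

```python
# Rowcounts taken from the single-column EXPLAIN outputs above.
n_t1, n_t2 = 21_292_688, 30_820_054
cross = n_t1 * n_t2  # size of the unconstrained cross join

sel_m1 = 57_462_242_912_003 / cross  # m1-only join estimate
sel_m2 = 32_274_433 / cross          # m2-only join estimate
sel_m3 = 148_454_021 / cross         # m3-only join estimate

# Treating the three conditions as independent, the planner multiplies
# the selectivities -- which collapses the estimate to under one row.
est_rows = cross * sel_m1 * sel_m2 * sel_m3
print(f"{est_rows:.2f}")  # < 1, hence the rows=1 in the plan
```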

            regards, tom lane

Re: planner chooses suboptimal plan on joins with complex key

From
"Dmitry Potapov"
Date:
2008/1/30, Tom Lane <tgl@sss.pgh.pa.us>:
> "Dmitry Potapov" <fortune.fish@gmail.com> writes:
> > 2008/1/25, Tom Lane <tgl@sss.pgh.pa.us>:
> >> It's hard to tell whether the planner is just being overoptimistic
> >> about the results of ANDing the three join conditions, or if one or
> >> more of the basic condition selectivities were misestimated.
>
> > I wonder if it will ever finish the work :( So, for now, the only
> > thing I can show is:
>
> > =# explain select 1 from t1, t2 where t1.m1=t2.m1 ;
> >                                    QUERY PLAN
> > --------------------------------------------------------------------------------
> >  Nested Loop  (cost=0.00..162742993234.25 rows=57462242912003 width=0)
> >    ->  Seq Scan on t2  (cost=0.00..690784.54 rows=30820054 width=4)
> >    ->  Index Scan using i_t1_m1 on t1  (cost=0.00..3080.74 rows=175973 width=4)
> >          Index Cond: (t1.m1 = t2.m1)
> > (4 rows)
>
> So this is estimating the join selectivity for m1 alone as
>
> regression=# select 57462242912003 / (30820054.0 * 21292688);
>         ?column?
> ------------------------
>  0.08756260792962293676
> (1 row)
>
> which would suggest that there are only 11 or so distinct values for m1;
> is that about right?
No, that's wrong: there are 155 distinct values in t1, and 157 in t2.
By the way, due to the nature of the data stored in m1 (see below),
the selectivity for m1 should, IMHO, be bigger than 0.08756, something
around 0.80 or so. Unfortunately I can't be sure about that: the query
in question is still running, so the real rowcount is still not
available :(

> There's something a bit funny here --- looking at the plan components,
> the output row estimate should be about 30820054 * 175973, but it's ten
> times that.  I'm not sure if this is just a presentation artifact or
> if it points to a real bug.
That looks strange, indeed.

> Could you show us the pg_stats rows for t1.m1 and t2.m1?
I attached them as text files to one of my previous letters. I'm sure
that email formatting would kill their readability, so I attached them
once again to this letter: pgstats_t1_m.txt is for t1,
pgstats_t2_m.txt for t2.

To save your time, some fields:
t1.m1:
n_distinct=121
correlation=0.0910695

t2.m1:
n_distinct=111
correlation=0.148101

> The m2 estimate seems nearly dead on, and the m3 estimate is within 10%
> or so which is certainly close enough.  Barring the possibility that
> there's something wrong underneath that m1 estimate, the products of
> the selectivities do indeed work out to give us about 1 row out of the
> join with all three conditions.  So my conclusion is that these three
> values are not independent, but are highly correlated (in fact, it seems
> that m2 alone provides most of the join selectivity).  Do you actually
So, about the nature of the data in m1, m2, m3: m1 is something like a
subsystem ID to which the row belongs; m3 is supposed to be a unique
record id within that subsystem, but it starts from 0 when the system
is restarted; m2 is a timestamp in Java format, which is there to
guarantee the uniqueness of the (m1,m2,m3) triplet.
As for correlation, m2 and m3 may indeed be correlated: if a system
inserts records at a nearly constant rate, m2 = a + b*m3.
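[Editor's note: the effect of such a functional dependency on join estimates is easy to simulate; this is a hypothetical toy model, not the poster's data, with m2 = a + b*m3 by construction and m1 dropped for brevity:]

```python
# Toy tables where m2 is a deterministic function of m3 (m2 = a + b*m3).
a, b = 100, 2
t1 = [(a + b * i, i) for i in range(0, 1000, 3)]   # (m2, m3) rows
t2 = [(a + b * i, i) for i in range(0, 1000, 5)]

# Join on both columns vs. join on m2 alone.
both = sum(1 for r1 in t1 for r2 in t2 if r1 == r2)
m2_only = sum(1 for (m2a, _) in t1 for (m2b, _) in t2 if m2a == m2b)

# Because m2 determines m3, the extra condition filters nothing, so
# multiplying per-column selectivities would badly underestimate.
print(both, m2_only)  # equal counts
```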
> need to compare all three to make the join, or is one field perhaps
> dependent on the other two?  Can you refactor the data to make it so?
Yes, I absolutely need to compare all three. As for refactoring, the
only thing I can think of is to put (m1,m2,m3) into a separate table
with a bigint primary key, and then tie t1 and t2 together with that
key. But that would require changes in the application and some amount
of time to convert the data :(


--
Regards,
            Dmitry