Thread: Encouraging multi-table join order

Encouraging multi-table join order

From: Dan Harris

I have a query that is intended to select from multiple "small tables"
to get a limited subset of "incidentid" and then join with a "very
large" table.  One of the operations will require a sequential scan, but
the planner is doing the scan on the very large table before joining the
small ones, resulting in a huge amount of disk I/O.  How would I make
this query join the large table only after narrowing down the possible
selections from the smaller tables?  This is running on version 8.0.3.

Thanks for any ideas.

-Dan


QUERY
########################################
 explain analyze
 select distinct
         eventmain.incidentid,
         eventmain.entrydate,
         eventgeo.eventlocation,
         recordtext as retdata
 from
         eventactivity
         join (
                 select
                         incidentid
                 from
                         k_h
                 where
                         id = 33396 and
                         k_h.entrydate >= '2006-1-1 00:00' and
                         k_h.entrydate < '2006-4-8 00:00'
         ) id_keywords using ( incidentid ),
         eventmain,
         eventgeo
 where
         eventmain.incidentid = eventactivity.incidentid and
         eventmain.incidentid = eventgeo.incidentid and
         ( ' ' || recordtext || ' ' like '%HAL%' ) and
         eventactivity.entrydate >= '2006-1-1 00:00' and
         eventactivity.entrydate < '2006-4-8 00:00'
 order by
         eventmain.entrydate
 limit 10000;



EXPLAIN ANALYZE OUTPUT
########################################
 Limit  (cost=2521191.65..2521191.90 rows=6 width=187) (actual time=1360935.787..1361072.277 rows=1400 loops=1)
   ->  Unique  (cost=2521191.65..2521191.90 rows=6 width=187) (actual time=1360935.779..1361067.853 rows=1400 loops=1)
         ->  Sort  (cost=2521191.65..2521191.66 rows=6 width=187) (actual time=1360935.765..1360958.258 rows=16211 loops=1)
               Sort Key: eventmain.entrydate, eventmain.incidentid, eventactivity.recordtext, eventgeo.eventlocation
               ->  Nested Loop  (cost=219.39..2521191.57 rows=6 width=187) (actual time=1123.115..1360579.798 rows=16211 loops=1)
                     ->  Nested Loop  (cost=219.39..2521173.23 rows=6 width=154) (actual time=1105.773..1325907.716 rows=16211 loops=1)
                           ->  Hash Join  (cost=219.39..2521153.37 rows=6 width=66) (actual time=1069.476..1289608.261 rows=16211 loops=1)
                                 Hash Cond: (("outer".incidentid)::text = ("inner".incidentid)::text)
                                 ->  Seq Scan on eventactivity  (cost=0.00..2518092.06 rows=1532 width=52) (actual time=57.205..1288514.530 rows=2621 loops=1)
                                       Filter: ((((' '::text || (recordtext)::text) || ' '::text) ~~ '%HAL%'::text) AND (entrydate >= '2006-01-01 00:00:00'::timestamp without time zone) AND (entrydate < '2006-04-08 00:00:00'::timestamp without time zone))
                                 ->  Hash  (cost=217.53..217.53 rows=741 width=14) (actual time=899.128..899.128 rows=0 loops=1)
                                       ->  Index Scan using k_h_id_idx on k_h  (cost=0.00..217.53 rows=741 width=14) (actual time=55.097..893.883 rows=1162 loops=1)
                                             Index Cond: (id = 33396)
                                             Filter: ((entrydate >= '2006-01-01 00:00:00'::timestamp without time zone) AND (entrydate < '2006-04-08 00:00:00'::timestamp without time zone))
                           ->  Index Scan using eventmain_incidentid_idx on eventmain  (cost=0.00..3.30 rows=1 width=88) (actual time=1.866..2.227 rows=1 loops=16211)
                                 Index Cond: ((eventmain.incidentid)::text = ("outer".incidentid)::text)
                     ->  Index Scan using eventgeo_incidentid_idx on eventgeo  (cost=0.00..3.04 rows=1 width=75) (actual time=1.770..2.126 rows=1 loops=16211)
                           Index Cond: ((eventgeo.incidentid)::text = ("outer".incidentid)::text)
 Total runtime: 1361080.787 ms
(19 rows)


Re: Encouraging multi-table join order

From: Tom Lane

Dan Harris <fbsd@drivefaster.net> writes:
> I have a query that is intended to select from multiple "small tables"
> to get a limited subset of "incidentid" and then join with a "very
> large" table.  One of the operations will require a sequential scan, but
> the planner is doing the scan on the very large table before joining the
> small ones, resulting in a huge amount of disk I/O.  How would I make
> this query join the large table only after narrowing down the possible
> selections from the smaller tables?  This is running on version 8.0.3.

That's very strange --- the estimated cost of the seqscan is high enough
that the planner should have chosen a nestloop with inner indexscan on
the big table.  I'm not sure about the join-order point, but the hash
plan for the first join seems wrong in any case.

Um, you do have an index on eventactivity.incidentid, right?  What's the
datatype(s) of the incidentid columns?  What happens to the plan if you
turn off enable_hashjoin and enable_mergejoin?
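(Concretely, that experiment would look something along these lines in a psql
session, re-running the same query; the settings only affect the current
session and can be put back with RESET:)

    set enable_hashjoin = off;
    set enable_mergejoin = off;
    explain analyze <the query above>;
    reset enable_hashjoin;
    reset enable_mergejoin;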

            regards, tom lane

Re: Encouraging multi-table join order

From: Dan Harris

Tom Lane wrote:
> That's very strange --- the estimated cost of the seqscan is high enough
> that the planner should have chosen a nestloop with inner indexscan on
> the big table.  I'm not sure about the join-order point, but the hash
> plan for the first join seems wrong in any case.
>
> Um, you do have an index on eventactivity.incidentid, right?  What's the
> datatype(s) of the incidentid columns?  What happens to the plan if you
> turn off enable_hashjoin and enable_mergejoin?
>
>             regards, tom lane
>
Yes, eventactivity.incidentid is indexed.  The datatype is varchar(40).
Although, by checking this, I noticed that k_h.incidentid was
varchar(100).  Perhaps the difference in length between the keys caused
the planner to not use the fastest method?  I have no defense as to why
those aren't the same... I will make them so and check.
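(Aligning the declared lengths would be something like the following; just a
sketch, and as Tom notes below the planner doesn't actually care about the
declared length:)

    alter table k_h alter column incidentid type varchar(40);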

Here's the EXPLAIN analyze with enable_hashjoin = off and
enable_mergejoin = off :

Limit  (cost=4226535.73..4226544.46 rows=698 width=82) (actual time=74339.016..74356.521 rows=888 loops=1)
   ->  Unique  (cost=4226535.73..4226544.46 rows=698 width=82) (actual time=74339.011..74354.073 rows=888 loops=1)
         ->  Sort  (cost=4226535.73..4226537.48 rows=698 width=82) (actual time=74339.003..74344.031 rows=3599 loops=1)
               Sort Key: eventmain.entrydate, eventmain.incidentid, eventgeo.eventlocation, eventactivity.recordtext
               ->  Nested Loop  (cost=0.00..4226502.76 rows=698 width=82) (actual time=921.325..74314.959 rows=3599 loops=1)
                     ->  Nested Loop  (cost=0.00..4935.61 rows=731 width=72) (actual time=166.354..14638.308 rows=1162 loops=1)
                           ->  Nested Loop  (cost=0.00..2482.47 rows=741 width=50) (actual time=150.396..7348.013 rows=1162 loops=1)
                                 ->  Index Scan using k_h_id_idx on k_h  (cost=0.00..217.55 rows=741 width=14) (actual time=129.540..1022.243 rows=1162 loops=1)
                                       Index Cond: (id = 33396)
                                       Filter: ((entrydate >= '2006-01-01 00:00:00'::timestamp without time zone) AND (entrydate < '2006-04-08 00:00:00'::timestamp without time zone))
                                 ->  Index Scan using eventgeo_incidentid_idx on eventgeo  (cost=0.00..3.04 rows=1 width=36) (actual time=5.260..5.429 rows=1 loops=1162)
                                       Index Cond: ((eventgeo.incidentid)::text = ("outer".incidentid)::text)
                           ->  Index Scan using eventmain_incidentid_idx on eventmain  (cost=0.00..3.30 rows=1 width=22) (actual time=5.976..6.259 rows=1 loops=1162)
                                 Index Cond: ((eventmain.incidentid)::text = ("outer".incidentid)::text)
                     ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..5774.81 rows=20 width=52) (actual time=29.768..51.334 rows=3 loops=1162)
                           Index Cond: (("outer".incidentid)::text = (eventactivity.incidentid)::text)
                           Filter: ((((' '::text || (recordtext)::text) || ' '::text) ~~ '%HAL%'::text) AND (entrydate >= '2006-01-01 00:00:00'::timestamp without time zone) AND (entrydate < '2006-04-08 00:00:00'::timestamp without time zone))



Re: Encouraging multi-table join order

From: Tom Lane

Dan Harris <fbsd@drivefaster.net> writes:
> Yes, eventactivity.incidentid is indexed.  The datatype is varchar(40).
> Although, by checking this, I noticed that k_h.incidentid was
> varchar(100).  Perhaps the difference in length between the keys caused
> the planner to not use the fastest method?

No, the planner wouldn't care about that.

> Here's the EXPLAIN analyze with enable_hashjoin = off and
> enable_mergejoin = off :

OK, so it does consider the "right" plan, but it's estimating it'll take
longer than the other one.  One thing that's very strange is that the
estimated number of rows out has changed ... did you re-ANALYZE since
the previous message?

>                      ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..5774.81 rows=20 width=52) (actual time=29.768..51.334 rows=3 loops=1162)
>                            Index Cond: (("outer".incidentid)::text = (eventactivity.incidentid)::text)
>                            Filter: ((((' '::text || (recordtext)::text) || ' '::text) ~~ '%HAL%'::text) AND (entrydate >= '2006-01-01 00:00:00'::timestamp without time zone) AND (entrydate < '2006-04-08 00:00:00'::timestamp without time zone))

So it's estimating 5775 cost units per probe into eventactivity, which
is pretty high --- it must think that a lot of rows will be retrieved by
the index (way more than the 20 or so it thinks will get past the filter
condition).  What does the pg_stats entry for eventactivity.incidentid
contain?  It might be worth increasing the statistics target for that
column to try to get a better estimate.
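(Bumping the per-column target and re-analyzing would look roughly like this;
100 is just an example value, anything above the default of 10 enlarges the
sample:)

    alter table eventactivity alter column incidentid set statistics 100;
    analyze eventactivity;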

            regards, tom lane

Re: Encouraging multi-table join order

From: Dan Harris

Tom Lane wrote:
> <SNIP>
> So it's estimating 5775 cost units per probe into eventactivity, which
> is pretty high --- it must think that a lot of rows will be retrieved by
> the index (way more than the 20 or so it thinks will get past the filter
> condition).

>  What does the pg_stats entry for eventactivity.incidentid
> contain?
select * from pg_stats where tablename = 'eventactivity' and
attname='incidentid';
 schemaname        | public
 tablename         | eventactivity
 attname           | incidentid
 null_frac         | 0
 avg_width         | 14
 n_distinct        | 8157
 most_common_vals  | {P043190299,P051560740,P052581036,P052830218,P053100679,P053190889,P060370845,P042070391,P042690319,P043290117}
 most_common_freqs | {0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00133333,0.00133333,0.00133333}
 histogram_bounds  | {P022140319,P030471058,P033090308,P041961082,P042910689,P050311006,P051350254,P052261148,P053270945,P060240316,P061000287}
 correlation       | 0.241737

>   It might be worth increasing the statistics target for that
> column to try to get a better estimate.
>
How high should I set this?  I read the default is 10, but I'm not sure
if doubling this would make a difference or if I should be doing a much
larger number. There's approx 45 million rows in the table, if that matters.


Thanks again,
Dan

Re: Encouraging multi-table join order

From: Tom Lane

Dan Harris <fbsd@drivefaster.net> writes:
> Tom Lane wrote:
>> What does the pg_stats entry for eventactivity.incidentid
>> contain?

> {P043190299,P051560740,P052581036,P052830218,P053100679,P053190889,P060370845,P042070391,P042690319,P043290117}
> |
> {0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00133333,0.00133333,0.00133333}

> How high should I set this?  I read the default is 10, but I'm not sure
> if doubling this would make a difference or if I should be doing a much
> larger number. There's approx 45 million rows in the table, if that matters.

What the stats entry is saying is that the most common entries occur
about 75000 times apiece (0.00166667 * 45e6), which is what's scaring
the planner here ;-).  I think those frequencies are artificially high
though.  The default statistics sample size is 3000 rows (300 *
statistics target, actually), so those numbers correspond to 5 or 4
rows in the sample, which is probably just random chance.

Try increasing the stats targets for this table to 100, then re-ANALYZE
and see what you get.  The most_common_freqs entries might drop as much
as a factor of 10.

            regards, tom lane

Re: Encouraging multi-table join order

From: Dan Harris

Tom Lane wrote:
> What the stats entry is saying is that the most common entries occur
> about 75000 times apiece (0.00166667 * 45e6), which is what's scaring
> the planner here ;-).  I think those frequencies are artificially high
> though.  The default statistics sample size is 3000 rows (300 *
> statistics target, actually), so those numbers correspond to 5 or 4
> rows in the sample, which is probably just random chance.
>
> Try increasing the stats targets for this table to 100, then re-ANALYZE
> and see what you get.  The most_common_freqs entries might drop as much
> as a factor of 10.
>
>             regards, tom lane
>

Tom:

I believe this was the problem.  I upped the statistics target to 100, for a
sample size of 30k, and now the planner does the correct nested
loop/index scan and takes only 30 seconds!  This is a HUGE performance
increase.

I wonder why the estimates were so far off the first time?  This table
has been ANALYZED regularly ever since creation.

Once again, thank you and all of the developers for your hard work on
PostgreSQL.  This is by far the most pleasant management experience of
any database I've worked on.

-Dan


Re: Encouraging multi-table join order

From: Tom Lane

Dan Harris <fbsd@drivefaster.net> writes:
> I wonder why the estimates were so far off the first time?  This table
> has been ANALYZED regularly ever since creation.

Probably just that you need a bigger sample size for such a large table.
We've been arguing ever since 7.2 about what the default statistics
target ought to be --- a lot of people think 10 is too small.  (It could
also be that the fixed 300X multiplier ought to depend on table size
instead.  The math that told us 300X was OK was really about getting the
histogram right, not about whether the most-common-values stats would be
any good.)

            regards, tom lane