Thread: slow joining very large table to smaller ones

slow joining very large table to smaller ones

From: Dan Harris
I'm trying to improve the speed of this query:

explain select recordtext from eventactivity inner join ( select
incidentid from k_r where id = 94 ) a using ( incidentid ) inner join
( select incidentid from k_b where id = 107 ) b using ( incidentid );
                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Merge Join  (cost=2747.29..4249364.96 rows=11968693 width=35)
    Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
    ->  Merge Join  (cost=1349.56..4230052.73 rows=4413563 width=117)
          Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
          ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..4051200.28 rows=44519781 width=49)
          ->  Sort  (cost=1349.56..1350.85 rows=517 width=68)
                Sort Key: (k_b.incidentid)::text
                ->  Index Scan using k_b_idx on k_b  (cost=0.00..1326.26 rows=517 width=68)
                      Index Cond: (id = 107)
    ->  Sort  (cost=1397.73..1399.09 rows=542 width=68)
          Sort Key: (k_r.incidentid)::text
          ->  Index Scan using k_r_idx on k_r  (cost=0.00..1373.12 rows=542 width=68)
                Index Cond: (id = 94)
(13 rows)


There are many millions of rows in eventactivity.  There are a few
tens of thousands of rows in k_r and k_b.  There is an index on
'incidentid' in all three tables.  Fewer than 100 rows in total should
match in k_r and k_b, and that part on its own is very, very fast.  So
it should be able to extract those 100 or so incidentids in under a
second and only AFTER that go into eventactivity.  At least, that's my
intention for making this fast.

Right now, it looks like pg is trying to sort the entire
eventactivity table for the merge join which is taking several
minutes to do.  Can I rephrase this so that it does the searching
through k_r and k_b FIRST and then go into eventactivity using the
index on incidentid?  It seems like that shouldn't be too hard to
make fast but my SQL query skills are only average.
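Just to spell out the shape I have in mind (only a sketch, and probably
not strictly equivalent, since the IN form would also collapse duplicate
incidentids):

select recordtext
  from eventactivity
 where incidentid in ( select incidentid from k_r where id = 94 )
   and incidentid in ( select incidentid from k_b where id = 107 );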

Thanks
-Dan

Re: slow joining very large table to smaller ones

From: John A Meinel
Dan Harris wrote:
> I'm trying to improve the speed of this query:
>
> explain select recordtext from eventactivity inner join ( select
> incidentid from k_r where id = 94 ) a using ( incidentid ) inner join  (
> select incidentid from k_b where id = 107 ) b using ( incidentid );

You might try giving it a little bit more freedom with:

EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity, k_r, k_b
 WHERE eventactivity.incidentid = k_r.incidentid
   AND eventactivity.incidentid = k_b.incidentid
   AND k_r.id = 94
   AND k_b.id = 107
-- AND k_r.incidentid = k_b.incidentid
;

I'm pretty sure that would give identical results; it just gives the
planner a little bit more freedom about how it does it.
Also, the last line is commented out because I think it is redundant.

You might also try:
EXPLAIN ANALYZE
SELECT recordtext
  FROM eventactivity JOIN k_r USING (incidentid)
  JOIN k_b USING (incidentid)
 WHERE k_r.id = 94
   AND k_b.id = 107
;

Also, if possible give us the EXPLAIN ANALYZE so that we know if the
planner is making accurate estimates. (You might send an EXPLAIN while
waiting for the EXPLAIN ANALYZE to finish)

You can also try disabling merge joins, and see how that changes things.
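For example, something along these lines (this only affects the current
session, and you can RESET it afterwards):

SET enable_mergejoin TO off;
EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity, k_r, k_b
 WHERE eventactivity.incidentid = k_r.incidentid
   AND eventactivity.incidentid = k_b.incidentid
   AND k_r.id = 94
   AND k_b.id = 107;
RESET enable_mergejoin;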

>                                                   QUERY PLAN
> ------------------------------------------------------------------------
> --------------------------------------
> Merge Join  (cost=2747.29..4249364.96 rows=11968693 width=35)
>    Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
>    ->  Merge Join  (cost=1349.56..4230052.73 rows=4413563 width=117)
>          Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
>          ->  Index Scan using eventactivity1 on eventactivity
> (cost=0.00..4051200.28 rows=44519781 width=49)
>          ->  Sort  (cost=1349.56..1350.85 rows=517 width=68)
>                Sort Key: (k_b.incidentid)::text
>                ->  Index Scan using k_b_idx on k_b   (cost=0.00..1326.26
> rows=517 width=68)
>                      Index Cond: (id = 107)
>    ->  Sort  (cost=1397.73..1399.09 rows=542 width=68)
>          Sort Key: (k_r.incidentid)::text
>          ->  Index Scan using k_r_idx on k_r  (cost=0.00..1373.12
> rows=542 width=68)
>                Index Cond: (id = 94)
> (13 rows)
>
>
> There are many millions of rows in eventactivity.  There are a few
> ten-thousand rows in k_r and k_b.  There is an index on 'incidentid'  in
> all three tables.  There should only be less than 100 rows matched  in
> k_r and k_b total.  That part on its own is very very fast.  But,  it
> should have those 100 or so incidentids extracted in under a  second and
> then go into eventactivity AFTER doing that.  At least,  that's my
> intention to make this fast.

Well, postgres is estimating around 500 rows each, is that way off? Try
just doing:
EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;

And see if postgres estimates the number of rows properly.

I assume you have recently run VACUUM ANALYZE, in which case you might
need to raise the statistics target (ALTER TABLE k_b ALTER COLUMN
incidentid SET STATISTICS 100).  The default is IIRC 10 and the range is
1-1000; higher is more accurate, but makes ANALYZE slower.
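For instance (a sketch; the new target only takes effect when the column
is next analyzed, so follow it with ANALYZE):

ALTER TABLE k_b ALTER COLUMN incidentid SET STATISTICS 100;
ALTER TABLE k_r ALTER COLUMN incidentid SET STATISTICS 100;
ANALYZE k_b;
ANALYZE k_r;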

>
> Right now, it looks like pg is trying to sort the entire  eventactivity
> table for the merge join which is taking several  minutes to do.  Can I
> rephrase this so that it does the searching  through k_r and k_b FIRST
> and then go into eventactivity using the  index on incidentid?  It seems
> like that shouldn't be too hard to  make fast but my SQL query skills
> are only average.

To me, it looks like it is doing an index scan (on k_b.id) through k_b
first, sorting the results by incidentid, then merge joining that with
eventactivity.

I'm guessing you actually want it to merge k_b and k_r to get extra
selectivity before joining against eventactivity.
I think my alternate forms would let postgres realize this. But if not,
you could try:

EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity
  JOIN (SELECT incidentid FROM k_r JOIN k_b USING (incidentid)
         WHERE k_r.id = 94 AND k_b.id = 107) a
 USING (incidentid);

I don't know how selective your keys are, but one of these queries
should probably structure it better for the planner. It depends a lot on
how selective your query is.
If you have 100M rows, the above query looks like it expects k_r to
restrict it to 44M rows, and k_r + k_b down to 11M rows, which really
should be a seq scan (> 10% of the rows = seq scan). But if you are
saying the selectivity is mis-estimated it could be different.

John
=:->
>
> Thanks
> -Dan
>



Re: slow joining very large table to smaller ones

From: Dan Harris
On Jul 14, 2005, at 9:42 AM, John A Meinel wrote:
>
>
> You might try giving it a little bit more freedom with:
>
> EXPLAIN ANALYZE
> SELECT recordtext FROM eventactivity, k_r, k_b
>  WHERE eventactivity.incidentid = k_r.incidentid
>    AND eventactivity.incidentid = k_b.incidentid
>    AND k_r.id = 94
>    AND k_b.id = 107
> -- AND k_r.incidentid = k_b.incidentid
> ;
>
> I'm pretty sure that would give identical results, just let the
> planner
> have a little bit more freedom about how it does it.
> Also the last line is commented out, because I think it is redundant.
>

Ok, I tried this one.  My ssh keeps getting cut off by a router
somewhere between me and the server due to inactivity timeouts, so
all I know is that both the select and explain analyze are taking
over an hour to run.  Here's the explain select for that one, since
that's the best I can get.

explain select recordtext from eventactivity,k_r,k_b where
eventactivity.incidentid = k_r.incidentid and
eventactivity.incidentid = k_b.incidentid and k_r.id = 94 and k_b.id
= 107;
                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Merge Join  (cost=9624.61..4679590.52 rows=151009549 width=35)
    Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
    ->  Merge Join  (cost=4766.92..4547684.26 rows=16072733 width=117)
          Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
          ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..4186753.16 rows=46029271 width=49)
          ->  Sort  (cost=4766.92..4771.47 rows=1821 width=68)
                Sort Key: (k_b.incidentid)::text
                ->  Index Scan using k_b_idx on k_b  (cost=0.00..4668.31 rows=1821 width=68)
                      Index Cond: (id = 107)
    ->  Sort  (cost=4857.69..4862.39 rows=1879 width=68)
          Sort Key: (k_r.incidentid)::text
          ->  Index Scan using k_r_idx on k_r  (cost=0.00..4755.52 rows=1879 width=68)
                Index Cond: (id = 94)
(13 rows)



> You might also try:
> EXPLAIN ANALYZE
> SELECT recordtext
>   FROM eventactivity JOIN k_r USING (incidentid)
>   JOIN k_b USING (incidentid)
>  WHERE k_r.id = 94
>    AND k_b.id = 107
> ;
>

Similar results here.  The query is taking at least an hour to finish.

explain select recordtext from eventactivity join k_r using
( incidentid ) join k_b using (incidentid ) where k_r.id = 94 and
k_b.id = 107;

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Merge Join  (cost=9542.77..4672831.12 rows=148391132 width=35)
    Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
    ->  Merge Join  (cost=4726.61..4542825.87 rows=15930238 width=117)
          Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
          ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..4184145.43 rows=46000104 width=49)
          ->  Sort  (cost=4726.61..4731.13 rows=1806 width=68)
                Sort Key: (k_b.incidentid)::text
                ->  Index Scan using k_b_idx on k_b  (cost=0.00..4628.92 rows=1806 width=68)
                      Index Cond: (id = 107)
    ->  Sort  (cost=4816.16..4820.82 rows=1863 width=68)
          Sort Key: (k_r.incidentid)::text
          ->  Index Scan using k_r_idx on k_r  (cost=0.00..4714.97 rows=1863 width=68)
                Index Cond: (id = 94)
(13 rows)



> Also, if possible give us the EXPLAIN ANALYZE so that we know if the
> planner is making accurate estimates. (You might send an EXPLAIN while
> waiting for the EXPLAIN ANALYZE to finish)
>
> You can also try disabling merge joins, and see how that changes
> things.
>

Are there any negative side effects of doing this?

>
>>
>>
>
> Well, postgres is estimating around 500 rows each, is that way off?
> Try
> just doing:
> EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
> EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;
>
> And see if postgres estimates the number of rows properly.
>
> I assume you have recently VACUUM ANALYZEd, which means you might need
> to update the statistics target (ALTER TABLE k_b ALTER COLUMN
> incidentid SET STATISTICS 100) default is IIRC 10, ranges from
> 1-1000,
> higher is more accurate, but makes ANALYZE slower.
>
>
>>
>> Right now, it looks like pg is trying to sort the entire
>> eventactivity
>> table for the merge join which is taking several  minutes to do.
>> Can I
>> rephrase this so that it does the searching  through k_r and k_b
>> FIRST
>> and then go into eventactivity using the  index on incidentid?  It
>> seems
>> like that shouldn't be too hard to  make fast but my SQL query skills
>> are only average.
>>
>
> To me, it looks like it is doing an index scan (on k_b.id) through k_b
> first, sorting the results by incidentid, then merge joining that with
> eventactivity.
>
> I'm guessing you actually want it to merge k_b and k_r to get extra
> selectivity before joining against eventactivity.
> I think my alternate forms would let postgres realize this. But if
> not,
> you could try:
>
> EXPLAIN ANALYZE
> SELECT recordtext FROM eventactivity
>  JOIN (SELECT incidentid FROM k_r JOIN k_b USING (incidentid)
>     WHERE k_r.id = 94 AND k_b.id = 107)
> USING (incidentid);
>

This one looks like the same plan as the others:

explain select recordtext from eventactivity join ( select incidentid
from k_r join k_b using (incidentid) where k_r.id = 94 and k_b.id =
107 ) a  using (incidentid );
                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Merge Join  (cost=9793.33..4693149.15 rows=156544758 width=35)
    Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
    ->  Merge Join  (cost=4847.75..4557237.59 rows=16365843 width=117)
          Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
          ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..4191691.79 rows=46084161 width=49)
          ->  Sort  (cost=4847.75..4852.38 rows=1852 width=68)
                Sort Key: (k_b.incidentid)::text
                ->  Index Scan using k_b_idx on k_b  (cost=0.00..4747.24 rows=1852 width=68)
                      Index Cond: (id = 107)
    ->  Sort  (cost=4945.58..4950.36 rows=1913 width=68)
          Sort Key: (k_r.incidentid)::text
          ->  Index Scan using k_r_idx on k_r  (cost=0.00..4841.30 rows=1913 width=68)
                Index Cond: (id = 94)
(13 rows)



> I don't know how selective your keys are, but one of these queries
> should probably structure it better for the planner. It depends a
> lot on
> how selective your query is.

eventactivity currently has around 36 million rows in it. There
should only be maybe 200-300 incidentids at most that will be matched
with the combination of k_b and k_r.  That's why I was thinking I
could somehow get a list of just the incidentids that matched the id
= 94 and id = 107 in k_b and k_r first. Then, I would only need to
grab a few hundred out of 36 million rows from eventactivity.

> If you have 100M rows, the above query looks like it expects k_r to
> restrict it to 44M rows, and k_r + k_b down to 11M rows, which really
> should be a seq scan (> 10% of the rows = seq scan). But if you are
> saying the selectivity is mis-estimated it could be different.
>

Yeah, if I understand you correctly, I think the previous paragraph
shows this is a significant misestimate.



Re: slow joining very large table to smaller ones

From: John A Meinel
Dan Harris wrote:
>
> On Jul 14, 2005, at 9:42 AM, John A Meinel wrote:

...
Did you try doing this to see how good the planner's selectivity
estimates are?

>> Well, postgres is estimating around 500 rows each, is that way off?  Try
>> just doing:
>> EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
>> EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;

These should be fast queries.

John
=:->


Re: slow joining very large table to smaller ones

From: Michael Stone
On Thu, Jul 14, 2005 at 04:29:58PM -0600, Dan Harris wrote:
>Ok, I tried this one.  My ssh keeps getting cut off by a router
>somewhere between me and the server due to inactivity timeouts, so
>all I know is that both the select and explain analyze are taking
>over an hour to run.

Try running the query as a script with nohup & redirect the output to a
file.

Mike Stone

Re: slow joining very large table to smaller ones

From: Tom Lane
Dan Harris <fbsd@drivefaster.net> writes:
> Here's the explain select for that one, since
> that's the best I can get.

> explain select recordtext from eventactivity,k_r,k_b where
> eventactivity.incidentid = k_r.incidentid and
> eventactivity.incidentid = k_b.incidentid and k_r.id = 94 and k_b.id
> = 107;
>                                                    QUERY PLAN
> ------------------------------------------------------------------------
> --------------------------------------
> Merge Join  (cost=9624.61..4679590.52 rows=151009549 width=35)
>     Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
>     ->  Merge Join  (cost=4766.92..4547684.26 rows=16072733 width=117)
>           Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
>           ->  Index Scan using eventactivity1 on eventactivity
> (cost=0.00..4186753.16 rows=46029271 width=49)
>           ->  Sort  (cost=4766.92..4771.47 rows=1821 width=68)
>                 Sort Key: (k_b.incidentid)::text
>                 ->  Index Scan using k_b_idx on k_b
> (cost=0.00..4668.31 rows=1821 width=68)
>                       Index Cond: (id = 107)
>     ->  Sort  (cost=4857.69..4862.39 rows=1879 width=68)
>           Sort Key: (k_r.incidentid)::text
>           ->  Index Scan using k_r_idx on k_r  (cost=0.00..4755.52
> rows=1879 width=68)
>                 Index Cond: (id = 94)
> (13 rows)

There's something awfully fishy here.  The 8.0 planner is definitely
capable of figuring out that it ought to join the smaller relations
first.  As an example, using 8.0.3+ (CVS branch tip) I did

regression=# create table eventactivity(incidentid varchar, recordtext text);
CREATE TABLE
regression=# create table k_r(incidentid varchar);
CREATE TABLE
regression=# create table k_b(incidentid varchar);
CREATE TABLE
regression=# explain select recordtext from eventactivity inner join
(select incidentid from k_r) a using (incidentid)
inner join (select incidentid from k_b) b using (incidentid);

(Being too impatient to actually fill the eventactivity table with 36M
rows of data, I just did some debugger magic to make the planner think
that that was the table size...)  The default plan looks like

 Merge Join  (cost=16137814.70..36563453.23 rows=1361700000 width=32)
   Merge Cond: (("outer".incidentid)::text = "inner"."?column3?")
   ->  Merge Join  (cost=170.85..290.48 rows=7565 width=64)
         Merge Cond: ("outer"."?column2?" = "inner"."?column2?")
         ->  Sort  (cost=85.43..88.50 rows=1230 width=32)
               Sort Key: (k_r.incidentid)::text
               ->  Seq Scan on k_r  (cost=0.00..22.30 rows=1230 width=32)
         ->  Sort  (cost=85.43..88.50 rows=1230 width=32)
               Sort Key: (k_b.incidentid)::text
               ->  Seq Scan on k_b  (cost=0.00..22.30 rows=1230 width=32)
   ->  Sort  (cost=16137643.84..16227643.84 rows=36000000 width=64)
         Sort Key: (eventactivity.incidentid)::text
         ->  Seq Scan on eventactivity  (cost=0.00..1080000.00 rows=36000000 width=64)

and if I "set enable_mergejoin TO 0;" I get

 Hash Join  (cost=612.54..83761451.54 rows=1361700000 width=32)
   Hash Cond: (("outer".incidentid)::text = ("inner".incidentid)::text)
   ->  Seq Scan on eventactivity  (cost=0.00..1080000.00 rows=36000000 width=64)
   ->  Hash  (cost=504.62..504.62 rows=7565 width=64)
         ->  Hash Join  (cost=25.38..504.62 rows=7565 width=64)
               Hash Cond: (("outer".incidentid)::text = ("inner".incidentid)::text)
               ->  Seq Scan on k_r  (cost=0.00..22.30 rows=1230 width=32)
               ->  Hash  (cost=22.30..22.30 rows=1230 width=32)
                     ->  Seq Scan on k_b  (cost=0.00..22.30 rows=1230 width=32)

which is the plan I would judge Most Likely To Succeed based on what we
know about Dan's problem.  (The fact that the planner is estimating it
as twice as expensive as the mergejoin comes from the fact that with no
statistics about the join keys, the planner deliberately estimates hash
join as expensive, because it can be pretty awful in the presence of
many equal keys.)

So the planner is certainly capable of finding the desired plan, even
without any tweaking of the query text.  This means that what we have
is mainly a statistical problem.  Have you ANALYZEd these tables
recently?  If so, may we see the pg_stats rows for incidentid in all
three tables?
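Something like this, for example, should pull them out (adjust the
column list to taste; n_distinct and the most_common_vals/most_common_freqs
entries are the interesting parts):

SELECT tablename, attname, null_frac, n_distinct,
       most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE attname = 'incidentid'
   AND tablename IN ('eventactivity', 'k_r', 'k_b');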

            regards, tom lane

Re: slow joining very large table to smaller ones

From: John A Meinel
Dan Harris wrote:
>
> On Jul 14, 2005, at 9:42 AM, John A Meinel wrote:
>
>>
>>
>> You might try giving it a little bit more freedom with:
>>
>> EXPLAIN ANALYZE
>> SELECT recordtext FROM eventactivity, k_r, k_b
>>  WHERE eventactivity.incidentid = k_r.incidentid
>>    AND eventactivity.incidentid = k_b.incidentid
>>    AND k_r.id = 94
>>    AND k_b.id = 107
>> -- AND k_r.incidentid = k_b.incidentid
>> ;
>>
>> I'm pretty sure that would give identical results, just let the  planner
>> have a little bit more freedom about how it does it.
>> Also the last line is commented out, because I think it is redundant.
>>
>
> Ok, I tried this one.  My ssh keeps getting cut off by a router
> somewhere between me and the server due to inactivity timeouts, so  all
> I know is that both the select and explain analyze are taking  over an
> hour to run.  Here's the explain select for that one, since  that's the
> best I can get.
>
> explain select recordtext from eventactivity,k_r,k_b where
> eventactivity.incidentid = k_r.incidentid and  eventactivity.incidentid
> = k_b.incidentid and k_r.id = 94 and k_b.id  = 107;
>                                                   QUERY PLAN
> ------------------------------------------------------------------------
> --------------------------------------
> Merge Join  (cost=9624.61..4679590.52 rows=151009549 width=35)
>    Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
>    ->  Merge Join  (cost=4766.92..4547684.26 rows=16072733 width=117)
>          Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
>          ->  Index Scan using eventactivity1 on eventactivity
> (cost=0.00..4186753.16 rows=46029271 width=49)
>          ->  Sort  (cost=4766.92..4771.47 rows=1821 width=68)
>                Sort Key: (k_b.incidentid)::text
>                ->  Index Scan using k_b_idx on k_b   (cost=0.00..4668.31
> rows=1821 width=68)
>                      Index Cond: (id = 107)
>    ->  Sort  (cost=4857.69..4862.39 rows=1879 width=68)
>          Sort Key: (k_r.incidentid)::text
>          ->  Index Scan using k_r_idx on k_r  (cost=0.00..4755.52
> rows=1879 width=68)
>                Index Cond: (id = 94)
> (13 rows)
>

If anything, the estimates have gotten worse: it now thinks there will
be around 1800 rows returned from each, whereas you were expecting it to
be more like 100.

Since you didn't say, you did VACUUM ANALYZE recently, right?

>

...

>>
>> You can also try disabling merge joins, and see how that changes  things.
>>
>
> Are there any negative side effects of doing this?

If the planner is estimating things correctly, you want to give it the
most flexibility in choosing plans, because sometimes a merge join is
faster (postgres doesn't pick a plan because it wants to go slower).
The only reason for the disable flags is that sometimes the planner
doesn't estimate correctly. Usually disabling a method is not the final
solution, but a way to try out different methods and see what happens
to the results.

With "SET enable_mergejoin TO off;" you disable it just for the current
session (not for the entire database), which is the recommended way to
experiment when postgres is getting a particular query wrong (usually it
is correct elsewhere).


>>
>> Well, postgres is estimating around 500 rows each, is that way off?  Try
>> just doing:
>> EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
>> EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;

Once again, do this and post the results. We might just need to tweak
your settings so that it estimates the number of rows correctly, and we
don't need to do anything else.

>>
>> And see if postgres estimates the number of rows properly.
>>
>> I assume you have recently VACUUM ANALYZEd, which means you might need
>> to update the statistics target (ALTER TABLE k_b ALTER COLUMN
>> incidentid SET STATISTICS 100) default is IIRC 10, ranges from  1-1000,
>> higher is more accurate, but makes ANALYZE slower.
>>


...

>> EXPLAIN ANALYZE
>> SELECT recordtext FROM eventactivity
>>  JOIN (SELECT incidentid FROM k_r JOIN k_b USING (incidentid)
>>     WHERE k_r.id = 94 AND k_b.id = 107)
>> USING (incidentid);
>>
>
> This one looks like the same plan as the others:
>
> explain select recordtext from eventactivity join ( select incidentid
> from k_r join k_b using (incidentid) where k_r.id = 94 and k_b.id =  107
> ) a  using (incidentid );

Well, the planner is powerful enough to flatten nested selects. To make
it less "intelligent" you can do:
SET join_collapse_limit = 1;
or
SET join_collapse_limit = 0;
which should tell postgres not to try to get tricky with your query.
Again, *usually* the planner knows better than you do, so just do it to
see what you get.

The problem is that if you are only using EXPLAIN SELECT, you will
probably get something which *looks* worse. Because if it looked better,
the planner would have used it. That is why you really need the EXPLAIN
ANALYZE, so that you can see where the planner is incorrect in its
estimates.
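Concretely, the sort of thing to run (session-local, using the rewritten
query from above) would be:

SET join_collapse_limit = 1;
EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity
  JOIN (SELECT incidentid FROM k_r JOIN k_b USING (incidentid)
         WHERE k_r.id = 94 AND k_b.id = 107) a
 USING (incidentid);
RESET join_collapse_limit;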


>                                                   QUERY PLAN
> ------------------------------------------------------------------------
> --------------------------------------
> Merge Join  (cost=9793.33..4693149.15 rows=156544758 width=35)
>    Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
>    ->  Merge Join  (cost=4847.75..4557237.59 rows=16365843 width=117)
>          Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
>          ->  Index Scan using eventactivity1 on eventactivity
> (cost=0.00..4191691.79 rows=46084161 width=49)
>          ->  Sort  (cost=4847.75..4852.38 rows=1852 width=68)
>                Sort Key: (k_b.incidentid)::text
>                ->  Index Scan using k_b_idx on k_b   (cost=0.00..4747.24
> rows=1852 width=68)
>                      Index Cond: (id = 107)
>    ->  Sort  (cost=4945.58..4950.36 rows=1913 width=68)
>          Sort Key: (k_r.incidentid)::text
>          ->  Index Scan using k_r_idx on k_r  (cost=0.00..4841.30
> rows=1913 width=68)
>                Index Cond: (id = 94)
> (13 rows)
>

What I don't understand is that the planner is actually estimating that
joining against the new table is going to *increase* the number of
returned rows.
Because the final number of rows here is 156M while the initial join
shows only 16M.
And it also thinks that it will only grab 46M rows from eventactivity.

If you have analyzed recently can you do:
SELECT relname, reltuples FROM pg_class WHERE relname='eventactivity';

It is a cheaper form than "SELECT count(*) FROM eventactivity" to get an
approximate estimate of the number of rows. But if it isn't too
expensive, please also give the value from SELECT count(*) FROM
eventactivity.

Again, that helps us know if your tables are up-to-date.


>
>
>> I don't know how selective your keys are, but one of these queries
>> should probably structure it better for the planner. It depends a  lot on
>> how selective your query is.
>
>
> eventactivity currently has around 36 million rows in it. There  should
> only be maybe 200-300 incidentids at most that will be matched  with the
> combination of k_b and k_r.  That's why I was thinking I  could somehow
> get a list of just the incidentids that matched the id  = 94 and id =
> 107 in k_b and k_r first. Then, I would only need to  grab a few hundred
> out of 36 million rows from eventactivity.
>

Well, you can also try:
SELECT count(*) FROM k_b JOIN k_r USING (incidentid)
 WHERE k_b.id=?? AND k_r.id=??
;

That will tell you how many rows they have in common.

>> If you have 100M rows, the above query looks like it expects k_r to
>> restrict it to 44M rows, and k_r + k_b down to 11M rows, which really
>> should be a seq scan (> 10% of the rows = seq scan). But if you are
>> saying the selectivity is mis-estimated it could be different.
>>
>
> Yeah, if I understand you correctly, I think the previous paragraph
> shows this is a significant misestimate.
>

Well, if you look at the latest plans, things have gone up from 44M to
156M. I don't know why it is worse, but it is getting there.

John
=:->


Re: slow joining very large table to smaller ones

From: Tom Lane
John A Meinel <john@arbash-meinel.com> writes:
> What I don't understand is that the planner is actually estimating that
> joining against the new table is going to *increase* the number of
> returned rows.

It evidently thinks that incidentid in the k_r table is pretty
nonunique.  We really need to look at the statistics data to
see what's going on.

            regards, tom lane

Re: slow joining very large table to smaller ones

From: John A Meinel
Tom Lane wrote:
> John A Meinel <john@arbash-meinel.com> writes:
>
>>What I don't understand is that the planner is actually estimating that
>>joining against the new table is going to *increase* the number of
>>returned rows.
>
>
> It evidently thinks that incidentid in the k_r table is pretty
> nonunique.  We really need to look at the statistics data to
> see what's going on.
>
>             regards, tom lane
>

Okay, sure. What about doing this, then:

EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity
  JOIN (SELECT DISTINCT incidentid FROM k_r JOIN k_b USING (incidentid)
         WHERE k_r.id = ?? AND k_b.id = ??) a
 USING (incidentid)
;

I assume that eventactivity is the only table with "recordtext" and that
you don't need any columns from k_r or k_b, in which case it would be
pointless to fetch duplicate incidentids.

I may be misunderstanding what the query is trying to do, but depending
on what is in k_r and k_b, is it possible to use a UNIQUE INDEX rather
than just an index on incidentid?
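For instance, if (id, incidentid) pairs really are unique in those
tables (that's just a guess on my part, and the index names below are
made up), something like this would both enforce it and serve the id
lookups:

CREATE UNIQUE INDEX k_b_id_incidentid_key ON k_b (id, incidentid);
CREATE UNIQUE INDEX k_r_id_incidentid_key ON k_r (id, incidentid);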

There is also the possibility of:
EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity
  JOIN (SELECT incidentid FROM k_r WHERE k_r.id = ??
         UNION SELECT incidentid FROM k_b WHERE k_b.id = ??) a
 USING (incidentid)
;

But both of these would mean that you don't actually want columns from
k_r or k_b, just a unique list of incident ids.

But first, I agree, we should make sure the pg_stats values are reasonable.

John
=:->



Re: slow joining very large table to smaller ones

From: Dan Harris
On Jul 14, 2005, at 5:12 PM, John A Meinel wrote:

> Dan Harris wrote:
>
>
>>>
>>> Well, postgres is estimating around 500 rows each, is that way
>>> off?  Try
>>> just doing:
>>> EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
>>> EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;
>>>
>
> Once again, do this and post the results. We might just need to tweak
> your settings so that it estimates the number of rows correctly,
> and we
> don't need to do anything else.
>

Ok, sorry I missed these the first time through:

explain analyze select incidentid from k_b where id = 107;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Scan using k_b_idx on k_b  (cost=0.00..1926.03 rows=675 width=14) (actual time=0.042..298.394 rows=2493 loops=1)
    Index Cond: (id = 107)
Total runtime: 299.103 ms

select count(*) from k_b;
count
--------
698350

( sorry! I think I said this one only had tens of thousands in it )


explain analyze select incidentid from k_r where id = 94;
                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Index Scan using k_r_idx on k_r  (cost=0.00..2137.61 rows=757 width=14) (actual time=0.092..212.187 rows=10893 loops=1)
    Index Cond: (id = 94)
Total runtime: 216.498 ms
(3 rows)


select count(*) from k_r;
count
--------
671670


That one is quite a bit slower, yet it has the same table structure and
the same kind of index as k_b, and it even has fewer records.

I did run VACUUM ANALYZE immediately before running these queries.
It seems a lot better with join_collapse_limit set.

>
> \
> Well, the planner is powerful enough to flatten nested selects. To
> make
> it less "intelligent" you can do:
> SET join_collapse_limit 1;
> or
> SET join_collapse_limit 0;
> Which should tell postgres to not try and get tricky with your query.
> Again, *usually* the planner knows better than you do. So again
> just do
> it to see what you get.
>

Ok, when join_collapse_limit = 1 I get this now:

explain analyze select recordtext from eventactivity join ( select
incidentid from k_r join k_b using (incidentid) where k_r.id = 94 and
k_b.id = 107 ) a  using (incidentid );

                                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=0.00..156509.08 rows=2948 width=35) (actual time=1.555..340.625 rows=24825 loops=1)
    ->  Nested Loop  (cost=0.00..5361.89 rows=6 width=28) (actual time=1.234..142.078 rows=366 loops=1)
          ->  Index Scan using k_b_idx on k_b  (cost=0.00..1943.09 rows=681 width=14) (actual time=0.423..56.974 rows=2521 loops=1)
                Index Cond: (id = 107)
          ->  Index Scan using k_r_idx on k_r  (cost=0.00..5.01 rows=1 width=14) (actual time=0.031..0.031 rows=0 loops=2521)
                Index Cond: ((k_r.id = 94) AND ((k_r.incidentid)::text = ("outer".incidentid)::text))
    ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..25079.55 rows=8932 width=49) (actual time=0.107..0.481 rows=68 loops=366)
          Index Cond: ((eventactivity.incidentid)::text = ("outer".incidentid)::text)
Total runtime: 347.975 ms

MUCH better!  Maybe you can help me understand what I did and if I
need to make something permanent to get this behavior from now on?



>
>
>
> If you have analyzed recently can you do:
> SELECT relname, reltuples FROM pg_class WHERE relname='eventactivity';
>
> It is a cheaper form than "SELECT count(*) FROM eventactivity" to
> get an
> approximate estimate of the number of rows. But if it isn't too
> expensive, please also give the value from SELECT count(*) FROM
> eventactivity.
>
> Again, that helps us know if your tables are up-to-date.
>

Sure:

select relname, reltuples from pg_class where relname='eventactivity';
     relname    |  reltuples
---------------+-------------
eventactivity | 3.16882e+07

select count(*) from eventactivity;
   count
----------
31871142





>
>
>>
>>
>>
>>> I don't know how selective your keys are, but one of these queries
>>> should probably structure it better for the planner. It depends
>>> a  lot on
>>> how selective your query is.
>>>
>>
>>
>> eventactivity currently has around 36 million rows in it. There
>> should
>> only be maybe 200-300 incidentids at most that will be matched
>> with the
>> combination of k_b and k_r.  That's why I was thinking I  could
>> somehow
>> get a list of just the incidentids that matched the id  = 94 and id =
>> 107 in k_b and k_r first. Then, I would only need to  grab a few
>> hundred
>> out of 36 million rows from eventactivity.
>>
>>
>
> Well, you can also try:
> SELECT count(*) FROM k_b JOIN k_r USING (incidentid)
>  WHERE k_b.id=?? AND k_r.id=??
> ;
>
> That will tell you how many rows they have in common.

select count(*) from k_b join k_r using (incidentid) where k_b.id=107
and k_r.id=94;
count
-------
    373



>
> Well, if you look at the latest plans, things have gone up from 44M to
> 156M, I don't know why it is worse, but it is getting there.

I assume this is because k_r and k_b are growing fairly rapidly right
now.  The time in between queries contained a lot of inserts.  I was
careful to vacuum analyze before sending statistics, as I did this
time.  I'm sorry if this has confused the issue.




Re: slow joining very large table to smaller ones

From: Dan Harris
On Jul 14, 2005, at 7:15 PM, John A Meinel wrote:
>
>
> Is the distribution of your rows uneven? Meaning do you have more rows
> with a later id than an earlier one?
>

There are definitely some id's that will have many times more than
the others.  If I group and count them, the top 10 are fairly
dominant in the table.
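Roughly the kind of check I ran to see that (from memory, so treat it as
a sketch):

select id, count(*) from k_b group by id order by count(*) desc limit 10;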
>>
>
> Hmm... How to do it permanently? Well, you could always issue "SET
> join_collapse_limit = 1; SELECT * FROM ...."
> But obviously that isn't what you prefer. :)
>
> I think there are things you can do to make merge join more expensive
> than a nested loop, but I'm not sure what they are.

Maybe someone else has some ideas to encourage this behavior for
future work?  Setting it on a per-connection basis is doable, but
would add some burden to us in code.

>
> What I really don't understand is that the estimates dropped as well.
> The actual number of estimated rows drops to 3k instead of > 1M.
> The real question is why does the planner think it will be so
> expensive?
>
>
>> select count(*) from k_b join k_r using (incidentid) where k_b.id=107
>> and k_r.id=94;
>> count
>> -------
>>    373
>>
>>
>
> Well, this says that they are indeed much more selective.
> Each one has > 1k rows, but together you end up with only 400.
>

Is this a bad thing?  Is this not "selective enough" to make it much
faster?

Overall, I'm much happier now after seeing the new plan come about,
if I can find a way to make that join_collapse behavior permanent, I
can certainly live with these numbers.

Thanks again for your continued efforts.

-Dan

Re: slow joining very large table to smaller ones

From: John A Meinel
Dan Harris wrote:
>
> On Jul 14, 2005, at 7:15 PM, John A Meinel wrote:
>
>>
>>
>> Is the distribution of your rows uneven? Meaning do you have more rows
>> with a later id than an earlier one?
>>
>
> There are definitely some id's that will have many times more than  the
> others.  If I group and count them, the top 10 are fairly  dominant in
> the table.

That usually skews the estimates, since the estimate is more of an
average (unless the statistics target is higher).

>
>>>
>>
>> Hmm... How to do it permanently? Well, you could always issue "SET
>> join_collapse_limit = 1; SELECT * FROM ...."
>> But obviously that isn't what you prefer. :)
>>
>> I think there are things you can do to make merge join more expensive
>> than a nested loop, but I'm not sure what they are.
>
>
> Maybe someone else has some ideas to encourage this behavior for  future
> work?  Setting it on a per-connection basis is doable, but  would add
> some burden to us in code.

My biggest question is why the planner thinks the Nested Loop would be
so expensive.
Have you tuned any of the parameters? It seems like something is out of
whack. (cpu_tuple_cost, random_page_cost, etc...)
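As for making the join_collapse_limit behavior permanent without
touching your application code (just listing the options, I haven't
tried them against your setup): the setting can be attached to the user
your application connects as, or to the database, so every new
connection picks it up:

ALTER USER appuser SET join_collapse_limit = 1;      -- "appuser" standing in for your application's role
ALTER DATABASE yourdb SET join_collapse_limit = 1;   -- or attach it to the database instead

It can also go in postgresql.conf if you want it cluster-wide.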

>
>>
>> What I really don't understand is that the estimates dropped as well.
>> The actual number of estimated rows drops to 3k instead of > 1M.
>> The real question is why does the planner think it will be so  expensive?
>>
>>
>>> select count(*) from k_b join k_r using (incidentid) where k_b.id=107
>>> and k_r.id=94;
>>> count
>>> -------
>>>    373
>>>
>>>
>>
>> Well, this says that they are indeed much more selective.
>> Each one has > 1k rows, but together you end up with only 400.
>>
>
> Is this a bad thing?  Is this not "selective enough" to make it much
> faster?

Yes, being more selective is what makes it faster. But the planner
doesn't seem to notice it properly.

>
> Overall, I'm much happier now after seeing the new plan come about,  if
> I can find a way to make that join_collapse behavior permanent, I  can
> certainly live with these numbers.
>

I'm sure there are pieces to tune, but I've reached my limits of
parameters to tweak :)

> Thanks again for your continued efforts.
>
> -Dan
>

John
=:->


Re: slow joining very large table to smaller ones

From: Dan Harris
On Jul 14, 2005, at 10:12 PM, John A Meinel wrote:
>
> My biggest question is why the planner thinks the Nested Loop would be
> so expensive.
> Have you tuned any of the parameters? It seems like something is
> out of
> whack. (cpu_tuple_cost, random_page_cost, etc...)
>

here's some of my postgresql.conf.  Feel free to blast me if I did
something idiotic here.

shared_buffers = 50000
effective_cache_size = 1348000
random_page_cost = 3
work_mem = 512000
max_fsm_pages = 80000
log_min_duration_statement = 60000
fsync = true ( not sure if I'm daring enough to run without this )
wal_buffers = 1000
checkpoint_segments = 64
checkpoint_timeout = 3000


#---- FOR PG_AUTOVACUUM --#
stats_command_string = true
stats_row_level = true


Re: slow joining very large table to smaller ones

From: Dan Harris
On Jul 15, 2005, at 9:09 AM, Dan Harris wrote:

>
> On Jul 14, 2005, at 10:12 PM, John A Meinel wrote:
>
>>
>> My biggest question is why the planner thinks the Nested Loop
>> would be
>> so expensive.
>> Have you tuned any of the parameters? It seems like something is
>> out of
>> whack. (cpu_tuple_cost, random_page_cost, etc...)
>>
>>
>
> here's some of my postgresql.conf.  Feel free to blast me if I did
> something idiotic here.
>
> shared_buffers = 50000
> effective_cache_size = 1348000
> random_page_cost = 3
> work_mem = 512000
> max_fsm_pages = 80000
> log_min_duration_statement = 60000
> fsync = true ( not sure if I'm daring enough to run without this )
> wal_buffers = 1000
> checkpoint_segments = 64
> checkpoint_timeout = 3000
>
>
> #---- FOR PG_AUTOVACUUM --#
> stats_command_string = true
> stats_row_level = true
>

Sorry, I forgot to re-post my hardware specs.

HP DL585
4 x 2.2 GHz Opteron
12GB RAM
SmartArray RAID controller, 1GB hardware cache, 4x73GB 10k SCSI in
RAID 0+1
ext2 filesystem

Also, there are 30 databases on the machine, 27 of which have identical
schemas.


Re: slow joining very large table to smaller ones

From: Bruno Wolff III
On Thu, Jul 14, 2005 at 16:29:58 -0600,
  Dan Harris <fbsd@drivefaster.net> wrote:
>
> Ok, I tried this one.  My ssh keeps getting cut off by a router
> somewhere between me and the server due to inactivity timeouts, so
> all I know is that both the select and explain analyze are taking
> over an hour to run.  Here's the explain select for that one, since
> that's the best I can get.

Are you using NAT at home? That's probably where the issue is. If you
have control of that box you can probably increase the timeout to a
couple of hours.

Re: slow joining very large table to smaller ones

From: PFC
>> Ok, I tried this one.  My ssh keeps getting cut off by a router
>> somewhere between me and the server due to inactivity timeouts, so
>> all I know is that both the select and explain analyze are taking
>> over an hour to run.  Here's the explain select for that one, since
>> that's the best I can get.

    one word : screen !
    one of the most useful little command line utilities...

Re: slow joining very large table to smaller ones

From: John A Meinel
Dan Harris wrote:
>
> On Jul 14, 2005, at 10:12 PM, John A Meinel wrote:
>
>>
>> My biggest question is why the planner thinks the Nested Loop would be
>> so expensive.
>> Have you tuned any of the parameters? It seems like something is  out of
>> whack. (cpu_tuple_cost, random_page_cost, etc...)
>>
>
> here's some of my postgresql.conf.  Feel free to blast me if I did
> something idiotic here.
>
> shared_buffers = 50000
> effective_cache_size = 1348000
> random_page_cost = 3
> work_mem = 512000

Unless you are the only person connecting to this database, your
work_mem is very high. And if you haven't modified maintenance_work_mem,
it is probably very low. work_mem might be causing postgres to think it
can fit all of a merge into RAM, making it faster; I can't say for sure.
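One common pattern (just a sketch of how it could be arranged) is to
keep the global work_mem in postgresql.conf modest and raise it only in
the session that runs the big query:

SET work_mem = 512000;   -- roughly 500MB, for this session only (the value is in kB)
-- run the big reporting query here
RESET work_mem;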

> max_fsm_pages = 80000

This seems high, but it depends on how many updates/deletes you get in
between vacuums; it may not be too excessive. VACUUM [FULL] VERBOSE
reports how many free pages are left, if you didn't already use that for
tuning. It should be tuned based on a steady-state situation, though,
not a one-time test.

> log_min_duration_statement = 60000
> fsync = true ( not sure if I'm daring enough to run without this )
> wal_buffers = 1000
> checkpoint_segments = 64
> checkpoint_timeout = 3000
>

These seem fine to me.

Can you include the output of EXPLAIN SELECT both with and without SET
join_collapse_limit? Since your tables have grown, I can't compare the
estimated number of rows and costs very well.

EXPLAIN without ANALYZE is fine, since I am wondering what the planner
thinks things cost.

John
=:->

>
> #---- FOR PG_AUTOVACUUM --#
> stats_command_string = true
> stats_row_level = true
>


Re: slow joining very large table to smaller ones

From: Dawid Kuroczko
On 7/15/05, Bruno Wolff III <bruno@wolff.to> wrote:
> On Thu, Jul 14, 2005 at 16:29:58 -0600,
>   Dan Harris <fbsd@drivefaster.net> wrote:
> >
> > Ok, I tried this one.  My ssh keeps getting cut off by a router
> > somewhere between me and the server due to inactivity timeouts, so
> > all I know is that both the select and explain analyze are taking
> > over an hour to run.  Here's the explain select for that one, since
> > that's the best I can get.
>
> Are you using NAT at home? That's probably where the issue is. If you
> have control of that box you can probably increase the timeout to a
> couple of hours.

Some versions of ssh have such a configuration option (in .ssh/config):

Host *
        ServerAliveInterval 600

...it means that ssh will send a "ping" packet to a sshd every 10 minutes
of inactivity.  This way NAT will see activity and won't kill the session.
I'm using OpenSSH_4.1p1 for this...

Oh, and it doesn't have anything to do with TCP keep alive, which is
rather for finding dead connections than keeping connections alive. ;)

   Regards,
       Dawid