Re: slow joining very large table to smaller ones - Mailing list pgsql-performance

From: Dan Harris
Subject: Re: slow joining very large table to smaller ones
Msg-id: A8A7C7A0-7DC3-449C-A081-C1557DB9E16A@drivefaster.net
In response to: Re: slow joining very large table to smaller ones (John A Meinel <john@arbash-meinel.com>)
List: pgsql-performance
On Jul 14, 2005, at 5:12 PM, John A Meinel wrote:

> Dan Harris wrote:
>
>
>>>
>>> Well, postgres is estimating around 500 rows each, is that way
>>> off?  Try
>>> just doing:
>>> EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
>>> EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;
>>>
>
> Once again, do this and post the results. We might just need to tweak
> your settings so that it estimates the number of rows correctly,
> and we
> don't need to do anything else.
>

Ok, sorry I missed these the first time through:

explain analyze select incidentid from k_b where id = 107;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Index Scan using k_b_idx on k_b  (cost=0.00..1926.03 rows=675 width=14) (actual time=0.042..298.394 rows=2493 loops=1)
   Index Cond: (id = 107)
 Total runtime: 299.103 ms

select count(*) from k_b;
count
--------
698350

(Sorry! I think I said this one only had tens of thousands in it.)


explain analyze select incidentid from k_r where id = 94;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Index Scan using k_r_idx on k_r  (cost=0.00..2137.61 rows=757 width=14) (actual time=0.092..212.187 rows=10893 loops=1)
   Index Cond: (id = 94)
 Total runtime: 216.498 ms
(3 rows)


select count(*) from k_r;
count
--------
671670


That one is quite a bit slower, yet it has the same table structure and
the same index as k_b, and it actually has fewer records.

I did run VACUUM ANALYZE immediately before running these queries.
It seems a lot better with join_collapse_limit set.
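
The row estimates are still well off, though (675 estimated vs. 2493
actual on k_b, 757 vs. 10893 on k_r), so maybe I should also raise the
per-column statistics target and re-ANALYZE. A minimal sketch, assuming
id is the skewed column:

-- Sample more values per column than the default target of 10,
-- so the planner's per-id row estimates get closer to reality:
ALTER TABLE k_b ALTER COLUMN id SET STATISTICS 100;
ALTER TABLE k_r ALTER COLUMN id SET STATISTICS 100;
-- Rebuild the statistics with the larger sample:
ANALYZE k_b (id);
ANALYZE k_r (id);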

>
> Well, the planner is powerful enough to flatten nested selects. To
> make
> it less "intelligent" you can do:
> SET join_collapse_limit = 1;
> or
> SET join_collapse_limit = 0;
> Which should tell postgres to not try and get tricky with your query.
> Again, *usually* the planner knows better than you do. So again
> just do
> it to see what you get.
>

Ok, when join_collapse_limit = 1 I get this now:

explain analyze select recordtext from eventactivity
  join ( select incidentid from k_r join k_b using (incidentid)
         where k_r.id = 94 and k_b.id = 107 ) a using (incidentid);

                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..156509.08 rows=2948 width=35) (actual time=1.555..340.625 rows=24825 loops=1)
   ->  Nested Loop  (cost=0.00..5361.89 rows=6 width=28) (actual time=1.234..142.078 rows=366 loops=1)
         ->  Index Scan using k_b_idx on k_b  (cost=0.00..1943.09 rows=681 width=14) (actual time=0.423..56.974 rows=2521 loops=1)
               Index Cond: (id = 107)
         ->  Index Scan using k_r_idx on k_r  (cost=0.00..5.01 rows=1 width=14) (actual time=0.031..0.031 rows=0 loops=2521)
               Index Cond: ((k_r.id = 94) AND ((k_r.incidentid)::text = ("outer".incidentid)::text))
   ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..25079.55 rows=8932 width=49) (actual time=0.107..0.481 rows=68 loops=366)
         Index Cond: ((eventactivity.incidentid)::text = ("outer".incidentid)::text)
 Total runtime: 347.975 ms

MUCH better!  Maybe you can help me understand what I did, and whether
I need to make something permanent to get this behavior from now on?
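
From skimming the docs, it looks like the setting could stick at a few
different scopes, if pinning it turns out to be the right fix. A sketch,
where the role name is just a placeholder:

SET join_collapse_limit = 1;                 -- current session only
ALTER USER dan SET join_collapse_limit = 1;  -- placeholder role; applies to its new sessions
-- or set join_collapse_limit = 1 in postgresql.conf to apply cluster-wide

But I'd rather understand why it helps before making it permanent.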



>
>
>
> If you have analyzed recently can you do:
> SELECT relname, reltuples FROM pg_class WHERE relname='eventactivity';
>
> It is a cheaper form than "SELECT count(*) FROM eventactivity" to
> get an
> approximate estimate of the number of rows. But if it isn't too
> expensive, please also give the value from SELECT count(*) FROM
> eventactivity.
>
> Again, that helps us know if your tables are up-to-date.
>

Sure:

select relname, reltuples from pg_class where relname='eventactivity';
    relname    |  reltuples
---------------+-------------
 eventactivity | 3.16882e+07

select count(*) from eventactivity;
   count
----------
31871142
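
So the reltuples estimate is within roughly half a percent of the true
count, which suggests the statistics on eventactivity are reasonably fresh.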





>
>
>>
>>
>>
>>> I don't know how selective your keys are, but one of these queries
>>> should probably structure it better for the planner. It depends
>>> a  lot on
>>> how selective your query is.
>>>
>>
>>
>> eventactivity currently has around 36 million rows in it. There
>> should
>> only be maybe 200-300 incidentids at most that will be matched
>> with the
>> combination of k_b and k_r.  That's why I was thinking I  could
>> somehow
>> get a list of just the incidentids that matched the id  = 94 and id =
>> 107 in k_b and k_r first. Then, I would only need to  grab a few
>> hundred
>> out of 36 million rows from eventactivity.
>>
>>
>
> Well, you can also try:
> SELECT count(*) FROM k_b JOIN k_r USING (incidentid)
>  WHERE k_b.id=?? AND k_r.id=??
> ;
>
> That will tell you how many rows they have in common.

select count(*) from k_b join k_r using (incidentid) where k_b.id=107 and k_r.id=94;
count
-------
    373
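
That 373 is right in line with the rows=366 the inner nested loop
produced in the plan above; presumably the small difference is just
rows inserted between the two runs.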



>
> Well, if you look at the latest plans, things have gone up from 44M to
> 156M, I don't know why it is worse, but it is getting there.

I assume this is because k_r and k_b are growing fairly rapidly right
now.  The time in between queries contained a lot of inserts.  I was
careful to VACUUM ANALYZE before sending statistics, as I did this
time.  I'm sorry if this has confused the issue.
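
Since they're growing that quickly, maybe running a plain ANALYZE on
just those two tables in between the full VACUUM ANALYZE passes would
keep the estimates current; it's comparatively cheap:

ANALYZE k_b;
ANALYZE k_r;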



