Thread: indexes across joins not used for count

indexes across joins not used for count

From
Jeremy Wells
Date:
I'm running a query to do a count with two joins in it. I've added
indexes to the tables for the join columns, but the explain of the query
doesn't seem to be using the indexes:

Table 1:
   invites (id:int)

Table 2:
   sms_requests (id:int, invoker_id:int, invoker_type:string,
sms_message_id:int)
   Indexes:
     "sms_requests_pkey" PRIMARY KEY, btree (id)
     "index_sms_requests_on_invoker_id_and_invoker_type" btree
(invoker_id, invoker_type)
     "index_sms_requests_on_sms_message_id" btree (sms_message_id)

Table 3:
   sms_messages (id:int, sent_at:timestamp)
   Indexes:
     "sms_messages_pkey" PRIMARY KEY, btree (id)
     "index_sms_messages_on_sent_at_partial" btree (sent_at) WHERE
sent_at IS NULL
     "index_sms_messages_on_sent_at" btree (sent_at)

Query:

SELECT COUNT(*) FROM "invites" INNER JOIN "sms_requests" ON
"sms_requests"."invoker_id" = "invites"."id" AND
"sms_requests"."invoker_type" = 'Invite' INNER JOIN "sms_messages" ON
"sms_messages"."id" = "sms_requests"."sms_message_id" WHERE
"sms_messages"."sent_at" IS NOT NULL

Explain:

  Aggregate  (cost=165914.42..165914.43 rows=1 width=0)
    ->  Hash Join  (cost=92326.82..163534.87 rows=951821 width=0)
          Hash Cond: (sms_requests.sms_message_id = sms_messages.id)
          ->  Hash Join  (cost=32692.53..83674.38 rows=951821 width=4)
                Hash Cond: (invites.id = sms_requests.invoker_id)
                ->  Seq Scan on invites  (cost=0.00..27525.48
rows=1238948 width=4)
                ->  Hash  (cost=20794.76..20794.76 rows=951821 width=8)
                      ->  Seq Scan on sms_requests  (cost=0.00..20794.76
rows=951821 width=8)
                            Filter: ((invoker_type)::text = 'Invite'::text)
          ->  Hash  (cost=48180.24..48180.24 rows=916324 width=4)
                ->  Seq Scan on sms_messages  (cost=0.00..48180.24
rows=916324 width=4)
                      Filter: (sent_at IS NOT NULL)

This is pretty slow, ~5000ms on my development machine. I would have
expected it to be able to make use of the indexes I've created. Any
ideas on what I can do to make this perform better?

Jeremy



Re: indexes across joins not used for count

From
Jeff Davis
Date:
On Tue, 2012-10-16 at 16:08 +1300, Jeremy Wells wrote:
> I'm running a query to do a count with two joins in it. I've added
> indexes to the tables for the join columns, but the explain of the query
> doesn't seem to be using the indexes:

Can you post the output of EXPLAIN ANALYZE? Did you do an ANALYZE of the
tables already?

You can often force an index scan by doing:

   SET enable_seqscan=false;

So also try setting that, and run EXPLAIN ANALYZE again, and see if it
uses the indexes, and if so, if it's faster.

Regards,
    Jeff Davis



Re: indexes across joins not used for count

From
Jeremy Wells
Date:
Jeff Davis wrote:
Can you post the output of EXPLAIN ANALYZE? Did you do an ANALYZE of the
tables already?
EXPLAIN ANALYZE SELECT COUNT(*) FROM "invites" INNER JOIN "sms_requests" ON "sms_requests"."invoker_id" = "invites"."id" AND "sms_requests"."invoker_type" = 'Invite' INNER JOIN "sms_messages" ON "sms_messages"."id" = "sms_requests"."sms_message_id" WHERE "sms_messages"."sent_at" IS NOT NULL;

                                                                   QUERY PLAN                                                                   
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=129556.14..129556.14 rows=1 width=0) (actual time=21530.146..21530.147 rows=1 loops=1)
   ->  Hash Join  (cost=81261.64..128882.20 rows=1347872 width=0) (actual time=6961.957..20437.245 rows=1340639 loops=1)
         Hash Cond: (sms_requests.sms_message_id = sms_messages.id)
         ->  Hash Join  (cost=22033.10..62240.37 rows=1347872 width=4) (actual time=3618.615..11410.322 rows=1340639 loops=1)
               Hash Cond: (invites.id = sms_requests.invoker_id)
               ->  Seq Scan on invites  (cost=0.00..33216.78 rows=1683927 width=4) (actual time=0.009..1983.440 rows=1683927 loops=1)
               ->  Hash  (cost=17315.55..17315.55 rows=1347872 width=8) (actual time=3617.847..3617.847 rows=1347872 loops=1)
                     Buckets: 262144  Batches: 1  Memory Usage: 52652kB
                     ->  Seq Scan on sms_requests  (cost=0.00..17315.55 rows=1347872 width=8) (actual time=0.027..1651.786 rows=1347872 loops=1)
                           Filter: ((invoker_type)::text = 'Invite'::text)
         ->  Hash  (cost=55408.40..55408.40 rows=1091467 width=4) (actual time=3342.029..3342.029 rows=1107628 loops=1)
               Buckets: 131072  Batches: 1  Memory Usage: 38941kB
               ->  Seq Scan on sms_messages  (cost=0.00..55408.40 rows=1091467 width=4) (actual time=0.009..1737.765 rows=1107628 loops=1)
                     Filter: (sent_at IS NOT NULL)
 Total runtime: 21530.247 ms


You can often force an index scan by doing:
  SET enable_seqscan=false;

So also try setting that, and run EXPLAIN ANALYZE again, and see if it
uses the indexes, and if so, if it's faster.
=> SET enable_seqscan=false;
SET
=> EXPLAIN ANALYZE SELECT COUNT(*) FROM "invites" INNER JOIN "sms_requests" ON "sms_requests"."invoker_id" = "invites"."id" AND "sms_requests"."invoker_type" = 'Invite' INNER JOIN "sms_messages" ON "sms_messages"."id" = "sms_requests"."sms_message_id" WHERE "sms_messages"."sent_at" IS NOT NULL;

                                                   QUERY PLAN                                                                                      

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=198896.67..198896.67 rows=1 width=0) (actual time=21923.378..21923.378 rows=1 loops=1)
   ->  Hash Join  (cost=90165.47..198222.73 rows=1347872 width=0) (actual time=5465.116..20552.881 rows=1340639 loops=1)
         Hash Cond: (sms_requests.invoker_id = invites.id)
         ->  Merge Join  (cost=7.29..101325.20 rows=1347872 width=4) (actual time=0.035..9630.580 rows=1347872 loops=1)
               Merge Cond: (sms_messages.id = sms_requests.sms_message_id)
               ->  Index Scan using sms_messages_pkey on sms_messages  (cost=0.00..69875.42 rows=1091467 width=4) (actual time=0.011..3009.455 rows=1107628 loops=1)
                     Filter: (sent_at IS NOT NULL)
               ->  Index Scan using index_sms_requests_on_sms_message_id on sms_requests  (cost=0.00..26190.25 rows=1347872 width=8) (actual time=0.014..2257.271 rows=1347872 loops=1)
                     Filter: ((invoker_type)::text = 'Invite'::text)
         ->  Hash  (cost=84264.43..84264.43 rows=1683927 width=4) (actual time=5462.709..5462.709 rows=1683927 loops=1)
               Buckets: 262144  Batches: 1  Memory Usage: 59201kB
               ->  Index Scan using invites_pkey on invites  (cost=0.00..84264.43 rows=1683927 width=4) (actual time=0.159..2763.626 rows=1683927 loops=1)
 Total runtime: 21923.504 ms




Re: indexes across joins not used for count

From
Jeff Davis
Date:
On Sun, 2012-10-21 at 08:37 +1300, Jeremy Wells wrote:
> Jeff Davis wrote:
> > Can you post the output of EXPLAIN ANALYZE? Did you do an ANALYZE of the
> > tables already?
> EXPLAIN ANALYZE SELECT COUNT(*) FROM "invites" INNER JOIN
> "sms_requests" ON "sms_requests"."invoker_id" = "invites"."id" AND
> "sms_requests"."invoker_type" = 'Invite' INNER JOIN "sms_messages" ON
> "sms_messages"."id" = "sms_requests"."sms_message_id" WHERE
> "sms_messages"."sent_at" IS NOT NULL;
>
Interesting... it looks like the two plans take about the same amount of
time, so the planner is not making a mistake.

I tried making three tables about the same size as yours, and then doing
a three-way join, and on my machine it took closer to a second. So
there's clearly something else going on with your data. Maybe the data
distribution is skewed (some values having many matches in another
table, others having none)?

Regards,
    Jeff Davis



Re: indexes across joins not used for count

From
Jeremy Wells
Date:
Jeff Davis wrote:
>
> I tried making three tables about the same size as yours, and then doing
> a three-way join, and on my machine it took closer to a second. So
> there's clearly something else going on with your data. Maybe the data
> distribution is skewed (some values having many matches in another
> table, others having none)?
>
Every row in sms_requests will match a row in sms_messages (but maybe
the same row as sms_requests n<-->1 sms_messages). Each row in invites
will have either 0 or 1 row in sms_requests.

Given that the number of sms_messages with sent_at IS NULL will remain
pretty constant will the query continue to run at about the same speed
or will it slow significantly as each table grows in size? Although the
time of the count is slow it's not giving me problems yet. Can I speed
it up? Would it make sense to denormalize the sent_at time into a
boolean on the sms_requests table?

Thanks for your help so far,
Jeremy


Re: indexes across joins not used for count

From
Tom Lane
Date:
Jeremy Wells <jemmyw@gmail.com> writes:
> Given that the number of sms_messages with sent_at IS NULL will remain
> pretty constant will the query continue to run at about the same speed
> or will it slow significantly as each table grows in size?

At some point the IS NULL condition will become selective enough that
it's worth using an index on sent_at to pull out just those rows.
The planner evidently thinks you are not there yet.  (Or, perhaps, you
are using a version too old to be able to use an index for IS NULL?
You didn't mention your PG version, which is a fundamental mistake in
any performance-related question.)

My overall take on this is that it's folly to imagine that indexes are a
magic bullet for full-table joins: if the query has to read most/all of
the table anyway, an index doesn't help much.  The plans you've shown
look perfectly sane from here.

If you were using 9.2, and the tables were relatively static, index-only
scans might make sense.  But they're no magic bullet either.

            regards, tom lane


Re: indexes across joins not used for count

From
Jeremy Wells
Date:
Tom Lane wrote:
> (Or, perhaps, you
> are using a version too old to be able to use an index for IS NULL?
> You didn't mention your PG version, which is a fundamental mistake in
> any performance-related question.)
I'm using 9.1