Seq Scan used instead of Index Scan - Mailing list pgsql-performance

From Gary Warner
Subject Seq Scan used instead of Index Scan
Date
Msg-id 1930791595.3607.1322087076188.JavaMail.root@zimbra.cis.uab.edu
Whole thread Raw
Responses Re: Seq Scan used instead of Index Scan  (Claudio Freire <klaussfreire@gmail.com>)
Re: Seq Scan used instead of Index Scan  (Mark Kirkwood <mark.kirkwood@catalyst.net.nz>)
Re: Seq Scan used instead of Index Scan  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
Very Fast Version:

Recently my database stopped respecting one of my indexes, which took a query that should run in "subsecond response
time"and turning it into something that with small data sets runs in the 7-10 minute range and with large data sets
runsin the 30 minute - eternity range. 

Explain Analyze tells me that what used to be an Index Scan has become a Seq Scan, doing a full table scan through 140
millionrecords. 

Any thoughts on why that happens?

+++++++++++++++++++
Longer version:

I have a very large database (billion+ records) that has begun not respecting indexes on some queries, resulting in
whatused to be "instant answers" now taking many minutes and I'm trying to figure out why. 

I'll try to simplify the request this way:

I have a database of many hundreds of millions of email messages, and information about those messages.  There is a
tablethat tells me when I received emails and from what source.  There is another table that tells what URLs were seen
inthe bodies of those emails and there is a table that links those two tables together.  Since many emails can contain
thesame URL, and many URLs can be seen in each email, we divide like this: 

Tables
===========
emails

link_urls

email_links

email.message_id     = link_url.message_id
linkurl.urlid        = email_links.urlid


One attribute of the URL is the "hostname" portion.

In my puzzle, I have a table of hostnames, and I want to get statistics about which emails contained URLs that pointed
tothose hostnames. 

A very simplified version of the query could be:

select email.stuff from email natural join link_url natural join email_links where hostname = 'foo.bar.com';


We know that there were two emails that advertised foo.bar.com, so the objective is to link the presence of foo.bar.com
toa URLID, use the link_url table to find the respective message_id and then go ask the email table for details about
thosetwo messages. 


My "explain analyze" of that query takes seven minutes to complete, despite the fact that I have an index on
message_id,and urlid, and machine where they occur in each of the three tables. 

We broke the query down into it's components, and confirmed that what we consider the three natural pieces of the query
eachrun "blazing fast", each using the proper indexes.  But when we put them all together, the index gets ignored, and
wesequential scan a 140 million row table. 

Which is slow.


===============================
If we break it into pieces ... a human might think of the pieces as:


============

explain analyze select urlid from email_links where machine = 'foo.bar.com';

                               QUERY PLAN
-------------------------------------------------------------------------------------------
 Bitmap Heap Scan on email_links  (cost=97.10..9650.58 rows=2457 width=33) (actual time=0.049..0.049 rows=1 loops=1)
   Recheck Cond: (machine = 'foo.bar.com'::text)
   ->  Bitmap Index Scan on hostdex  (cost=0.00..96.49 rows=2457 width=0) (actual time=0.039..0.039 rows=1 loops=1)
         Index Cond: (machine = 'foo.bar.com'::text)
 Total runtime: 0.066 ms
(5 rows)


=============

=> explain analyze select message_id from link_url where urlid = '9de6440fcc089c654806bd4c853c76f1';
  QUERY PLAN
-------------------------------------------------------------------------------------------
 Bitmap Heap Scan on link_url  (cost=65.20..5680.14 rows=1437 width=4) (actual time=0.074..0.083 rows=2 loops=1)
   Recheck Cond: (urlid = '9de6440fcc089c654806bd4c853c76f1'::text)
   ->  Bitmap Index Scan on link_url_hkey  (cost=0.00..64.84 rows=1437 width=0) (actual time=0.021..0.021 rows=2
loops=1)
         Index Cond: (urlid = '9de6440fcc089c654806bd4c853c76f1'::text)
 Total runtime: 0.109 ms
(5 rows)

=================

explain analyze select stuff from email where message_id in (78085350, 78088168);
 QUERY PLAN
-------------------------------------------------------------------------------------------
 Bitmap Heap Scan on email  (cost=17.19..25.21 rows=2 width=8) (actual time=0.068..0.077 rows=2 loops=1)
   Recheck Cond: (message_id = ANY ('{78085350,78088168}'::integer[]))
   ->  Bitmap Index Scan on email_pkey  (cost=0.00..17.19 rows=2 width=0) (actual time=0.054..0.054 rows=2 loops=1)
         Index Cond: (message_id = ANY ('{78085350,78088168}'::integer[]))
 Total runtime: 0.100 ms
(5 rows)

++++++++++++++++++++++


In my simple mind that should take (.1 + .109 + .066) = FAST.

Here's the "Explain Analyze" for the combined version:

++++++++++++++++++++++++
explain analyze select email.stuff from email natural join link_url natural join email_link where machine =
'foo.bar.com';

 QUERY PLAN
-------------------------------------------------------------------------------------------
 Merge Join  (cost=3949462.38..8811048.82 rows=4122698 width=7) (actual time=771578.076..777749.755 rows=3 loops=1)
   Merge Cond: (email.message_id = link_url.message_id)
   ->  Index Scan using email_pkey on email  (cost=0.00..4561330.19 rows=79154951 width=11) (actual
time=0.041..540883.445rows=79078427 loops=1) 
   ->  Materialize  (cost=3948986.49..4000520.21 rows=4122698 width=4) (actual time=227023.820..227023.823 rows=3
loops=1)
         ->  Sort  (cost=3948986.49..3959293.23 rows=4122698 width=4) (actual time=227023.816..227023.819 rows=3
loops=1)
               Sort Key: link_url.message_id
               Sort Method:  quicksort  Memory: 25kB
               ->  Hash Join  (cost=9681.33..3326899.30 rows=4122698 width=4) (actual time=216443.617..227023.798
rows=3loops=1) 
                     Hash Cond: (link_url.urlid = email_link.urlid)
                     ->  Seq Scan on link_url  (cost=0.00..2574335.33 rows=140331133 width=37) (actual
time=0.013..207980.261rows=140330592 lo 
ops=1)
                     ->  Hash  (cost=9650.62..9650.62 rows=2457 width=33) (actual time=0.074..0.074 rows=1 loops=1)
                           ->  Bitmap Heap Scan on email_link  (cost=97.10..9650.62 rows=2457 width=33) (actual
time=0.072..0.072rows=1 loops=1 
)
                                 Recheck Cond: (hostname = 'foo.bar.com'::text)
                                 ->  Bitmap Index Scan on hostdex  (cost=0.00..96.49 rows=2457 width=0) (actual
time=0.060..0.060rows=1 loops=1 
)
                                       Index Cond: (hostname = 'foo.bar.com'::text)
 Total runtime: 777749.820 ms
(16 rows)

++++++++++++++++++

See that "Seq Scan on link_url"?  We can't figure out why that is there!  We should be scanning for a matching "urlid"
andwe have an index on "urlid"? 

When this is happening in a "two table" version of this problem, we can get temporary relief by giving the statement:

set enable_seqscan = false;

But in the "three table" version (which is itself a simplification of the real problem, which is that instead of
lookingfor 'foo.bar.com', I'm looking for all the hostnames in the table "testurls") the "enable_seqscan = false"
doesn'tseem to do anything. 


++++++++++++++++++

Does anyone have suggestions as to why my database has started ignoring this index in certain circumstances?   We've
trieda variety of different joins and subselects and can't seem to beat the "combined query" index failure. 

The work-around has been to actually write code that makes the query for urlids, and then manually does the query into
thetable for message_ids, and then manually does the query for email.stuff.  It's bad fast, but we can't figure out why
'theold way' quit working. 

Suggestions welcome!




--

----------------------------------------------------------

Gary Warner

-----------------------------------------------------------


pgsql-performance by date:

Previous
From: Greg Smith
Date:
Subject: Re: SSD endurance calculations
Next
From: Claudio Freire
Date:
Subject: Re: Seq Scan used instead of Index Scan