Performance With Joins on Large Tables - Mailing list pgsql-performance

From Joshua Marsh
Subject Performance With Joins on Large Tables
Date
Msg-id 38242de90609121517y62517040v8622b9ae93cba23d@mail.gmail.com
Whole thread Raw
Responses Re: Performance With Joins on Large Tables
List pgsql-performance
I am having problems performing a join on two large tables.  It seems to only
want to use a sequential scan on the join, but that method seems to be slower
than an index scan.  I've never actually had it complete the sequential scan
because I stop it after 24+ hours.  I've run joins against large tables before
and an index scan was always faster (a few hours at the most).

Here is some information on the two tables:
data=# analyze view_505;
ANALYZE
data=# analyze r3s169;
ANALYZE
data=# \d view_505
               Table "public.view_505"
      Column      |         Type          | Modifiers
------------------+-----------------------+-----------
dsiacctno        | numeric               |
name             | boolean               |
title            | boolean               |
company          | boolean               |
zip4             | boolean               |
acceptcall       | boolean               |
phonedirect      | smallint              |
phonetollfree    | smallint              |
fax              | smallint              |
editdrop         | boolean               |
postsuppress     | boolean               |
firstnameinit    | boolean               |
prefix           | integer               |
crrt             | boolean               |
dpbc             | boolean               |
executive        | integer               |
addressline      | integer               |
multibuyer       | integer               |
activemultibuyer | integer               |
active           | boolean               |
emails           | integer               |
domains          | integer               |
zip1             | character varying(1)  |
zip3             | character varying(3)  |
gender           | character varying(1)  |
topdomains       | bit varying           |
city             | character varying(35) |
state            | character varying(35) |
zip              | character varying(20) |
country          | character varying(30) |
selects          | bit varying           |
files            | integer[]             |
sics             | integer[]             |
custdate         | date                  |
Indexes:
    "view_505_city" btree (city)
    "view_505_dsiacctno" btree (dsiacctno)
    "view_505_state" btree (state)
    "view_505_zip" btree (zip)
    "view_505_zip1" btree (zip1)
    "view_505_zip3" btree (zip3)

data=# \d r3s169
              Table "public.r3s169"
   Column    |          Type          | Modifiers
-------------+------------------------+-----------
dsiacctno   | numeric                |
fileid      | integer                |
customerid  | character varying(20)  |
email       | character varying(100) |
sic2        | character varying(2)   |
sic4        | character varying(4)   |
sic6        | character varying(6)   |
custdate    | date                   |
inqdate     | date                   |
eentrydate  | date                   |
esubdate    | date                   |
efaildate   | date                   |
eunlistdate | date                   |
pentrydate  | date                   |
psubdate    | date                   |
punlistdate | date                   |
pexpiredate | date                   |
lastupdate  | date                   |
emaildrop   | numeric                |
sic8        | character varying(8)   |
Indexes:
    "r3s169_dsiacctno" btree (dsiacctno)

data=# select count(*) from view_505;
   count
-----------
112393845
(1 row)

data=# select count(*) from r3s169;
   count
-----------
285230264
(1 row)


Here is what EXPLAIN says:

data=# EXPLAIN SELECT v.phonedirect, v.editdrop, EXTRACT (EPOCH FROM
v.custdate), EXTRACT (YEAR FROM s.custdate) || '-' || EXTRACT (MONTH
FROM s.custdate) FROM view_505 v INNER JOIN r3s169 s ON v.dsiacctno =
s.dsiacctno;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
Merge Join  (cost=293767607.69..305744319.52 rows=285392608 width=11)
   Merge Cond: ("outer".dsiacctno = "inner".dsiacctno)
   ->  Sort  (cost=127304933.87..127585815.71 rows=112352736 width=20)
         Sort Key: v.dsiacctno
         ->  Seq Scan on view_505 v  (cost=100000000.00..104604059.36
rows=112352736 width=20)
   ->  Sort  (cost=166462673.82..167176155.34 rows=285392608 width=17)
         Sort Key: s.dsiacctno
         ->  Seq Scan on r3s169 s  (cost=100000000.00..106875334.08
rows=285392608 width=17)
(8 rows)



I can't really do and EXPLAIN ANALYZE because the query never really finishes.
Also, I use a cursor to loop through the data.  view_505 isn't a pgsql view, its
just how we decided to name the table.  There is a one to many
relationship between
view_505 and r3s169.

Since enable_seqscan is off, my understanding is that in order for the query
planner to user a sequential scan it must think there is no other alternative.
Both sides are indexed and anaylzed, so that confuses me a little.

I tried it on a smaller sample set of the data and it works fine:

data=# select * into r3s169_test from r3s169 limit 1000000;
SELECT
data=# select * into view_505_test from view_505 limit 1000000;
SELECT
data=# create index r3s169_test_dsiacctno on r3s169_test (dsiacctno);
CREATE INDEX
data=# create index view_505_test_dsiacctno on view_505_test (dsiacctno);
CREATE INDEX
data=# analyze r3s169_test;
ANALYZE
data=# analyze view_505_test;
ANALYZE
data=# EXPLAIN SELECT v.phonedirect, v.editdrop, EXTRACT (EPOCH FROM
v.custdate), EXTRACT (YEAR FROM s.custdate) || '-' || EXTRACT (MONTH
FROM s.custdate) FROM view_505_test v INNER JOIN r3s169_test s ON
v.dsiacctno = s.dsiacctno;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Merge Join  (cost=0.00..1976704.69 rows=1000187 width=11)
   Merge Cond: ("outer".dsiacctno = "inner".dsiacctno)
   ->  Index Scan using view_505_test_dsiacctno on view_505_test v
(cost=0.00..1676260.67 rows=999985 width=20)
   ->  Index Scan using r3s169_test_dsiacctno on r3s169_test s
(cost=0.00..1089028.66 rows=1000186 width=17)
(4 rows)


Is there anything I'm missing that is preventing it from using the index?  It
just seems weird to me that other joins like this work fine and fast
with indexes,
but this one won't.

pgsql-performance by date:

Previous
From: Laszlo Nagy
Date:
Subject: Re: Poor performance on seq scan
Next
From: "Luke Lonergan"
Date:
Subject: Re: Poor performance on seq scan