Sequential scan on FK join

From: Martin Nickel
All,

I can see why the query below is slow.  The lead table is 34 million rows,
and a sequential scan always takes 3+ minutes.  Mailing_id is the PK for
mailing and is constrained as a foreign key (NULLS allowed) in lead.
There is an index on lead.mailing_id, and I've just run VACUUM ANALYZE on
lead.  I don't understand why the index isn't being used.
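Roughly, the relevant structure is this (sketched from memory - the
mailcode column type and the name of the index on lead.mailing_id are
approximate, everything else is as described above):

CREATE TABLE mailing (
    mailing_id  int4 PRIMARY KEY,
    mailcode    text,    -- type approximate
    maildate    date
);
CREATE INDEX mailing_maildate_idx ON mailing (maildate);

CREATE TABLE lead (
    lead_id     int4 PRIMARY KEY,
    mailing_id  int4 REFERENCES mailing (mailing_id)    -- NULLs allowed
);
CREATE INDEX lead_mailing_id_idx ON lead (mailing_id);  -- name approximate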

Thanks for your help,
Martin Nickel

SELECT m.mailcode, l.lead_id
  FROM mailing m
 INNER JOIN lead l ON m.mailing_id = l.mailing_id
 WHERE (m.maildate >= '2005-7-01'::date
         AND m.maildate < '2005-8-01'::date)
-- takes 510,145 ms

EXPLAIN SELECT m.mailcode, l.lead_id
  FROM mailing m
 INNER JOIN lead l ON m.mailing_id = l.mailing_id
 WHERE (m.maildate >= '2005-7-01'::date
         AND m.maildate < '2005-8-01'::date)

Hash Join  (cost=62.13..2001702.55 rows=2711552 width=20)
  Hash Cond: ("outer".mailing_id = "inner".mailing_id)
  ->  Seq Scan on lead l  (cost=0.00..1804198.60 rows=34065260 width=8)
  ->  Hash  (cost=61.22..61.22 rows=362 width=20)
        ->  Index Scan using mailing_maildate_idx on mailing m  (cost=0.00..61.22 rows=362 width=20)
              Index Cond: ((maildate >= '2005-07-01'::date) AND (maildate < '2005-08-01'::date))


Re: Sequential scan on FK join

From: Richard Huxton
Martin Nickel wrote:
> EXPLAIN SELECT m.mailcode, l.lead_id
>   FROM mailing m
>  INNER JOIN lead l ON m.mailing_id = l.mailing_id
>  WHERE (m.maildate >= '2005-7-01'::date
>          AND m.maildate < '2005-8-01'::date)
>
> Hash Join  (cost=62.13..2001702.55 rows=2711552 width=20)
>   Hash Cond: ("outer".mailing_id = "inner".mailing_id)
>   ->  Seq Scan on lead l  (cost=0.00..1804198.60 rows=34065260 width=8)
>   ->  Hash  (cost=61.22..61.22 rows=362 width=20)
>         ->  Index Scan using mailing_maildate_idx on mailing m  (cost=0.00..61.22 rows=362 width=20)
>               Index Cond: ((maildate >= '2005-07-01'::date) AND (maildate < '2005-08-01'::date))

Well the reason *why* is that the planner expects 2.71 million rows to
be matched. If that was the case, then a seq-scan of 34 million rows
might well make sense. The output from EXPLAIN ANALYSE would show us
whether that estimate is correct - is it?

--
   Richard Huxton
   Archonet Ltd

Re: Sequential scan on FK join

From: Martin Nickel
Subject:      Re: Sequential scan on FK join
From:         Martin Nickel <martin@portant.com>
Newsgroups:   pgsql.performance
Date:         Wed, 12 Oct 2005 15:53:35 -0500

Richard, here's the EXPLAIN ANALYZE.  I see your point re: the 2.7M
expected vs the 2 actual, but I've run ANALYZE on the lead table and it
hasn't changed the plan.  Suggestions?

"Hash Join  (cost=62.13..2001702.55 rows=2711552 width=20) (actual
time=40.659..244709.315 rows=2    125270 loops=1)" "  Hash Cond:
("outer".mailing_id = "inner".mailing_id)" "  ->  Seq Scan on lead l
(cost=0.00..1804198.60 rows=34065260 width=8) (actual
time=8.621..180281.094 rows=34060373 loops=1)" "  ->  Hash
(cost=61.22..61.22 rows=362 width=20) (actual time=28.718..28.718 rows=0
loops=1)" "        ->  Index Scan using mailing_maildate_idx on mailing m
(cost=0.00..61.22 rows=362 width=20) (actual time=16.571..27.793 rows=430
loops=1)" "              Index Cond: ((maildate >= '2005-07-01'::date) AND
(maildate < '2005-08-01'::date))" "Total runtime: 248104.339 ms"



Re: Sequential scan on FK join

From: Richard Huxton
Martin Nickel wrote:
> Subject:      Re: Sequential scan on FK join
> From:         Martin Nickel <martin@portant.com>
> Newsgroups:   pgsql.performance
> Date:         Wed, 12 Oct 2005 15:53:35 -0500
>
> Richard, here's the EXPLAIN ANALYZE.  I see your point re: the 2.7M
> expected vs the 2 actual, but I've run ANALYZE on the lead table and it
> hasn't changed the plan.  Suggestions?
>
> Hash Join  (cost=62.13..2001702.55 rows=2711552 width=20)
> (actual time=40.659..244709.315 rows=2 125270 loops=1)
                                        ^^^
Hmm - is that not just a formatting gap there? Is it not 2,125,270 rows
matching, which would suggest PG is getting it more right than wrong?

Try issuing "SET enable_seqscan=false" before running the explain
analyse - that will force the planner to use any indexes it can find and
should show us whether the index would help.
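For example, all in one session (enable_seqscan=false doesn't forbid
seq-scans outright - it just makes the planner avoid them wherever it
has an alternative - and the setting only lasts for your connection):

SET enable_seqscan = false;
EXPLAIN ANALYZE
SELECT m.mailcode, l.lead_id
  FROM mailing m
 INNER JOIN lead l ON m.mailing_id = l.mailing_id
 WHERE m.maildate >= '2005-07-01'::date
   AND m.maildate < '2005-08-01'::date;
SET enable_seqscan = true;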
--
   Richard Huxton
   Archonet Ltd

Re: Sequential scan on FK join

From: Martin Nickel
When I turn off seqscan it does use the index - and it runs 20 to 30%
longer.  Based on that, the planner is correctly choosing a sequential
scan - but that's just hard for me to comprehend.  I'm joining on an int4
key, 2048 per index page - I guess that's a lot of reads - then the
data-page reads.  Still, the 8-minute query time seems excessive.

On Mon, 17 Oct 2005 18:45:38 +0100, Richard Huxton wrote:

> Martin Nickel wrote:
>> Subject:      Re: Sequential scan on FK join
>> From:         Martin Nickel <martin@portant.com>
>> Newsgroups:   pgsql.performance
>> Date:         Wed, 12 Oct 2005 15:53:35 -0500
>>
>> Richard, here's the EXPLAIN ANALYZE.  I see your point re: the 2.7M
>> expected vs the 2 actual, but I've run ANALYZE on the lead table and it
>> hasn't changed the plan.  Suggestions?
>>
>> Hash Join  (cost=62.13..2001702.55 rows=2711552 width=20) (actual
>> time=40.659..244709.315 rows=2 125270 loops=1)
>                                         ^^^
> Hmm - is that not just a formatting gap there? Is it not 2,125,270 rows
> matching which would suggest PG is getting it more right than wrong.
>
> Try issuing "SET enable_seqscan=false" before running the explain analyse
> - that will force the planner to use any indexes it can find and should
> show us whether the index would help.
> --
>    Richard Huxton
>    Archonet Ltd


Re: Sequential scan on FK join

From: Richard Huxton
Martin Nickel wrote:
> When I turn off seqscan it does use the index - and it runs 20 to 30%
> longer.  Based on that, the planner is correctly choosing a sequential
> scan - but that's just hard for me to comprehend.  I'm joining on an int4
> key, 2048 per index page - I guess that's a lot of reads - then the
> data-page reads.  Still, the 8-minute query time seems excessive.

You'll be getting (many) fewer than 2048 index entries per page. There's
a page header and various pointers involved too, and index pages aren't
going to be full. So - it needs to search the table on dates, fetch the
id's and then assemble them for the hash join. Of course, if you have
too many to join then all this will spill to disk slowing you further.
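As a back-of-the-envelope (8k pages assumed; the exact overheads vary
with version and alignment): a 4-byte key rounds up to roughly 16 bytes
once you include the index tuple header, plus a 4-byte item pointer on
the page, so call it ~20 bytes per entry.  8192 / 20 is about 400
entries per page at 100% fill - realistically a few hundred, not 2048.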

Now, you'd rather get down below 8 minutes. There are a number of options:
  1. Make sure your disk i/o is being pushed to its limit
  2. Look into increasing the sort memory for this one query - "set
work_mem..." (see the runtime configuration section of the manual, and
the sketch after this list)
  3. Actually - are you happy that your general configuration is OK?
  4. Perhaps use a cursor - I'm guessing you want to process these
mailings in some way and only want them one at a time in any case.
  5. Try the query one day at a time and see if the balance tips the
other way - you'll be dealing with substantially less data per query
which might match your system better. Of course, this may not be
practical for your application.
  6. If your lead table is updated only rarely, you could try a CLUSTER
on the table by mailing_id - that should speed the scan (also sketched
below). Read the manual for the cluster command first though.
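For points 2 and 6, something like the following (the index name is my
guess - use whatever the index on lead.mailing_id is actually called,
and note that work_mem is measured in kilobytes):

SET work_mem = 65536;    -- 64MB for this session's sorts and hashes
-- then re-run the EXPLAIN ANALYZE above

-- CLUSTER rewrites the whole table in index order and takes an
-- exclusive lock while it runs:
CLUSTER lead_mailing_id_idx ON lead;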

--
   Richard Huxton
   Archonet Ltd

Re: Sequential scan on FK join

From: Martin Nickel
On Tue, 18 Oct 2005 08:52:15 +0100, Richard Huxton wrote:

> Martin Nickel wrote:
>> When I turn off seqscan it does use the index - and it runs 20 to 30%
>> longer.  Based on that, the planner is correctly choosing a sequential
>> scan - but that's just hard for me to comprehend.  I'm joining on an
>> int4 key, 2048 per index page - I guess that's a lot of reads - then the
>> data-page reads.  Still, the 8-minute query time seems excessive.
>
> You'll be getting (many) fewer than 2048 index entries per page. There's a
> page header and various pointers involved too, and index pages aren't
> going to be full. So - it needs to search the table on dates, fetch the
> id's and then assemble them for the hash join. Of course, if you have too
> many to join then all this will spill to disk slowing you further.
>
> Now, you'd rather get down below 8 minutes. There are a number of options:
>   1. Make sure your disk i/o is being pushed to its limit
We are completely peaked out on disk I/O; iostat frequently shows 60%
iowait time.  This is quite an issue for us and I don't have any
great ideas.  Data is on a 3ware SATA RAID configured as RAID 10 across 4
disks.  I can barely even run vacuums on our largest table (lead) since a
vacuum runs for a day and a half and kills our online performance while
running.

> 2. Look into increasing the sort memory for this one query "set
> work_mem..." (see the runtime configuration section of the manual)
I haven't tried this, and I will.  Thanks for the idea.

>   3. Actually - are you happy that your general configuration is OK?
I'm not at all.  Most of the configuration changes I've tried have made
almost no discernable difference.  I'll post the relevant numbers in a
different post - possibly you'll have some suggestions.

> 4. Perhaps use a cursor - I'm guessing you want to process these
> mailings in some way and only want them one at a time in any case.
Where this first came up was in trying to get aggregate totals per
mailing.  I gave up on that and created a nightly job to create a summary
table since Postgres wasn't up to the job in real time.  Still, I
frequently need to do the join and limit it by other criteria - and it is
incredibly slow - even when the result set is smallish.
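The rollup itself is just a grouped aggregate - in essence (the summary
table and column names here are illustrative, not our real ones):

INSERT INTO mailing_summary (mailing_id, lead_count)
SELECT mailing_id, count(*)
  FROM lead
 WHERE mailing_id IS NOT NULL
 GROUP BY mailing_id;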

>   5. Try the query one day at a time and see if the balance tips the
> other way - you'll be dealing with substantially less data per query
> which might match your system better. Of course, this may not be
> practical for your application.
Unfortunately, that isn't useful for us.

>   6. If your lead table is updated only rarely, you could try a CLUSTER
> on the table by mailing_id - that should speed the scan. Read the manual
> for the cluster command first though.
The lead table is one of the most volatile in our system.  Each day we
insert tens or hundreds of thousands of rows, update almost that many, and
delete a few.  It is growing, and could reach 100 million rows in 8 or 9
months.  We're redesigning the data structure a little so lead is not
updated (updates are just too slow), but it will continue to have inserts
and deletes, and we'll have to join it with the associated table being
updated, which already promises to be a slow operation.

We're looking at 15K rpm SCSI drives for a replacement RAID array.  We are
getting to the place where it may be cheaper to convert to Oracle or DB2
than to try and make Postgres work.


Re: Sequential scan on FK join

From: Martin Nickel
On Tue, 18 Oct 2005 08:52:15 +0100, Richard Huxton wrote:
>   3. Actually - are you happy that your general configuration is OK?
We're running dual Opteron 244s with 4G of memory.  The platform is
Suse 9.3, 64 bit.  The database is on a 3ware 9500S-8 sata raid controller
configured raid 10 with 4 drives plus a hot swap.  Drives are
7400 rpm (don't remember model or size).

I'm running Postgres 8.0.3.  Here are some of the relevant conf file
parameters:
shared_buffers = 50000
sort_mem = 8192
work_mem = 256000
vacuum_mem = 32768
max_fsm_pages = 40000
max_fsm_relations = 1000

I realize shared_buffers is too high.  Not sure on the others.  Thanks for
any help you can suggest.  I've moved most of these around and restarted
without any clear change for the better or worse (just a seat-of-the-pants
feel - I haven't tried to benchmark the results of the changes at all).

Thanks,
Martin



Re: Sequential scan on FK join

From: Alvaro Herrera
Martin Nickel wrote:
> On Tue, 18 Oct 2005 08:52:15 +0100, Richard Huxton wrote:
> >   3. Actually - are you happy that your general configuration is OK?
> We're running dual Opteron 244s with 4G of memory.  The platform is
> Suse 9.3, 64 bit.  The database is on a 3ware 9500S-8 sata raid controller
> configured raid 10 with 4 drives plus a hot swap.  Drives are
> 7400 rpm (don't remember model or size).
>
> I'm running Postgres 8.0.3.  Here are some of the relevant conf file
> parameters:
> shared_buffers = 50000
> sort_mem = 8192
> work_mem = 256000

Interesting that you set both sort_mem and work_mem.  Do you realize
that the former is an obsolete name, and currently a synonym for the
latter?  Maybe the problem is that you are using too much memory for
sorts, forcing swap usage, etc.
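You can confirm this from any session - both names report the same
underlying setting, and when both appear in postgresql.conf the one
that comes later in the file wins:

SHOW sort_mem;    -- reports the same value as work_mem
SHOW work_mem;    -- 256000 means 256000 kB, i.e. ~250 MB per sort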

--
Alvaro Herrera                 http://www.amazon.com/gp/registry/DXLWNGRJD34J
"La persona que no quería pecar / estaba obligada a sentarse
 en duras y empinadas sillas    / desprovistas, por cierto
 de blandos atenuantes"                          (Patricio Vogel)