Re: Performance tips - Mailing list pgsql-general

From Tom Lane
Subject Re: Performance tips
Date
Msg-id 21703.1010688265@sss.pgh.pa.us
Whole thread Raw
In response to Performance tips  (Andrew Perrin <andrew_perrin@unc.edu>)
Responses Re: Performance tips  (Andrew Perrin <andrew_perrin@unc.edu>)
List pgsql-general
Andrew Perrin <andrew_perrin@unc.edu> writes:
> auth=# EXPLAIN select count(patternid) from patterns where patternid in
> (select
> auth(# o_patternid from letters, pattern_occurrences where
> letters.letterid =
> auth(# pattern_occurrences.o_letterid and letters.datecat in (1,2));
> NOTICE:  QUERY PLAN:

> Aggregate  (cost=10770432787318.88..10770432787318.88 rows=1 width=4)
>   ->  Seq Scan on patterns  (cost=0.00..10770432756138.14 rows=12472297
> width=4)
>         SubPlan
>           ->  Materialize  (cost=863548.43..863548.43 rows=5749731
> width=12)
>                 ->  Hash Join  (cost=1741.00..863548.43 rows=5749731
> width=12)
>                       ->  Seq Scan on pattern_occurrences
> (cost=0.00..250248.56 rows=15287556 width=8)
>                       ->  Hash  (cost=1729.67..1729.67 rows=4530 width=4)
>                             ->  Seq Scan on letters  (cost=0.00..1729.67
> rows=4530 width=4)


Well, it's materializing the subquery result, which is good, but are
there really going to be 5.7M rows in the subquery result?  If so,
no wonder you're hurting: the IN is going to be scanning through that
result for each row from the outer query, until it either finds a match
or reaches the end.  Can you reduce the size of the subquery result at
all?  (If the subquery as written produces a lot of duplicate
o_patternids, then making it be a SELECT DISTINCT might help.)

The long-term answer is probably that you need to convert the IN to some
smarter form of join.  One idea that comes to mind is

select count(patternid)
from patterns,
    (select distinct o_patternid from letters, pattern_occurrences where
     letters.letterid = pattern_occurrences.o_letterid
     and letters.datecat in (1,2)) AS ss
where patternid = ss.o_patternid;

Given the "select distinct" to ensure there are no duplicates in the
subselect output, this should produce the same output as the original,
I think, and it would give the planner a shot at using a merge or hash
join to match up the pattern id values.

Oh, BTW: you might also try kicking up sort_mem if you didn't already.

            regards, tom lane

pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: Performance tips
Next
From: Andrew Perrin
Date:
Subject: Re: Performance tips