Re: Performance tips - Mailing list pgsql-general

From Andrew Perrin
Subject Re: Performance tips
Date
Msg-id Pine.LNX.4.21L1.0201101439510.404-100000@hm269-26876.socsci.unc.edu
In response to Re: Performance tips  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
There probably were about 5.7M rows in the subquery result - the DISTINCT
helped, but not enough.

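For anyone curious, the actual distinct count is easy to check directly -
something along these lines (untested sketch, just reusing the same tables
and join condition as the query below):

select count(distinct o_patternid)
from letters, pattern_occurrences
where letters.letterid = pattern_occurrences.o_letterid
  and letters.datecat in (1,2);
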
I used your (Tom Lane's) rewritten query, though, which was EXPLAINed as
hugely less expensive:

auth=# explain select count(patternid)
auth-# from patterns,
auth-#     (select distinct o_patternid from letters, pattern_occurrences where
auth(#      letters.letterid = pattern_occurrences.o_letterid
auth(#      and letters.datecat in (1,2)) AS ss
auth-# where patternid = ss.o_patternid;
NOTICE:  QUERY PLAN:

Aggregate  (cost=4486533.96..4486533.96 rows=1 width=8)
  ->  Merge Join  (cost=4322005.65..4485096.53 rows=574973 width=8)
        ->  Sort  (cost=2362674.85..2362674.85 rows=12472297 width=4)
              ->  Seq Scan on patterns  (cost=0.00..259225.97 rows=12472297 width=4)
        ->  Sort  (cost=1962135.80..1962135.80 rows=574973 width=12)
              ->  Subquery Scan ss  (cost=1882121.28..1896495.61 rows=574973 width=12)
                    ->  Unique  (cost=1882121.28..1896495.61 rows=574973 width=12)
                          ->  Sort  (cost=1882121.28..1882121.28 rows=5749731 width=12)
                                ->  Hash Join  (cost=1741.00..863548.43 rows=5749731 width=12)
                                      ->  Seq Scan on pattern_occurrences  (cost=0.00..250248.56 rows=15287556 width=8)
                                      ->  Hash  (cost=1729.67..1729.67 rows=4530 width=4)
                                            ->  Seq Scan on letters  (cost=0.00..1729.67 rows=4530 width=4)


and it finished in about 12 minutes - far more manageable.

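In case it helps someone else, Tom's sort_mem suggestion (quoted below) can
be applied per session before running the big query - something like the
following (the number is just an illustrative value; sort_mem is measured
in kilobytes):

SET sort_mem = 32768;  -- example only: roughly 32MB for sorts/hashes in this session
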
Thanks, everyone, for your help - it's much appreciated! I'll keep y'all
abreast of the project's development.

Andy Perrin

On Thu, 10 Jan 2002, Tom Lane wrote:

> Andrew Perrin <andrew_perrin@unc.edu> writes:
> > auth=# EXPLAIN select count(patternid) from patterns where patternid in (select
> > auth(# o_patternid from letters, pattern_occurrences where letters.letterid =
> > auth(# pattern_occurrences.o_letterid and letters.datecat in (1,2));
> > NOTICE:  QUERY PLAN:
>
> > Aggregate  (cost=10770432787318.88..10770432787318.88 rows=1 width=4)
> >   ->  Seq Scan on patterns  (cost=0.00..10770432756138.14 rows=12472297 width=4)
> >         SubPlan
> >           ->  Materialize  (cost=863548.43..863548.43 rows=5749731 width=12)
> >                 ->  Hash Join  (cost=1741.00..863548.43 rows=5749731 width=12)
> >                       ->  Seq Scan on pattern_occurrences  (cost=0.00..250248.56 rows=15287556 width=8)
> >                       ->  Hash  (cost=1729.67..1729.67 rows=4530 width=4)
> >                             ->  Seq Scan on letters  (cost=0.00..1729.67 rows=4530 width=4)
>
>
> Well, it's materializing the subquery result, which is good, but are
> there really going to be 5.7M rows in the subquery result?  If so,
> no wonder you're hurting: the IN is going to be scanning through that
> result for each row from the outer query, until it either finds a match
> or reaches the end.  Can you reduce the size of the subquery result at
> all?  (If the subquery as written produces a lot of duplicate
> o_patternids, then making it be a SELECT DISTINCT might help.)
>
> The long-term answer is probably that you need to convert the IN to some
> smarter form of join.  One idea that comes to mind is
>
> select count(patternid)
> from patterns,
>     (select distinct o_patternid from letters, pattern_occurrences where
>      letters.letterid = pattern_occurrences.o_letterid
>      and letters.datecat in (1,2)) AS ss
> where patternid = ss.o_patternid;
>
> Given the "select distinct" to ensure there are no duplicates in the
> subselect output, this should produce the same output as the original,
> I think, and it would give the planner a shot at using a merge or hash
> join to match up the pattern id values.
>
> Oh, BTW: you might also try kicking up sort_mem if you didn't already.
>
>             regards, tom lane
>


----------------------------------------------------------------------
Andrew J Perrin - andrew_perrin@unc.edu - http://www.unc.edu/~aperrin
 Assistant Professor of Sociology, U of North Carolina, Chapel Hill
      269 Hamilton Hall, CB#3210, Chapel Hill, NC 27599-3210 USA
