Re: Performance tips - Mailing list pgsql-general
From: Andrew Perrin
Subject: Re: Performance tips
Msg-id: Pine.LNX.4.21L1.0201101439510.404-100000@hm269-26876.socsci.unc.edu
In response to: Re: Performance tips (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
There probably were 5.7M rows in the subquery result - the DISTINCT helped
that, but not enough. I used your (Tom Lane's) rewritten query, though,
which EXPLAIN showed to be hugely less expensive:

auth=# explain select count(patternid)
auth-# from patterns,
auth-#      (select distinct o_patternid from letters, pattern_occurrences where
auth(#       letters.letterid = pattern_occurrences.o_letterid
auth(#       and letters.datecat in (1,2)) AS ss
auth-# where patternid = ss.o_patternid;
NOTICE:  QUERY PLAN:

Aggregate  (cost=4486533.96..4486533.96 rows=1 width=8)
  ->  Merge Join  (cost=4322005.65..4485096.53 rows=574973 width=8)
        ->  Sort  (cost=2362674.85..2362674.85 rows=12472297 width=4)
              ->  Seq Scan on patterns  (cost=0.00..259225.97 rows=12472297 width=4)
        ->  Sort  (cost=1962135.80..1962135.80 rows=574973 width=12)
              ->  Subquery Scan ss  (cost=1882121.28..1896495.61 rows=574973 width=12)
                    ->  Unique  (cost=1882121.28..1896495.61 rows=574973 width=12)
                          ->  Sort  (cost=1882121.28..1882121.28 rows=5749731 width=12)
                                ->  Hash Join  (cost=1741.00..863548.43 rows=5749731 width=12)
                                      ->  Seq Scan on pattern_occurrences  (cost=0.00..250248.56 rows=15287556 width=8)
                                      ->  Hash  (cost=1729.67..1729.67 rows=4530 width=4)
                                            ->  Seq Scan on letters  (cost=0.00..1729.67 rows=4530 width=4)

and it finished in about 12 minutes - far more manageable. Thanks,
everyone, for your help - it's much appreciated! I'll keep y'all abreast
of the project's development.
Andy Perrin

On Thu, 10 Jan 2002, Tom Lane wrote:

> Andrew Perrin <andrew_perrin@unc.edu> writes:
> > auth=# EXPLAIN select count(patternid) from patterns where patternid in (select
> > auth(# o_patternid from letters, pattern_occurrences where letters.letterid =
> > auth(# pattern_occurrences.o_letterid and letters.datecat in (1,2));
> > NOTICE:  QUERY PLAN:
>
> > Aggregate  (cost=10770432787318.88..10770432787318.88 rows=1 width=4)
> >   ->  Seq Scan on patterns  (cost=0.00..10770432756138.14 rows=12472297 width=4)
> >         SubPlan
> >           ->  Materialize  (cost=863548.43..863548.43 rows=5749731 width=12)
> >                 ->  Hash Join  (cost=1741.00..863548.43 rows=5749731 width=12)
> >                       ->  Seq Scan on pattern_occurrences  (cost=0.00..250248.56 rows=15287556 width=8)
> >                       ->  Hash  (cost=1729.67..1729.67 rows=4530 width=4)
> >                             ->  Seq Scan on letters  (cost=0.00..1729.67 rows=4530 width=4)
>
> Well, it's materializing the subquery result, which is good, but are
> there really going to be 5.7M rows in the subquery result?  If so,
> no wonder you're hurting: the IN is going to be scanning through that
> result for each row from the outer query, until it either finds a match
> or reaches the end.  Can you reduce the size of the subquery result at
> all?  (If the subquery as written produces a lot of duplicate
> o_patternids, then making it be a SELECT DISTINCT might help.)
>
> The long-term answer is probably that you need to convert the IN to some
> smarter form of join.
> One idea that comes to mind is
>
>   select count(patternid)
>   from patterns,
>        (select distinct o_patternid from letters, pattern_occurrences where
>         letters.letterid = pattern_occurrences.o_letterid
>         and letters.datecat in (1,2)) AS ss
>   where patternid = ss.o_patternid;
>
> Given the "select distinct" to ensure there are no duplicates in the
> subselect output, this should produce the same output as the original,
> I think, and it would give the planner a shot at using a merge or hash
> join to match up the pattern id values.
>
> Oh, BTW: you might also try kicking up sort_mem if you didn't already.
>
> 			regards, tom lane

----------------------------------------------------------------------
Andrew J Perrin - andrew_perrin@unc.edu - http://www.unc.edu/~aperrin
Assistant Professor of Sociology, U of North Carolina, Chapel Hill
269 Hamilton Hall, CB#3210, Chapel Hill, NC 27599-3210 USA
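[Editor's note: Tom's closing suggestion to "kick up sort_mem" refers to the per-session sort memory setting in PostgreSQL 7.x (renamed work_mem in later releases), measured in kilobytes. A minimal sketch of how it might be applied to the rewritten query follows; the 65536 (64MB) figure is purely illustrative, not a value from this thread.]

  -- Raise sort_mem for this session so the large sorts feeding the
  -- merge join are likelier to run in memory rather than spilling to
  -- disk. The value is in kB; 65536 here is only an example figure.
  SET sort_mem = 65536;

  SELECT count(patternid)
  FROM patterns,
       (SELECT DISTINCT o_patternid
        FROM letters, pattern_occurrences
        WHERE letters.letterid = pattern_occurrences.o_letterid
          AND letters.datecat IN (1, 2)) AS ss
  WHERE patternid = ss.o_patternid;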