Thread: Optimising a query requiring seqscans=0
Hi,

We recently upgraded our trac backend from sqlite to postgres, and I decided
to have a little fun and write some reports that delve into trac's subversion
cache, and got stuck with a query optimisation problem.

Table revision contains 2800+ rows.
Table node_change contains 370000+ rows.

rev is a TEXT field on both, containing numerical data (not my choice: trac
does it like this to support VCS backends with non-numerical revision
identifiers).

I've got stuck with this query:

SELECT author, COUNT(DISTINCT r.rev)
  FROM revision AS r
  LEFT JOIN node_change AS nc
    ON r.rev = nc.rev
 WHERE r.time >= EXTRACT(epoch FROM (NOW() - interval '30 days'))::integer
 GROUP BY r.author;

Currently it shows the number of commits per author in the last 30 days. The
join to node_change is superfluous for this purpose, but was added because I
intended to add new columns which draw on that table, such as the number of
files added, deleted etc. I never got that far, however, due to the planner
problem:

GroupAggregate  (cost=56755.41..56852.28 rows=2 width=17) (actual time=4836.433..4897.458 rows=25 loops=1)
  ->  Sort  (cost=56755.41..56787.69 rows=12913 width=17) (actual time=4836.233..4851.968 rows=22419 loops=1)
        Sort Key: r.author
        ->  Merge Left Join  (cost=53886.10..55873.68 rows=12913 width=17) (actual time=4600.733..4641.749 rows=22419 loops=1)
              Merge Cond: ("outer".rev = "inner".rev)
              ->  Sort  (cost=93.78..96.24 rows=982 width=17) (actual time=7.050..7.383 rows=1088 loops=1)
                    Sort Key: r.rev
                    ->  Index Scan using revision_time_idx on revision r  (cost=0.01..44.98 rows=982 width=17) (actual time=0.191..4.014 rows=1088 loops=1)
                          Index Cond: ("time" >= (date_part('epoch'::text, (now() - '30 days'::interval)))::integer)
              ->  Sort  (cost=53792.32..54719.09 rows=370707 width=8) (actual time=4203.665..4443.748 rows=346238 loops=1)
                    Sort Key: nc.rev
                    ->  Seq Scan on node_change nc  (cost=0.00..12852.07 rows=370707 width=8) (actual time=0.054..663.719 rows=370707 loops=1)
Total runtime: 4911.430 ms

If I disable sequential scans I get the following explain:

GroupAggregate  (cost=221145.13..221242.01 rows=2 width=17) (actual time=286.348..348.268 rows=25 loops=1)
  ->  Sort  (cost=221145.13..221177.42 rows=12913 width=17) (actual time=286.183..302.239 rows=22419 loops=1)
        Sort Key: r.author
        ->  Nested Loop Left Join  (cost=0.01..220263.40 rows=12913 width=17) (actual time=0.339..86.626 rows=22419 loops=1)
              ->  Index Scan using revision_time_idx on revision r  (cost=0.01..44.98 rows=982 width=17) (actual time=0.194..4.056 rows=1088 loops=1)
                    Index Cond: ("time" >= (date_part('epoch'::text, (now() - '30 days'::interval)))::integer)
              ->  Index Scan using node_change_rev_idx on node_change nc  (cost=0.00..223.18 rows=86 width=8) (actual time=0.009..0.058 rows=21 loops=1088)
                    Index Cond: ("outer".rev = nc.rev)
Total runtime: 350.103 ms

Statistics are set to 20, and I have ANALYZEd both tables.

The report itself isn't important, but I'm using this as an exercise in
PostgreSQL query optimisation and planner tuning, so any help/hints would be
appreciated.

Thanks.

--
Russ
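[For anyone reproducing the comparison: the second plan above is the kind of thing you get by discouraging sequential scans for the current session only, then re-running the EXPLAIN. This is a diagnostic toggle for comparing plans, not something to leave enabled; table and column names are as in the plans above.]

```sql
-- enable_seqscan discourages (does not strictly forbid) sequential scans
-- for this session, so the planner falls back to the index-based plan.
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT author, COUNT(DISTINCT r.rev)
  FROM revision AS r
  LEFT JOIN node_change AS nc
    ON r.rev = nc.rev
 WHERE r.time >= EXTRACT(epoch FROM (NOW() - interval '30 days'))::integer
 GROUP BY r.author;

-- Restore the default so the setting doesn't leak into later queries.
RESET enable_seqscan;
```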
On Sep 14, 2006, at 11:15 AM, Russ Brown wrote:

> We recently upgraded our trac backend from sqlite to postgres, and I
> decided to have a little fun and write some reports that delve into
> trac's subversion cache, and got stuck with a query optimisation
> problem.
>
> Table revision contains 2800+ rows
> Table node_change contains 370000+.
<...>
> I've got stuck with this query:
>
> SELECT author, COUNT(DISTINCT r.rev)
> FROM revision AS r
> LEFT JOIN node_change AS nc
> ON r.rev=nc.rev
> WHERE r.time >= EXTRACT(epoch FROM (NOW() - interval '30
> days'))::integer

Man, I really hate when people store time_t in a database...

> GROUP BY r.author;
>
> Statistics are set to 20, and I have ANALYZEd both tables.
>
> The report itself isn't important, but I'm using this as an exercise in
> PostgreSQL query optimisation and planner tuning, so any help/hints
> would be appreciated.

Setting statistics higher (100-200), at least for the large table, will
likely help. Also make sure that you've set effective_cache_size correctly
(I generally set it to total memory - 1G, assuming the server has at least
4G in it).

--
Jim Nasby                               jimn@enterprisedb.com
EnterpriseDB    http://enterprisedb.com    512.569.9461 (cell)
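[Jim's two suggestions translate to something like the following. The target of 200 and the cache size of 100000 pages are illustrative values, not tuned recommendations; on 8.x, effective_cache_size is expressed in 8kB pages unless a unit is given, so 100000 is roughly 800MB.]

```sql
-- Raise the per-column statistics target on the large table's join key,
-- then re-ANALYZE so the planner gets a more detailed histogram.
ALTER TABLE node_change ALTER COLUMN rev SET STATISTICS 200;
ANALYZE node_change;

-- Tell the planner roughly how much data the OS filesystem cache can hold.
-- Normally set in postgresql.conf; shown here at session level for testing.
SET effective_cache_size = 100000;
```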
On Thu, 2006-09-21 at 23:39 -0400, Jim Nasby wrote:
> On Sep 14, 2006, at 11:15 AM, Russ Brown wrote:
> > We recently upgraded our trac backend from sqlite to postgres, and I
> > decided to have a little fun and write some reports that delve into
> > trac's subversion cache, and got stuck with a query optimisation
> > problem.
> >
> > Table revision contains 2800+ rows
> > Table node_change contains 370000+.
> <...>
> > I've got stuck with this query:
> >
> > SELECT author, COUNT(DISTINCT r.rev)
> > FROM revision AS r
> > LEFT JOIN node_change AS nc
> > ON r.rev=nc.rev
> > WHERE r.time >= EXTRACT(epoch FROM (NOW() - interval '30
> > days'))::integer
>
> Man I really hate when people store time_t in a database...

I know. Probably something to do with database engine independence. I don't
know if sqlite even has a date type (probably does, but I haven't checked).

> > GROUP BY r.author;
> >
> > Statistics are set to 20, and I have ANALYZEd both tables.
> >
> > The report itself isn't important, but I'm using this as an exercise in
> > PostgreSQL query optimisation and planner tuning, so any help/hints
> > would be appreciated.
>
> Setting statistics higher (100-200), at least for the large table
> will likely help. Also make sure that you've set effective_cache_size
> correctly (I generally set it to total memory - 1G, assuming the
> server has at least 4G in it).

Thank you: the problem was effective_cache_size (which I hadn't changed from
the default of 1000). This machine doesn't have loads of RAM, but I knocked
it up to 65536 and now the query uses the index, without having to change
the statistics.

Thanks a lot!

> --
> Jim Nasby                               jimn@enterprisedb.com
> EnterpriseDB    http://enterprisedb.com    512.569.9461 (cell)
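[For the archive, the fix described above amounts to a one-line change in postgresql.conf. On this version the value is counted in 8kB disk pages, so 65536 pages is about 512MB; the setting is planner-only and can be reloaded without a restart.]

```conf
# postgresql.conf
effective_cache_size = 65536    # was 1000 (default); 65536 * 8kB = 512MB
```

A `pg_ctl reload` (or `SELECT pg_reload_conf();` from a superuser session) picks up the change; since effective_cache_size only influences cost estimates, no cache is actually allocated by raising it.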