Thread: Performance woes relating to DISTINCT (I think)
Hello.

I'm not sure if this is a question suited for here, the -sql list, or the -performance list, so if I'm mis-posting, please direct me to the right list. I was unable to come up with anything relevant from the archives (though I am not too sure where to start, so I may just have been looking for the wrong things).

I am trying to convert a bunch of code from MySQL to PostgreSQL. In MySQL I was running this particular query with more complexity (with joins and such) and it takes far less than a second. The Pg stripped-down version takes over 45 seconds (it was taking over 80 seconds with the joins).

The table in question (tasks_applied) contains 12 columns:

cid, modcode, yearcode, seid, tid, actid, pkgid, optional, corrected, labor, lmid, parts_price

I have the following indexes:

CREATE INDEX actid ON tasks_applied USING btree (actid)
CREATE INDEX pkgid ON tasks_applied USING btree (pkgid)
CREATE INDEX strafe_group ON tasks_applied USING btree (modcode, yearcode, seid, tid) WHERE cid = 0
CREATE UNIQUE INDEX tasks_applied_pkey ON tasks_applied USING btree (cid, modcode, yearcode, seid, tid, actid, optional)

Here is my Pg query:
-----------------
SELECT DISTINCT modcode, yearcode, seid, COUNT(DISTINCT(tid)) AS task_count
FROM tasks_applied
WHERE cid=0 AND seid=100001
GROUP BY modcode, yearcode, seid
HAVING COUNT(tid)>=0
ORDER BY modcode ASC, yearcode ASC, seid ASC
-----------------

Here's my EXPLAIN ANALYZE output:
-----------------
QUERY PLAN
Unique  (cost=85168.84..85168.98 rows=11 width=22) (actual time=45602.908..45607.399 rows=515 loops=1)
  ->  Sort  (cost=85168.84..85168.87 rows=11 width=22) (actual time=45602.897..45604.286 rows=515 loops=1)
        Sort Key: modcode, yearcode, seid, count(DISTINCT tid)
        ->  GroupAggregate  (cost=0.00..85168.65 rows=11 width=22) (actual time=3149.916..45578.292 rows=515 loops=1)
              Filter: (count(tid) >= 0)
              ->  Index Scan using strafe_group on tasks_applied  (cost=0.00..85167.23 rows=107 width=22) (actual time=3144.908..45366.147 rows=29893 loops=1)
                    Filter: ((cid = 0) AND (seid = 100001))
Total runtime: 45609.207 ms
-----------------

Finally, here's my MySQL query:
-----------------
SELECT tasks_applied.modcode AS modcode, vin_models.shortname AS shortname,
       vin_years.year AS year, vin_years.yearcode AS yearcode,
       service_events.details AS se, service_events.intl_details AS i_se,
       service_events.seid AS seid, COUNT(DISTINCT(tid)) AS task_count
FROM tasks_applied, service_events, vin_models, vin_years
WHERE cid=0
  AND tasks_applied.yearcode=vin_years.yearcode
  AND tasks_applied.modcode=vin_models.modcode
  AND tasks_applied.seid=service_events.seid
  AND tasks_applied.seid=100001
GROUP BY se, modcode, year
HAVING COUNT(tid)>=0
ORDER BY tasks_applied.modcode ASC, vin_years.year ASC, service_events.seid ASC
-----------------

Any help would be greatly appreciated (even if it's just "RTFM on xxx").

Thanks
--jeff
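One thing worth noting about the slow query itself: since it already groups by modcode, yearcode, seid, each group can only come out once, so the outer SELECT DISTINCT is redundant and only adds the extra Unique/Sort steps visible at the top of the plan. The HAVING COUNT(tid)>=0 clause is also always true (a count is never negative). A minimal equivalent sketch without them, using the same table and columns:

```sql
-- Equivalent to the query above: GROUP BY already yields one row per
-- (modcode, yearcode, seid) group, so SELECT DISTINCT filters nothing,
-- and COUNT(tid) >= 0 always holds, so HAVING filters nothing either.
SELECT modcode, yearcode, seid, COUNT(DISTINCT tid) AS task_count
FROM tasks_applied
WHERE cid = 0 AND seid = 100001
GROUP BY modcode, yearcode, seid
ORDER BY modcode ASC, yearcode ASC, seid ASC;
```

This removes plan steps but does not address the misestimated index scan, which dominates the 45 seconds.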
On 9/26/05, boinger <boinger@gmail.com> wrote:
Hmm, planner expected 11 rows, got 515
planner expected 107 rows, got 29893...
I guess the problem here is that the planner has the wrong idea of
what your data looks like. Try doing two things:
VACUUM ANALYZE;
(of tables in question or whole database)
If that doesn't help, do increase the statistics target. By default PostgreSQL
keeps 10 samples, but you might want to increase it to 50 or even 100.
And then rerun VACUUM ANALYZE.
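The statistics bump described above can be sketched like this; the column choices and the target value of 100 are illustrative, not from the thread:

```sql
-- Raise the planner's per-column sample size for the columns it is
-- misestimating (the default target in this era of PostgreSQL was 10):
ALTER TABLE tasks_applied ALTER COLUMN seid SET STATISTICS 100;
ALTER TABLE tasks_applied ALTER COLUMN modcode SET STATISTICS 100;

-- Re-collect statistics so the new targets take effect:
VACUUM ANALYZE tasks_applied;
```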
If it doesn't help -- please repost the new query plan once again.
Regards,
Dawid
On 9/27/05, Dawid Kuroczko <qnex42@gmail.com> wrote:
> > QUERY PLAN
> > -> GroupAggregate (cost=0.00..85168.65 rows=11 width=22)
> > (actual time=3149.916..45578.292 rows=515 loops=1)
>
> Hmm, planner expected 11 rows, got 515
>
> > (cost=0.00..85167.23 rows=107 width=22) (actual
> > time=3144.908..45366.147 rows=29893 loops=1)
>
> planner expected 107 rows, got 29893...
>
> I guess the problem here is that planner has wrong idea how your
> data looks. Try doing two things:
>
> VACUUM ANALYZE;
> (of tables in question or whole database)
>
> If that doesn't help, do increase the statistics target. By default
> PostgreSQL keeps 10 samples, but you might want to increase it to 50
> or even 100. And then rerun VACUUM ANALYZE.
>
> If it doesn't help -- please repost the new query plan once again.

I actually kind of inadvertently "fixed" it. I threw my hands up and thought to myself "FINE! If it's going to take that long, at least it can do all the joins and whatnot instead of having to loop back and do separate queries."

So, I piled in everything I needed it to do, and now it's inexplicably (to me) fast (!?).

I'm still running a full VACUUM ANALYZE on your recommendation... maybe it'll shave a few more ms off.
Here's what I have, now (pre-vacuum):

SQL:
-----------------
SELECT tasks_applied.modcode AS modcode, tasks_applied.seid AS seid,
       tasks_applied.yearcode AS yearcode, vin_years.year AS year,
       COUNT(DISTINCT(tid)) AS task_count
FROM "SS_valid_modelyears", tasks_applied, vin_years
WHERE cid=0
  AND tasks_applied.seid='500001'
  AND "SS_valid_modelyears".modcode=tasks_applied.modcode
  AND "SS_valid_modelyears".year=vin_years.year
  AND tasks_applied.yearcode=vin_years.yearcode
  AND "SS_valid_modelyears".valid=1
GROUP BY tasks_applied.seid, vin_years.year, tasks_applied.modcode,
         "SS_valid_modelyears".shortname, tasks_applied.yearcode
ORDER BY tasks_applied.seid ASC, vin_years.year ASC
-----------------

QUERY PLAN:
-----------------
GroupAggregate  (cost=201.39..201.42 rows=1 width=69) (actual time=80.383..80.386 rows=1 loops=1)
  ->  Sort  (cost=201.39..201.40 rows=1 width=69) (actual time=79.737..79.898 rows=59 loops=1)
        Sort Key: tasks_applied.seid, vin_years."year", tasks_applied.modcode, "SS_valid_modelyears".shortname, tasks_applied.yearcode
        ->  Nested Loop  (cost=1.38..201.38 rows=1 width=69) (actual time=72.599..78.765 rows=59 loops=1)
              ->  Hash Join  (cost=1.38..165.15 rows=6 width=61) (actual time=0.530..18.881 rows=1188 loops=1)
                    Hash Cond: ("outer"."year" = "inner"."year")
                    ->  Seq Scan on "SS_valid_modelyears"  (cost=0.00..163.54 rows=36 width=56) (actual time=0.183..9.202 rows=1188 loops=1)
                          Filter: ("valid" = 1)
                    ->  Hash  (cost=1.30..1.30 rows=30 width=9) (actual time=0.230..0.230 rows=0 loops=1)
                          ->  Seq Scan on vin_years  (cost=0.00..1.30 rows=30 width=9) (actual time=0.019..0.116 rows=30 loops=1)
              ->  Index Scan using strafe_group on tasks_applied  (cost=0.00..6.02 rows=1 width=22) (actual time=0.042..0.043 rows=0 loops=1188)
                    Index Cond: ((("outer".modcode)::text = (tasks_applied.modcode)::text) AND (tasks_applied.yearcode = "outer".yearcode) AND (tasks_applied.seid = 500001))
                    Filter: (cid = 0)
Total runtime: 80.764 ms
-----------------
On Tue, Sep 27, 2005 at 09:07:41AM -0500, boinger wrote:
> -> Index Scan using strafe_group on tasks_applied
> (cost=0.00..6.02 rows=1 width=22) (actual time=0.042..0.043 rows=0
> loops=1188)
>
> Index Cond: ((("outer".modcode)::text =
> (tasks_applied.modcode)::text) AND (tasks_applied.yearcode =
> "outer".yearcode) AND (tasks_applied.seid = 500001))
>
> Filter: (cid = 0)
>
> Total runtime: 80.764 ms

Compare that to the index scan it had to do before. Now that you gave the database the exact info it needs to answer your real question, it can use a much, much more selective index scan.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf      cell: 512-569-9461
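The selectivity point can also be seen from the index definitions in the first post: strafe_group is keyed (modcode, yearcode, seid, tid), so the original single-table query, filtering only on seid, had to walk every cid=0 entry with seid checked as a row-by-row Filter. A hypothetical partial index leading with seid (not part of the original schema, just a sketch) would have made that query selective on its own:

```sql
-- Hypothetical index, not from the thread: putting seid first lets
-- "WHERE cid = 0 AND seid = 100001" become an Index Cond rather than
-- a Filter over the whole cid = 0 portion of the table.
CREATE INDEX tasks_applied_seid ON tasks_applied
    USING btree (seid, modcode, yearcode, tid)
    WHERE cid = 0;
```

In the joined query this was unnecessary, because the nested loop supplies modcode and yearcode (the leading columns of strafe_group) on every probe.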