Thread: Performance woes relating to DISTINCT (I think)

From: boinger
Hello.

I'm not sure if this is a question suited for here, the -sql list, or
the -performance list, so if I'm mis-posting, please direct me to the
right list.

I was unable to come up with anything relevant from the archives
(though...I am not too sure where to start so I may just have been
looking for the wrong things).

I am trying to convert a bunch of code from MySQL to Postgresql.  In
MySQL I was running this particular query with more complexity (with
joins and such) and it takes far less than a second.  The Pg
stripped-down version takes over 45 seconds (it was taking over 80
seconds with the joins).

The table-in-question (tasks_applied) contains 12 columns:
cid, modcode, yearcode, seid, tid, actid, pkgid, optional, corrected,
labor, lmid, parts_price

I have the following indexes:
CREATE INDEX actid ON tasks_applied USING btree (actid)
CREATE INDEX pkgid ON tasks_applied USING btree (pkgid)
CREATE INDEX strafe_group ON tasks_applied USING btree (modcode,
yearcode, seid, tid) WHERE cid = 0
CREATE UNIQUE INDEX tasks_applied_pkey ON tasks_applied USING btree
(cid, modcode, yearcode, seid, tid, actid, optional)

Here is my Pg query:
-----------------
        SELECT DISTINCT
            modcode,
            yearcode,
            seid,
            COUNT(DISTINCT(tid))    AS task_count
        FROM
            tasks_applied
        WHERE
            cid=0 AND
            seid=100001
        GROUP BY
            modcode,
            yearcode,
            seid
        HAVING
            COUNT(tid)>=0
        ORDER BY
            modcode ASC,
            yearcode ASC,
            seid ASC
---------------
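(An aside, not from the original post: in this query the DISTINCT is redundant, since GROUP BY modcode, yearcode, seid already produces one row per combination, and HAVING COUNT(tid) >= 0 is true for every group. A trimmed but equivalent sketch would be:)

```sql
-- Equivalent, trimmed sketch: DISTINCT and the always-true HAVING
-- clause are dropped; the result set should be identical.
SELECT
    modcode,
    yearcode,
    seid,
    COUNT(DISTINCT tid) AS task_count
FROM tasks_applied
WHERE cid = 0
  AND seid = 100001
GROUP BY modcode, yearcode, seid
ORDER BY modcode, yearcode, seid;
```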


Here's my EXPLAIN ANALYZE output:
-----------------
QUERY PLAN

Unique  (cost=85168.84..85168.98 rows=11 width=22) (actual
time=45602.908..45607.399 rows=515 loops=1)

  ->  Sort  (cost=85168.84..85168.87 rows=11 width=22) (actual
time=45602.897..45604.286 rows=515 loops=1)

        Sort Key: modcode, yearcode, seid, count(DISTINCT tid)

        ->  GroupAggregate  (cost=0.00..85168.65 rows=11 width=22)
(actual time=3149.916..45578.292 rows=515 loops=1)

              Filter: (count(tid) >= 0)

              ->  Index Scan using strafe_group on tasks_applied
(cost=0.00..85167.23 rows=107 width=22) (actual
time=3144.908..45366.147 rows=29893 loops=1)

                    Filter: ((cid = 0) AND (seid = 100001))

Total runtime: 45609.207 ms

-------------

Finally, here's my MySQL query:
               SELECT
                       tasks_applied.modcode           AS modcode,
                       vin_models.shortname            AS shortname,
                       vin_years.year                  AS year,
                       vin_years.yearcode              AS yearcode,
                       service_events.details          AS se,
                       service_events.intl_details     AS i_se,
                       service_events.seid             AS seid,
                       COUNT(DISTINCT(tid))            AS task_count
               FROM
                       tasks_applied,
                       service_events,
                       vin_models,
                       vin_years
               WHERE
                       cid=0
                   AND tasks_applied.yearcode=vin_years.yearcode
                   AND tasks_applied.modcode=vin_models.modcode
                   AND tasks_applied.seid=service_events.seid
                   AND tasks_applied.seid=100001
               GROUP BY
                       se, modcode, year
               HAVING
                       COUNT(tid)>=0
               ORDER BY
                       tasks_applied.modcode ASC,
                       vin_years.year ASC,
                       service_events.seid ASC
---------

Any help would be greatly appreciated (even if it's just "RTFM on xxx").

Thanks

--jeff

Re: Performance woes relating to DISTINCT (I think)

From: Dawid Kuroczko
On 9/26/05, boinger <boinger@gmail.com> wrote:
> Hello.
>
> I'm not sure if this is a question suited for here, the -sql list, or
> the -performance list, so if I'm mis-posting, please direct me to the
> right list.
>
> I was unable to come up with anything relevant from the archives
> (though...I am not too sure where to start so I may just have been
> looking for the wrong things).
>
> I am trying to convert a bunch of code from MySQL to Postgresql.  In
> MySQL I was running this particular query with more complexity (with
> joins and such) and it takes far less than a second.  The Pg
> stripped-down version takes over 45 seconds (it was taking over 80
> seconds with the joins).



> QUERY PLAN
>         ->  GroupAggregate  (cost=0.00..85168.65 rows=11 width=22)
> (actual time=3149.916..45578.292 rows=515 loops=1)

Hmm, planner expected 11 rows, got 515

> (cost=0.00..85167.23 rows=107 width=22) (actual
> time=3144.908..45366.147 rows=29893 loops=1)


planner expected 107 rows, got 29893...
 
I guess the problem here is that planner has wrong idea how your
data looks.  Try doing two things:

VACUUM ANALYZE;
(of tables in question or whole database)

If that doesn't help, do increase the statistics target.  By default PostgreSQL
keeps 10 samples, but you might want to increase it to 50 or even 100.
And then rerun VACUUM ANALYZE.
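
In concrete terms (a sketch of the commands Dawid describes, using the table and column names from the original post):

```sql
-- Refresh planner statistics for the table in question.
VACUUM ANALYZE tasks_applied;

-- If estimates are still far off, raise the per-column statistics
-- target (the default was 10 on versions of that era) and re-analyze.
ALTER TABLE tasks_applied ALTER COLUMN seid SET STATISTICS 100;
ALTER TABLE tasks_applied ALTER COLUMN modcode SET STATISTICS 100;
ANALYZE tasks_applied;
```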

If it doesn't help -- please repost the new query plan once again.

   Regards,
      Dawid

Re: Performance woes relating to DISTINCT (I think)

From: boinger
On 9/27/05, Dawid Kuroczko <qnex42@gmail.com> wrote:
> > QUERY PLAN
> >         ->  GroupAggregate  (cost=0.00..85168.65 rows=11 width=22)
> > (actual time=3149.916..45578.292 rows=515 loops=1)
>
>  Hmm, planner expected 11 rows, got 515
>
>
> > (cost=0.00..85167.23 rows=107 width=22) (actual
> > time=3144.908..45366.147 rows=29893 loops=1)
>
>
>  planner expected 107 rows, got 29893...
>   I guess the problem here is that planner has wrong idea how your
>  data looks.  Try doing two things:
>
>  VACUUM ANALYZE;
>  (of tables in question or whole database)
>
>  If that doesn't help, do increase the statistics target.  By default
> PostgreSQL
>  keeps 10 samples, but you might want to increase it to 50 or even 100.
>  And then rerun VACUUM ANALYZE.
>
>  If it doesn't help -- please repost the new query plan once again.

I actually kind of inadvertently "fixed" it.

I threw my hands up and thought to myself "FINE! If it's going to take
that long, at least it can do all the joins and whatnot instead of
having to loop back and do separate queries"

So, I piled in everything I needed it to do, and now it's inexplicably
(to me) fast (!?).

I'm still running a full VACUUM ANALYZE on your recommendation...maybe
shave a few more ms off.

Here's what I have, now (pre-vacuum):

SQL:
SELECT
            tasks_applied.modcode               AS modcode,
            tasks_applied.seid                  AS seid,
            tasks_applied.yearcode              AS yearcode,
            vin_years.year                      AS year,
            COUNT(DISTINCT(tid))                AS task_count
        FROM
            "SS_valid_modelyears",
            tasks_applied,
            vin_years
        WHERE
            cid=0
            AND tasks_applied.seid='500001'
            AND "SS_valid_modelyears".modcode=tasks_applied.modcode
            AND "SS_valid_modelyears".year=vin_years.year
            AND tasks_applied.yearcode=vin_years.yearcode
            AND "SS_valid_modelyears".valid=1
        GROUP BY
            tasks_applied.seid,
            vin_years.year,
            tasks_applied.modcode,
            "SS_valid_modelyears".shortname,
            tasks_applied.yearcode
        ORDER BY
            tasks_applied.seid ASC,
            vin_years.year ASC


QUERY PLAN:
GroupAggregate  (cost=201.39..201.42 rows=1 width=69) (actual
time=80.383..80.386 rows=1 loops=1)

  ->  Sort  (cost=201.39..201.40 rows=1 width=69) (actual
time=79.737..79.898 rows=59 loops=1)

        Sort Key: tasks_applied.seid, vin_years."year",
tasks_applied.modcode, "SS_valid_modelyears".shortname,
tasks_applied.yearcode

        ->  Nested Loop  (cost=1.38..201.38 rows=1 width=69) (actual
time=72.599..78.765 rows=59 loops=1)

              ->  Hash Join  (cost=1.38..165.15 rows=6 width=61)
(actual time=0.530..18.881 rows=1188 loops=1)

                    Hash Cond: ("outer"."year" = "inner"."year")

                    ->  Seq Scan on "SS_valid_modelyears"
(cost=0.00..163.54 rows=36 width=56) (actual time=0.183..9.202
rows=1188 loops=1)

                          Filter: ("valid" = 1)

                    ->  Hash  (cost=1.30..1.30 rows=30 width=9)
(actual time=0.230..0.230 rows=0 loops=1)

                          ->  Seq Scan on vin_years  (cost=0.00..1.30
rows=30 width=9) (actual time=0.019..0.116 rows=30 loops=1)

              ->  Index Scan using strafe_group on tasks_applied
(cost=0.00..6.02 rows=1 width=22) (actual time=0.042..0.043 rows=0
loops=1188)

                    Index Cond: ((("outer".modcode)::text =
(tasks_applied.modcode)::text) AND (tasks_applied.yearcode =
"outer".yearcode) AND (tasks_applied.seid = 500001))

                    Filter: (cid = 0)

Total runtime: 80.764 ms

Re: Performance woes relating to DISTINCT (I think)

From: "Jim C. Nasby"
On Tue, Sep 27, 2005 at 09:07:41AM -0500, boinger wrote:
>               ->  Index Scan using strafe_group on tasks_applied
> (cost=0.00..6.02 rows=1 width=22) (actual time=0.042..0.043 rows=0
> loops=1188)
>
>                     Index Cond: ((("outer".modcode)::text =
> (tasks_applied.modcode)::text) AND (tasks_applied.yearcode =
> "outer".yearcode) AND (tasks_applied.seid = 500001))
>
>                     Filter: (cid = 0)
>
> Total runtime: 80.764 ms

Compare that to the index scan it had to do before. Now that you gave
the database the exact info it needs to answer your real question, it
can use a much, much more selective index scan.
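
(An aside, not from the thread: the original stripped-down query filtered only on cid and seid, but strafe_group is keyed on (modcode, yearcode, seid, tid), so seid could not be used as a leading index condition and the whole partial index had to be walked with a filter. A hypothetical partial index that leads with seid would have made even the stripped-down query a direct index probe, e.g.:)

```sql
-- Hypothetical index for the original stripped-down query: seid comes
-- first, so WHERE cid = 0 AND seid = 100001 becomes an index condition
-- rather than a filter over the entire partial index.
CREATE INDEX tasks_applied_seid ON tasks_applied (seid, modcode, yearcode, tid)
    WHERE cid = 0;
```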
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461