Thread: Simple but slow

Simple but slow

From

"Chad Thompson"

Date:

20 August 2002, 18:25:30

I have this query that takes a very long time.

explain analyze select distinct on (l.full_phone) l.full_phone, l.id into temp temp_list_tz
from lists l, timezone tz
where l.area_code = tz.area_code
and l.client_id = 8
and tz.greenwich = '-5'

limit 100000;

these are the results

NOTICE: QUERY PLAN:

Limit (cost=96929.63..97822.45 rows=35713 width=36) (actual time=42477.82..43491.69 rows=100000 loops=1)
-> Unique (cost=96929.63..97822.46 rows=35713 width=36) (actual time=42477.82..43325.87 rows=100001 loops=1)
        -> Sort (cost=96929.63..96929.63 rows=357132 width=36) (actual time=42477.80..42735.18 rows=102151 loops=1)
              -> Hash Join (cost=9.59..41831.14 rows=357132 width=36) (actual time=25.29..16456.26 rows=352194 loops=1)
                    -> Seq Scan on lists l (cost=0.00..32881.18 rows=895244 width=29) (actual time=0.10..11342.50 rows=878098 loops=1)
                    -> Hash (cost=9.22..9.22 rows=148 width=7) (actual time=23.80..23.80 rows=0 loops=1)
                          -> Seq Scan on timezone tz (cost=0.00..9.22 rows=148 width=7) (actual time=21.72..23.45 rows=148 loops=1)
Total runtime: 46247.79 msec

I have about 2M records in lists, and about 400 in timezone, so it shouldnt be anything this serious.

I have indexed area_code in both tables, client_id in lists and greenwich in timezone

I have tried both group by and distinct on with negligable difference.

This doesnt seem like that complex a query. Am i doing something dumb again?

Here are the table structures.

-- Table: lists

CREATE

TABLE "lists" (

"id" int8 DEFAULT nextval('"lists_id_seq"'::text) NOT NULL,

"first_name" varchar(50),

"middle_initial" char(1),

"last_name" varchar(30),

"address" varchar(40),

"city" varchar(30),

"state" char(2),

"zip" char(10),

"area_code" varchar(3),

"list_of_lists_id" int8,

CONSTRAINT "lists_pkey" PRIMARY KEY ("id"),

CONSTRAINT "area_code" FOREIGN KEY ("area_code") REFERENCES "timezone" ("area_code") ON DELETE NO ACTION ON UPDATE NO ACTION NOT DEFERRABLE INITIALLY IMMEDIATE,

CONSTRAINT "client_id" FOREIGN KEY ("client_id") REFERENCES "clients" ("id") ON DELETE NO ACTION ON UPDATE NO ACTION NOT DEFERRABLE INITIALLY IMMEDIATE,

CONSTRAINT "list_of_lists_id" FOREIGN KEY ("list_of_lists_id") REFERENCES "list_of_lists" ("id") ON DELETE NO ACTION ON UPDATE NO ACTION NOT DEFERRABLE INITIALLY IMMEDIATE

) WITH OIDS;

-- Table: timezone

CREATE

TABLE "timezone" (

"id" int8 DEFAULT nextval('"timezone_id_seq"'::text) NOT NULL,

"area_code" varchar(3),

"state" varchar(2),

"full_state_name" varchar(30),

"GMT" int4,

"time_zone" varchar(10),

"daylight_savings" varchar(5),

"notes" text,

"greenwich" varchar(5),

"converted" int4,

CONSTRAINT "area_code_idx" UNIQUE ("area_code"),

CONSTRAINT "timezone_pkey" PRIMARY KEY ("id")

) WITH OIDS;

TIA

Chad

Re: Simple but slow

From

Josh Berkus

Date:

21 August 2002, 20:48:08

Chad,

If you take a look at the Explain content, you'll see where the slow-down is:

> Limit  (cost=96929.63..97822.45 rows=35713 width=36) (actual
time=42477.82..43491.69 rows=100000 loops=1)
>   ->  Unique  (cost=96929.63..97822.46 rows=35713 width=36) (actual
time=42477.82..43325.87 rows=100001 loops=1)
>         ->  Sort  (cost=96929.63..96929.63 rows=357132 width=36) (actual
time=42477.80..42735.18 rows=102151 loops=1)

These three lines are the selecting unique values and sorting and terminating
the result set.   This is most of your computing time; see the "actual time
=42477"

>               ->  Hash Join  (cost=9.59..41831.14 rows=357132 width=36)
(actual time=25.29..16456.26 rows=352194 loops=1)
>                     ->  Seq Scan on lists l  (cost=0.00..32881.18
rows=895244 width=29) (actual time=0.10..11342.50 rows=878098 loops=1)
>                     ->  Hash  (cost=9.22..9.22 rows=148 width=7) (actual
time=23.80..23.80 rows=0 loops=1)
>                           ->  Seq Scan on timezone tz  (cost=0.00..9.22
rows=148 width=7) (actual time=21.72..23.45 rows=148 loops=1)

This is you join to the area codes.  It's ignoring the indexes, because the
number of records in timezone is so small compared to the number in lists.
this isn't a problem, though, because as you can see the join operation takes
only a few milliseconds at a minimum.

> Total runtime: 46247.79 msec

The way I read this, 95% of the time is being spent on the DISTINCT.   Tom, am
I reading this right?
Try:
1) Indexing lists.full_phone.
2) Check the speed without the DISTINCT as  a benckmark.
3) Increasing the amount of memory available to your queries by altering the
postgresql.conf settings and possibly adding more RAM or improving your disk
access speed.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco

Re: Simple but slow

From

"Chad Thompson"

Date:

21 August 2002, 21:12:17

Thanks for your reply Josh, as usual i learn from you whenever you write.
Ive been having a hard time understanding what explain is telling me.
I was able to get the query down to 19 secs w/o the distinct.  I think i'll
move the distinct to one of my faster queries.

If its not too much trouble id like you to look at another.  This is really
being a beast.

select l.id, l.full_phone into "8_21_2002"
from "temp_list_tier" l LEFT JOIN
(select phonenum
from call_results_fixed
where client_id = 8) as cr
ON l.full_phone = cr.phonenum
where cr.phonenum Is Null

I have indexes on cr.phonenum and l.full_phone but it doesnt seem to use
them with the subselect.
cr is about 1 million records, l varies, usually never more than about
30-40K of records.

Merge Join  (cost=265368.44..8176071.25 rows=3161144 width=44) (actual
time=121889.56..126948.26 rows=11384 loops=1)
  ->  Index Scan using temp_list_tier_idx on temp_list_tier l
(cost=0.00..4431.97 rows=99997 width=25) (actual time=0.38..808.34
rows=99997 loops=1)
  ->  Sort  (cost=256113.44..256113.44 rows=1264495 width=14) (actual
time=121887.71..122732.11 rows=422624 loops=1)
        ->  Subquery Scan cr  (cost=0.00..41331.00 rows=1264495 width=14)
(actual time=0.10..12941.66 rows=1274987 loops=1)
              ->  Seq Scan on call_results_fixed  (cost=0.00..41331.00
rows=1264495 width=14) (actual time=0.09..9047.89 rows=1274987 loops=1)
Total runtime: 127273.03 msec

Thanks for your help.
I have also enjoyed your "The Joy of Index".  I look forward to the next
issue.

Thanks
Chad
----- Original Message -----
From: "Josh Berkus" <josh@agliodbs.com>
To: "Chad Thompson" <chad@weblinkservices.com>; "pgsql-novice"
<pgsql-novice@postgresql.org>
Sent: Wednesday, August 21, 2002 6:48 PM
Subject: Re: [NOVICE] Simple but slow



Chad,

If you take a look at the Explain content, you'll see where the slow-down
is:

> Limit  (cost=96929.63..97822.45 rows=35713 width=36) (actual
time=42477.82..43491.69 rows=100000 loops=1)
>   ->  Unique  (cost=96929.63..97822.46 rows=35713 width=36) (actual
time=42477.82..43325.87 rows=100001 loops=1)
>         ->  Sort  (cost=96929.63..96929.63 rows=357132 width=36) (actual
time=42477.80..42735.18 rows=102151 loops=1)

These three lines are the selecting unique values and sorting and
terminating
the result set.   This is most of your computing time; see the "actual time
=42477"

>               ->  Hash Join  (cost=9.59..41831.14 rows=357132 width=36)
(actual time=25.29..16456.26 rows=352194 loops=1)
>                     ->  Seq Scan on lists l  (cost=0.00..32881.18
rows=895244 width=29) (actual time=0.10..11342.50 rows=878098 loops=1)
>                     ->  Hash  (cost=9.22..9.22 rows=148 width=7) (actual
time=23.80..23.80 rows=0 loops=1)
>                           ->  Seq Scan on timezone tz  (cost=0.00..9.22
rows=148 width=7) (actual time=21.72..23.45 rows=148 loops=1)

This is you join to the area codes.  It's ignoring the indexes, because the
number of records in timezone is so small compared to the number in lists.
this isn't a problem, though, because as you can see the join operation
takes
only a few milliseconds at a minimum.

> Total runtime: 46247.79 msec

The way I read this, 95% of the time is being spent on the DISTINCT.   Tom,
am
I reading this right?
Try:
1) Indexing lists.full_phone.
2) Check the speed without the DISTINCT as  a benckmark.
3) Increasing the amount of memory available to your queries by altering the
postgresql.conf settings and possibly adding more RAM or improving your disk
access speed.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco

Re: Simple but slow

From

"Henshall, Stuart - WCP"

Date:

22 August 2002, 10:06:48

> -----Original Message-----
> From: Chad Thompson [mailto:chad@weblinkservices.com]
>
> select l.id, l.full_phone into "8_21_2002"
> from "temp_list_tier" l LEFT JOIN
> (select phonenum
> from call_results_fixed
> where client_id = 8) as cr
> ON l.full_phone = cr.phonenum
> where cr.phonenum Is Null
>
I'm guessing you want all those in "temp_list_tier" where there is not the same phone number in call_results_fixed.
Does this (untested) run better for you:

SELECT l.id,l.full_phone INTO "8_21_2002"
FROM "temp_list_tier" AS l WHERE
WHERE NOT EXISTS
(SELECT * FROM call_results_fixed AS cr
WHERE cr.client_id=8 AND l.full_phone=cr.phonenum);

Indexes needed (I think):
cr.client_id
cr.phonenum
Maybe (I don't think so):
l.full_phone

Not sure if this is what you wanted but hope its helpfull,
- Stuart

Re: Simple but slow

From

"Chad Thompson"

Date:

22 August 2002, 11:04:55

WOW! Thats a great query.

It brought down the access time from 121 secs to about 4.

Thanks a ton

Chad

----- Original Message -----
From: Henshall, Stuart - WCP
To: 'Chad Thompson' ; josh@agliodbs.com ; pgsql-novice
Sent: Thursday, August 22, 2002 7:59 AM
Subject: RE: [NOVICE] Simple but slow

> -----Original Message-----
> From: Chad Thompson [mailto:chad@weblinkservices.com]
>
> select l.id, l.full_phone into "8_21_2002"
> from "temp_list_tier" l LEFT JOIN
> (select phonenum
> from call_results_fixed
> where client_id = 8) as cr
> ON l.full_phone = cr.phonenum
> where cr.phonenum Is Null
>
I'm guessing you want all those in "temp_list_tier" where there is not the same phone number in call_results_fixed.
Does this (untested) run better for you:
SELECT l.id,l.full_phone INTO "8_21_2002"
FROM "temp_list_tier" AS l WHERE
WHERE NOT EXISTS
(SELECT * FROM call_results_fixed AS cr
WHERE cr.client_id=8 AND l.full_phone=cr.phonenum);
Indexes needed (I think):
cr.client_id
cr.phonenum
Maybe (I don't think so):
l.full_phone
Not sure if this is what you wanted but hope its helpfull,
- Stuart

Re: Simple but slow

From

"Josh Berkus"

Date:

23 August 2002, 12:18:00

Chad,

> Thanks for your reply Josh, as usual i learn from you whenever you
> write.

You're quite welcome!

> Ive been having a hard time understanding what explain is telling me.
> I was able to get the query down to 19 secs w/o the distinct.  I
> think i'll
> move the distinct to one of my faster queries.

Distinct on large result sets can be quite brutal.  Here's why your
query was slow with DISTINCT:

1. First the query has to sort by the DISTINCT field.
2. Then it has to "roll up" all the non-distinct entries
3. Then it has to re-sort by your output sort.

This isn't much of a problem on small tables, but with 2 million
records, that's 3 table scans of the whole table, which either requires
a lot of patience or a server with 2gb of RAM and a really fast RAID
array.

> If its not too much trouble id like you to look at another.  This is
> really
> being a beast.

I think somebody already posted a solution for this.

> Thanks for your help.
> I have also enjoyed your "The Joy of Index".  I look forward to the
> next
> issue.

You're welcome again.   According to Tom and Bruno, I need to post some
corrections ... look for them early next week.

-Josh Berkus
 "Standing on the shoulders of giants."