Thread: Mystefied at poor performance of a standard query

Mystefied at poor performance of a standard query

From

"David logan"

Date:

01 October 2008, 09:11:46

Hi!

(The question is why this simple select takes me 20 minutes to run...)

I have two tables with address data and result data from two different runs
of two different geocoding engines. I want the count of result data
differences when the output address data matches.

In essence, I want:
Engine 1     engine 2
Id        =  id
Addr      =  Addr
City      =  City
Zip       =  Zip

And then
Result    != Result

When I do g.id=m.id, performance is fine. When I add
g.outaddress=m.out_address, performance is fine. But when I add
g.city=m.out_city, the query goes from ~10s to ~20 MINUTES for some reason.
Any ideas why? If I do a subselect rather than just g.outcity=m.out_city,
it's fine, but if I change it back to the former, it goes back to 20
minutes. The queries and explains are below.

resultcodes=# explain analyze select g.outresultcode,m.resultcode,count(*)
from geocoderoutput g,mmoutput m where g.id=m.id and
g.outresultcode<>m.resultcode and g.outaddress=m.out_address and
g.outcity=m.out_city group by g.outresultcode,m.resultcode order by count(*)
desc;
                                                                     QUERY
PLAN

----------------------------------------------------------------------------
--------------------------------------------------------
----------------
 Sort  (cost=64772.08..64772.09 rows=1 width=22) (actual
time=1194603.363..1194604.099 rows=515 loops=1)
   Sort Key: (count(*))
   Sort Method:  quicksort  Memory: 53kB
   ->  HashAggregate  (cost=64772.06..64772.07 rows=1 width=22) (actual
time=1194601.374..1194602.316 rows=515 loops=1)
         ->  Hash Join  (cost=24865.31..64772.05 rows=1 width=22) (actual
time=373146.994..1194475.482 rows=52179 loops=1)
               Hash Cond: ((m.id = g.id) AND ((m.out_address)::text =
(g.outaddress)::text) AND ((m.out_city)::text = (g.outcity)::text))
               Join Filter: (g.outresultcode <> m.resultcode)
               ->  Seq Scan on mmoutput m  (cost=0.00..15331.84 rows=502884
width=42) (actual time=0.010..1324.974 rows=502884 loops=1)
               ->  Hash  (cost=11644.84..11644.84 rows=502884 width=41)
(actual time=370411.043..370411.043 rows=502704 loops=1)
                     ->  Seq Scan on geocoderoutput g  (cost=0.00..11644.84
rows=502884 width=41) (actual time=0.010..1166.141 rows=502884 loops=1)
 Total runtime: 1194605.011 ms
(11 rows)

resultcodes=# explain analyze select g.outresultcode,m.resultcode,count(*)
from geocoderoutput g,mmoutput m where g.id=m.id and
g.outresultcode<>m.resultcode and g.outaddress=m.out_address and
g.outcity=(select out_city from mmoutput where id=g.id and
out_city=g.outcity) group by g.outresultcode,m.resultcode order by count(*)
desc;
                                                                       QUERY
PLAN

----------------------------------------------------------------------------
--------------------------------------------------------
--------------------
 Sort  (cost=4218017.33..4218017.34 rows=1 width=22) (actual
time=23095.890..23096.632 rows=515 loops=1)
   Sort Key: (count(*))
   Sort Method:  quicksort  Memory: 53kB
   ->  HashAggregate  (cost=4218017.31..4218017.32 rows=1 width=22) (actual
time=23093.901..23094.839 rows=515 loops=1)
         ->  Nested Loop  (cost=0.00..4218017.30 rows=1 width=22) (actual
time=102.356..22930.141 rows=52179 loops=1)
               Join Filter: ((g.outresultcode <> m.resultcode) AND
((g.outaddress)::text = (m.out_address)::text))
               ->  Seq Scan on geocoderoutput g  (cost=0.00..4202690.15
rows=2514 width=41) (actual time=98.045..15040.142 rows=468172 loops=1)
                     Filter: ((outcity)::text = ((subplan))::text)
                     SubPlan
                       ->  Index Scan using mmoutput_pkey on mmoutput
(cost=0.00..8.33 rows=1 width=10) (actual time=0.018..0.021 rows=1
loops=502884)
                             Index Cond: (id = $0)
                             Filter: ((out_city)::text = ($1)::text)
               ->  Index Scan using mmoutput_pkey on mmoutput m
(cost=0.00..6.08 rows=1 width=32) (actual time=0.006..0.008 rows=1
loops=468172)
                     Index Cond: (m.id = g.id)
 Total runtime: 23097.548 ms
(15 rows)

Re: Mystefied at poor performance of a standard query

From

Tom Lane

Date:

01 October 2008, 10:30:20

"David logan" <djlogan2@comcast.net> writes:
> (The question is why this simple select takes me 20 minutes to run...)

What have you got work_mem set to?  The hash join is not going to be
real fast if it has to split the join into multiple batches, so you
want work_mem large enough to hold the whole inner relation.  That would
be at least 20MB in this example, probably quite a bit more after
allowing for per-row overhead in the table.

            regards, tom lane

Re: Mystefied at poor performance of a standard query

From

"David logan"

Date:

01 October 2008, 13:38:44

Looks like that worked. I set work_mem to 256MB, and it looks like my
standard sql came back in just a couple of seconds.

Thanks!

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Wednesday, October 01, 2008 07:30
To: David logan
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Mystefied at poor performance of a standard query

"David logan" <djlogan2@comcast.net> writes:
> (The question is why this simple select takes me 20 minutes to run...)

What have you got work_mem set to?  The hash join is not going to be
real fast if it has to split the join into multiple batches, so you
want work_mem large enough to hold the whole inner relation.  That would
be at least 20MB in this example, probably quite a bit more after
allowing for per-row overhead in the table.

            regards, tom lane

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance