Thread: Slow query

Slow query

From
"Jeremy Haile"
Date:
I am running a query that joins against several large tables (~5 million
rows each).  The query takes an exteremely long time to run, and the
explain output is a bit beyond my level of understanding.  It is an
auto-generated query, so the aliases are fairly ugly.  I can clean them
up (rename them) if it would help.  Also, let me know if I can send any
more information that would help (e.g. table schema)

Also, is there any resources where I can get a better understanding of
what PostgreSQL means when it says "Sort" "Sort Key" "Bitmap Index Scan"
"Hash Cond" etc. etc. - and how to recognize problems by looking at the
output.  I can understand the output for simple queries (e.g. is the
planner using an index or performing a seq. scan), but when you get to
more complex queries like the one below I lose my way =)

I would really appreciate it if someone from this list could tell me if
there is anything that is obviously wrong with the query or schema and
what I could do to improve the performance.

PostgreSQL 8.1
RedHat Enterprise Linux 4

--QUERY
select distinct city4_.region_id as region1_29_, city4_1_.name as
name29_, city4_.state_id as state2_30_
from registered_voters registered0_
         inner join registered_voter_addresses addresses1_ on
         registered0_.registered_voter_id=addresses1_.registered_voter_id
         inner join registered_voter_addresses_regions regions2_ on
         addresses1_.address_id=regions2_.registered_voter_addresses_address_id
         inner join regions region3_ on
         regions2_.regions_region_id=region3_.region_id
         inner join cities city4_ on
         addresses1_.city_id=city4_.region_id
         inner join regions city4_1_ on
         city4_.region_id=city4_1_.region_id
where region3_.region_id='093c44e8-f3b2-4c60-8be3-2b4d148f9f5a'
order by city4_1_.name


--EXPLAIN/ANALYZE OUTPUT
"Unique  (cost=3572907.42..3623589.94 rows=4076438 width=93) (actual
time=2980825.714..3052333.753 rows=1124 loops=1)"
"  ->  Sort  (cost=3572907.42..3585578.05 rows=5068252 width=93) (actual
time=2980825.710..2987407.888 rows=4918204 loops=1)"
"        Sort Key: city4_1_.name, city4_.region_id, city4_.state_id"
"        ->  Hash Join  (cost=717783.40..1430640.10 rows=5068252
width=93) (actual time=1400141.559..2016131.467 rows=4918204 loops=1)"
"              Hash Cond:
(("outer".registered_voter_addresses_address_id)::text =
("inner".address_id)::text)"
"              ->  Bitmap Heap Scan on
registered_voter_addresses_regions regions2_  (cost=54794.95..575616.49
rows=5116843 width=80) (actual time=45814.469..155044.478 rows=4918205
loops=1)"
"                    Recheck Cond:
('093c44e8-f3b2-4c60-8be3-2b4d148f9f5a'::text =
(regions_region_id)::text)"
"                    ->  Bitmap Index Scan on
reg_voter_address_region_region_idx  (cost=0.00..54794.95 rows=5116843
width=0) (actual time=45807.157..45807.157 rows=4918205 loops=1)"
"                          Index Cond:
('093c44e8-f3b2-4c60-8be3-2b4d148f9f5a'::text =
(regions_region_id)::text)"
"              ->  Hash  (cost=642308.89..642308.89 rows=741420
width=173) (actual time=1354217.934..1354217.934 rows=4918204 loops=1)"
"                    ->  Hash Join  (cost=328502.66..642308.89
rows=741420 width=173) (actual time=204565.031..1268303.832 rows=4918204
loops=1)"
"                          Hash Cond:
(("outer".registered_voter_id)::text =
("inner".registered_voter_id)::text)"
"                          ->  Seq Scan on registered_voters
registered0_  (cost=0.00..173703.02 rows=4873202 width=40) (actual
time=0.005..39364.261 rows=4873167 loops=1)"
"                          ->  Hash  (cost=303970.34..303970.34
rows=748528 width=213) (actual time=204523.861..204523.861 rows=4918204
loops=1)"
"                                ->  Hash Join  (cost=263.22..303970.34
rows=748528 width=213) (actual time=101.628..140936.062 rows=4918204
loops=1)"
"                                      Hash Cond:
(("outer".city_id)::text = ("inner".region_id)::text)"
"                                      ->  Seq Scan on
registered_voter_addresses addresses1_  (cost=0.00..271622.23
rows=4919923 width=120) (actual time=0.025..98416.667 rows=4918205
loops=1)"
"                                      ->  Hash  (cost=260.35..260.35
rows=1147 width=173) (actual time=101.582..101.582 rows=1147 loops=1)"
"                                            ->  Hash Join
(cost=48.80..260.35 rows=1147 width=173) (actual time=88.608..98.984
rows=1147 loops=1)"
"                                                  Hash Cond:
(("outer".region_id)::text = ("inner".region_id)::text)"
"                                                  ->  Seq Scan on
regions city4_1_  (cost=0.00..162.39 rows=7539 width=53) (actual
time=0.048..35.204 rows=7539 loops=1)"
"                                                  ->  Hash
(cost=45.93..45.93 rows=1147 width=120) (actual time=48.896..48.896
rows=1147 loops=1)"
"                                                        ->  Nested Loop
 (cost=0.00..45.93 rows=1147 width=120) (actual time=35.791..47.012
rows=1147 loops=1)"
"                                                              ->  Index
Scan using regions_pkey on regions region3_  (cost=0.00..5.99 rows=1
width=40) (actual time=35.761..35.763 rows=1 loops=1)"
"
Index Cond: ((region_id)::text =
'093c44e8-f3b2-4c60-8be3-2b4d148f9f5a'::text)"
"                                                              ->  Seq
Scan on cities city4_  (cost=0.00..28.47 rows=1147 width=80) (actual
time=0.022..9.476 rows=1147 loops=1)"
"Total runtime: 3052707.269 ms"


Re: Slow query

From
Tom Lane
Date:
"Jeremy Haile" <jhaile@fastmail.fm> writes:
> I am running a query that joins against several large tables (~5 million
> rows each).  The query takes an exteremely long time to run, and the
> explain output is a bit beyond my level of understanding.  It is an
> auto-generated query, so the aliases are fairly ugly.

Yah :-(

> select distinct city4_.region_id as region1_29_, city4_1_.name as
> name29_, city4_.state_id as state2_30_
> from registered_voters registered0_
>          inner join registered_voter_addresses addresses1_ on
>          registered0_.registered_voter_id=addresses1_.registered_voter_id
>          inner join registered_voter_addresses_regions regions2_ on
>          addresses1_.address_id=regions2_.registered_voter_addresses_address_id
>          inner join regions region3_ on
>          regions2_.regions_region_id=region3_.region_id
>          inner join cities city4_ on
>          addresses1_.city_id=city4_.region_id
>          inner join regions city4_1_ on
>          city4_.region_id=city4_1_.region_id
> where region3_.region_id='093c44e8-f3b2-4c60-8be3-2b4d148f9f5a'
> order by city4_1_.name

AFAICS the planner is doing about the best you can hope the machine to
do --- it's not making any serious estimation errors, and the plan is
pretty reasonable for the given query.  The problem is that you are
forming a very large join result (4918204 rows) and then doing a
DISTINCT that reduces this to only 1124 rows ... but the damage of
computing that huge join has already been done.  The machine is not
going to be able to think its way out of this one --- it's up to you
to think of a better formulation of the query.

Offhand I'd try something involving joining just city4_/city4_1_
(which should not need DISTINCT, I think) and then using WHERE
EXISTS(SELECT ... FROM the-other-tables) to filter out the cities
you don't want.  The reason this can be a win is that the EXISTS
formulation will stop running the sub-select as soon as it's produced a
single row for the current city, rather than generating thousands of
similar rows that will be thrown away by DISTINCT as you have here.

This assumes that the fraction of cities passing the query is
substantial, as it appears from the rowcounts in your EXPLAIN output.
If only a tiny fraction of them passed, then the time wasted in failing
EXISTS probes might eat up the savings.

            regards, tom lane