Re: Slow query - Mailing list pgsql-performance

From Tom Lane
Subject Re: Slow query
Date
Msg-id 23300.1140715522@sss.pgh.pa.us
Whole thread Raw
In response to Slow query  ("Jeremy Haile" <jhaile@fastmail.fm>)
List pgsql-performance
"Jeremy Haile" <jhaile@fastmail.fm> writes:
> I am running a query that joins against several large tables (~5 million
> rows each).  The query takes an exteremely long time to run, and the
> explain output is a bit beyond my level of understanding.  It is an
> auto-generated query, so the aliases are fairly ugly.

Yah :-(

> select distinct city4_.region_id as region1_29_, city4_1_.name as
> name29_, city4_.state_id as state2_30_
> from registered_voters registered0_
>          inner join registered_voter_addresses addresses1_ on
>          registered0_.registered_voter_id=addresses1_.registered_voter_id
>          inner join registered_voter_addresses_regions regions2_ on
>          addresses1_.address_id=regions2_.registered_voter_addresses_address_id
>          inner join regions region3_ on
>          regions2_.regions_region_id=region3_.region_id
>          inner join cities city4_ on
>          addresses1_.city_id=city4_.region_id
>          inner join regions city4_1_ on
>          city4_.region_id=city4_1_.region_id
> where region3_.region_id='093c44e8-f3b2-4c60-8be3-2b4d148f9f5a'
> order by city4_1_.name

AFAICS the planner is doing about the best you can hope the machine to
do --- it's not making any serious estimation errors, and the plan is
pretty reasonable for the given query.  The problem is that you are
forming a very large join result (4918204 rows) and then doing a
DISTINCT that reduces this to only 1124 rows ... but the damage of
computing that huge join has already been done.  The machine is not
going to be able to think its way out of this one --- it's up to you
to think of a better formulation of the query.

Offhand I'd try something involving joining just city4_/city4_1_
(which should not need DISTINCT, I think) and then using WHERE
EXISTS(SELECT ... FROM the-other-tables) to filter out the cities
you don't want.  The reason this can be a win is that the EXISTS
formulation will stop running the sub-select as soon as it's produced a
single row for the current city, rather than generating thousands of
similar rows that will be thrown away by DISTINCT as you have here.

This assumes that the fraction of cities passing the query is
substantial, as it appears from the rowcounts in your EXPLAIN output.
If only a tiny fraction of them passed, then the time wasted in failing
EXISTS probes might eat up the savings.

            regards, tom lane

pgsql-performance by date:

Previous
From: Vivek Khera
Date:
Subject: Re: Looking for a tool to "*" pg tables as ERDs
Next
From: "Kevin Grittner"
Date:
Subject: Re: Good News re count(*) in 8.1