Query Join Performance

From: "Aaron Bono"
I have a database with the following tables:

zip {
    zip_id bigserial,
    zip_cd varchar(6) -- this is indexed
}

city {
    city_id bigserial,
    city_name varchar(50)
}

zip_city {
    city_id bigint (FK to city table),
    zip_id bigint (FK to zip table),
    longitude numeric,
    latitude numeric
}

We want to find the zip_cd values for all zip codes within 50 miles of a
specified zip code and have this query so far (assume our given zip code is
64131):

    select
        zip_city.zip_id,
        gps_distance(zip_city.longitude::numeric, zip_city.latitude::numeric, my_location.longitude::numeric, my_location.latitude::numeric) AS distance
    from zip_city
    inner join (
        select
            longitude,
            latitude
        from zip
        inner join zip_city on (
            zip.zip_id = zip_city.zip_id
        )
        where zip_cd = '64131'
    ) my_location on (
        -- We use this as an approximation to filter out as many records as
        -- possible before doing the gps_distance() function call.
        -- It works well as an outer bound to exclude the most obviously
        -- distant zip_city values.
        abs(zip_city.longitude - my_location.longitude) <= (50.0 / 60.0)
        AND abs(zip_city.latitude - my_location.latitude) <= (50.0 / 60.0)
    )

This runs in about 2 seconds on Windows and gives me 650 records - not bad.

When I add just a little bit:

    select
        zip_city.zip_id,
        gps_distance(zip_city.longitude::numeric, zip_city.latitude::numeric, my_location.longitude::numeric, my_location.latitude::numeric) AS distance
    from zip_city
    -->>>> THE ONLY DIFFERENCE IS THE NEXT 3 LINES <<<<--
    inner join zip on (
        zip_city.zip_id = zip.zip_id
    )
    -->>>> End of added code <<<<--
    inner join (
        select
            longitude,
            latitude
        from zip
        inner join zip_city on (
            zip.zip_id = zip_city.zip_id
        )
        where zip_cd = '64131'
    ) my_location on (
        abs(zip_city.longitude - my_location.longitude) <= (50.0 / 60.0)
        AND abs(zip_city.latitude - my_location.latitude) <= (50.0 / 60.0)
    )

Performance tanks with this query - it takes over 120 seconds (that is where I
set the timeout).  I read that a foreign key doesn't help performance, and my
guess is that PostgreSQL isn't joining on just the 650 records but rather on
the over 800,000 records in the zip table (yes, we have that many - we include
Canada zip codes, which are 90% of the data - Blame Canada! ;).

Is there something I can do to boost performance?  I tried putting the first
query above in an inner join before joining with the zip table, with the same
result.

BTW, on our Linux box the full query we run (which adds 3 more tables to the
whole operation along with more filtering on the zip table) finishes in under
10 seconds.  Problem is our development is on Windows and this is a real pain
for developers to test.

Any assistance is appreciated.  I did look at the explain plans but I don't
see anything other than what I mentioned above.  I can include those if
necessary.

Thanks!
Aaron

--
==================================================================
   Aaron Bono
   Aranya Software Technologies, Inc.
   http://www.aranya.com
   http://codeelixir.com
==================================================================
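
(The thread never shows the body of gps_distance().  For reference, a
plausible definition matching the (longitude, latitude, longitude, latitude)
call signature above, returning statute miles, might look like the sketch
below; this is an assumption for illustration, not Aaron's actual function.)

    -- Hypothetical gps_distance(): great-circle distance in statute miles
    -- via the spherical law of cosines; 3959 is the earth's mean radius in
    -- miles.  SQL-language functions on 8.1 must refer to arguments as $1..$4.
    CREATE OR REPLACE FUNCTION gps_distance(lon1 numeric, lat1 numeric,
                                            lon2 numeric, lat2 numeric)
    RETURNS double precision AS $$
        -- greatest/least clamp rounding overshoot into acos's [-1, 1] domain
        SELECT 3959.0 * acos(greatest(-1.0, least(1.0,
               sin(radians($2)) * sin(radians($4))
             + cos(radians($2)) * cos(radians($4)) * cos(radians($3 - $1)))))
    $$ LANGUAGE sql IMMUTABLE;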

Re: Query Join Performance

From: Richard Huxton
Aaron Bono wrote:
> Performance tanks with this query - it takes over 120 seconds (that is
> where I set the timeout).

> BTW, on our Linux box the full query we run (which adds 3 more tables to
> the whole operation along with more filtering on the zip table) finishes in
> under 10 seconds.  Problem is our development is on Windows and this is a
> real pain for developers to test.

So what's different between the systems. Obvious things to look at:
1. EXPLAIN ANALYSE outputs to show the plans (these presumably are 
different, but in what details, and why?)
2. Configuration (particularly memory/cost settings).
3. Hardware.
4. Locale/encoding - these can affect index usage and sorting.
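
For item 2, a quick way to compare the two servers' planner and memory
settings is to query pg_settings on each box (a sketch; extend the list of
settings as needed):

    -- Settings most likely to change plans between machines:
    SELECT name, setting
    FROM pg_settings
    WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size',
                   'random_page_cost', 'default_statistics_target');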

-- 
   Richard Huxton
   Archonet Ltd


Re: Query Join Performance

From: "Aaron Bono"
On 4/25/07, Richard Huxton <dev@archonet.com> wrote:
Aaron Bono wrote:
> Performance tanks with this query - it takes over 120 seconds (that is
> where I set the timeout).

> BTW, on our Linux box the full query we run (which adds 3 more tables to
> the whole operation along with more filtering on the zip table) finishes in
> under 10 seconds.  Problem is our development is on Windows and this is a
> real pain for developers to test.

So what's different between the systems. Obvious things to look at:
1. EXPLAIN ANALYSE outputs to show the plans (these presumably are
different, but in what details, and why?)
2. Configuration (particularly memory/cost settings).
3. Hardware.
4. Locale/encoding - these can affect index usage and sorting.

--
   Richard Huxton
   Archonet Ltd

Now I am having the same problem on the Linux box, so I doubt it is the platform.

The biggest problem I notice is when I add a join from a child table (zip_city) to a parent table (zip).  I have filtered the child table down to about 650 records but when I add the join to the parent which has over 800,000 records, performance tanks.  I was able to benchmark two queries last night on my Windows machine:

-- This runs in just over 2 seconds

    select
        nearby_zip_city.zip_id,
        gps_distance(zip_city.longitude::numeric, zip_city.latitude::numeric, nearby_zip_city.longitude::numeric, nearby_zip_city.latitude::numeric) AS distance
    from zip_city
    inner join zip on (
        zip.zip_id = zip_city.zip_id
    )
    inner join zip_city as nearby_zip_city on (
        abs(zip_city.longitude - nearby_zip_city.longitude) <= (50.0 / 60.0)
        AND abs(zip_city.latitude - nearby_zip_city.latitude) <= (50.0 / 60.0)
    )
    where zip.zip_cd = '66105'

-- This takes over 48 seconds and I just added a join from the zip_city child to the zip parent table

    select
        nearby_zip.zip_cd,
        gps_distance(zip_city.longitude::numeric, zip_city.latitude::numeric, nearby_zip_city.longitude::numeric, nearby_zip_city.latitude::numeric) AS distance
    from zip_city
    inner join zip on (
        zip.zip_id = zip_city.zip_id
    )
    inner join zip_city as nearby_zip_city on (
        abs(zip_city.longitude - nearby_zip_city.longitude) <= (50.0 / 60.0)
        AND abs(zip_city.latitude - nearby_zip_city.latitude) <= (50.0 / 60.0)
    )
    -->>> The next 3 lines are the main difference <<<--
    inner join zip as nearby_zip on (
        nearby_zip_city.zip_id = nearby_zip.zip_id
    )
    -->>> End of difference <<<--
    where zip.zip_cd = '66105'

-- Explain plan for faster/first query:
Nested Loop  (cost=45779.82..147990502.45 rows=451678770 width=40) (actual time=5404.943..20151.684 rows=653 loops=1)
  Join Filter: ((abs(("inner".longitude - "outer".longitude)) <= 0.833333333333333::double precision) AND (abs(("inner".latitude - "outer".latitude)) <= 0.833333333333333::double precision))
  ->  Seq Scan on zip_city nearby_zip_city  (cost=0.00..25514.19 rows=901719 width=24) (actual time=0.053..2311.547 rows=901719 loops=1)
  ->  Materialize  (cost=45779.82..45824.90 rows=4508 width=16) (actual time=0.003..0.009 rows=1 loops=901719)
        ->  Hash Join  (cost=8944.55..45775.31 rows=4508 width=16) (actual time=444.657..4490.901 rows=1 loops=1)
              Hash Cond: ("outer".zip_id = "inner".zip_id)
              ->  Seq Scan on zip_city  (cost=0.00..25514.19 rows=901719 width=24) (actual time=0.004..2334.548 rows=901719 loops=1)
              ->  Hash  (cost=8933.90..8933.90 rows=4261 width=8) (actual time=0.296..0.296 rows=1 loops=1)
                    ->  Bitmap Heap Scan on zip  (cost=32.91..8933.90 rows=4261 width=8) (actual time=0.272..0.275 rows=1 loops=1)
                          Recheck Cond: ((zip_cd)::text = '66105'::text)
                          ->  Bitmap Index Scan on zip_zip_cd_key  (cost=0.00..32.91 rows=4261 width=0) (actual time=0.250..0.250 rows=1 loops=1)
                                Index Cond: ((zip_cd)::text = '66105'::text)

-- Explain plan for slower/second query:
Nested Loop  (cost=75372.31..148056286.32 rows=451678770 width=43) (actual time=62688.188..69916.943 rows=653 loops=1)
  Join Filter: ((abs(("inner".longitude - "outer".longitude)) <= 0.833333333333333::double precision) AND (abs(("inner".latitude - "outer".latitude)) <= 0.833333333333333::double precision))
  ->  Hash Join  (cost=29592.49..91298.06 rows=901719 width=27) (actual time=17905.224..52279.151 rows=901719 loops=1)
        Hash Cond: ("outer".zip_id = "inner".zip_id)
        ->  Seq Scan on zip_city nearby_zip_city  (cost=0.00..25514.19 rows=901719 width=24) (actual time=0.044..2888.993 rows=901719 loops=1)
        ->  Hash  (cost=21634.79..21634.79 rows=852279 width=19) (actual time=13925.502..13925.502 rows=852279 loops=1)
              ->  Seq Scan on zip nearby_zip  (cost=0.00..21634.79 rows=852279 width=19) (actual time=0.042..2535.742 rows=852279 loops=1)
  ->  Materialize  (cost=45779.82..45824.90 rows=4508 width=16) (actual time=0.002..0.009 rows=1 loops=901719)
        ->  Hash Join  (cost=8944.55..45775.31 rows=4508 width=16) (actual time=421.374..4453.224 rows=1 loops=1)
              Hash Cond: ("outer".zip_id = "inner".zip_id)
              ->  Seq Scan on zip_city  (cost=0.00..25514.19 rows=901719 width=24) (actual time=0.028..2333.941 rows=901719 loops=1)
              ->  Hash  (cost=8933.90..8933.90 rows=4261 width=8) (actual time=0.604..0.604 rows=1 loops=1)
                    ->  Bitmap Heap Scan on zip  (cost=32.91..8933.90 rows=4261 width=8) (actual time=0.588..0.591 rows=1 loops=1)
                          Recheck Cond: ((zip_cd)::text = '66105'::text)
                          ->  Bitmap Index Scan on zip_zip_cd_key  (cost=0.00..32.91 rows=4261 width=0) (actual time=0.559..0.559 rows=1 loops=1)
                                Index Cond: ((zip_cd)::text = '66105'::text)

It is the join that is killing the query, but I am at a loss as to the best approach to fix it.  I have some workarounds in mind that flatten out the tables, but I would rather not have to do that.

Thanks!
Aaron

--
==================================================================
   Aaron Bono
   Aranya Software Technologies, Inc.
   http://www.aranya.com
   http://codeelixir.com
==================================================================

Re: Query Join Performance

From: Tom Lane
"Aaron Bono" <postgresql@aranya.com> writes:
> The biggest problem I notice is when I add a join from a child table
> (zip_city) to a parent table (zip).  I have filtered the child table down to
> about 650 records but when I add the join to the parent which has over
> 800,000 records, performance tanks.

It shouldn't surprise you that joining a 900k row table to an 800k row table
isn't cheap.

It would certainly be better if the thing delayed the join to nearby_zip
until after it had done the restrictive join.  Your problem is it
doesn't realize that that join condition is restrictive:

> Nested Loop  (cost=45779.82..147990502.45 rows=451678770 width=40) (actual
                                                 ^^^^^^^^^
> time=5404.943..20151.684 rows=653 loops=1)
>   Join Filter: ((abs(("inner".longitude - "outer".longitude)) <=
> 0.833333333333333::double precision) AND (abs(("inner".latitude -
> "outer".latitude)) <= 0.833333333333333::double precision))

which is hardly surprising since the condition is phrased in a way that
isn't amenable to statistical analysis.  You might want to look into
using PostGIS for this sort of thing --- it provides operators that are
better suited to the problem domain, and also allow some modicum of
intelligence in the rowcount estimates.
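
For reference, the same idea is also available in core contrib as the cube +
earthdistance modules (not PostGIS, but they likewise give the planner an
indexable radius filter).  A minimal sketch, assuming both modules are
installed and that earth_distance()'s default units (meters) are acceptable:

    -- GiST expression index over the packed coordinates; note that
    -- ll_to_earth() takes latitude first, then longitude.
    CREATE INDEX zip_city_earth_idx
        ON zip_city USING gist (ll_to_earth(latitude, longitude));

    -- earth_box() builds a bounding cube around the center point;
    -- 50 statute miles = 50 * 1609.344 meters.  (On 8.1 the cube
    -- containment operator is spelled @ rather than @>.)  The box can
    -- overshoot slightly at the corners, hence the exact recheck.
    SELECT zc.zip_id,
           earth_distance(ll_to_earth(zc.latitude, zc.longitude),
                          ll_to_earth(ml.latitude, ml.longitude)) AS meters
    FROM zip_city zc,
         (SELECT zip_city.latitude, zip_city.longitude
            FROM zip
            JOIN zip_city ON zip.zip_id = zip_city.zip_id
           WHERE zip.zip_cd = '64131') ml
    WHERE earth_box(ll_to_earth(ml.latitude, ml.longitude), 50 * 1609.344)
          @> ll_to_earth(zc.latitude, zc.longitude)
      AND earth_distance(ll_to_earth(zc.latitude, zc.longitude),
                         ll_to_earth(ml.latitude, ml.longitude))
          <= 50 * 1609.344;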

Another bad misestimation is here:

>                     ->  Bitmap Heap Scan on zip  (cost=32.91..8933.90 rows=4261
> width=8) (actual time=0.272..0.275 rows=1 loops=1)
>                           Recheck Cond: ((zip_cd)::text = '66105'::text)
>                           ->  Bitmap Index Scan on zip_zip_cd_key  (cost=0.00..32.91
> rows=4261 width=0) (actual time=0.250..0.250 rows=1 loops=1)
>                                 Index Cond: ((zip_cd)::text = '66105'::text)

The error of 4000x here contributes directly to the error in the
top-level row estimate; but this one is a simple scalar condition and I'd
expect our stats code to be able to deal with it.  Are the stats on zip
up-to-date?  Maybe you need to increase the stats target for it.
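
Concretely, the per-column sample size can be raised well above 8.1's default
statistics target of 10, and the table re-analyzed:

    -- Collect a larger statistics sample for zip_cd, then refresh the stats:
    ALTER TABLE zip ALTER COLUMN zip_cd SET STATISTICS 100;
    ANALYZE zip;
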
        regards, tom lane


Re: Query Join Performance

From: "Aaron Bono"
On 4/25/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Aaron Bono" <postgresql@aranya.com> writes:
> The biggest problem I notice is when I add a join from a child table
> (zip_city) to a parent table (zip).  I have filtered the child table down to
> about 650 records but when I add the join to the parent which has over
> 800,000 records, performance tanks.

It shouldn't surprise you that joining a 900k row table to an 800k row table
isn't cheap.

It would certainly be better if the thing delayed the join to nearby_zip
until after it had done the restrictive join.  Your problem is it
doesn't realize that that join condition is restrictive:

> Nested Loop  (cost=45779.82..147990502.45 rows=451678770 width=40) (actual
                                                 ^^^^^^^^^
> time=5404.943..20151.684 rows=653 loops=1)
>   Join Filter: ((abs(("inner".longitude - "outer".longitude)) <=
> 0.833333333333333::double precision) AND (abs(("inner".latitude -
> "outer".latitude)) <= 0.833333333333333::double precision))

which is hardly surprising since the condition is phrased in a way that
isn't amenable to statistical analysis.  You might want to look into
using PostGIS for this sort of thing --- it provides operators that are
better suited to the problem domain, and also allow some modicum of
intelligence in the rowcount estimates.

Another bad misestimation is here:

>                     ->  Bitmap Heap Scan on zip  (cost=32.91..8933.90 rows=4261
> width=8) (actual time=0.272..0.275 rows=1 loops=1)
>                           Recheck Cond: ((zip_cd)::text = '66105'::text)
>                           ->  Bitmap Index Scan on zip_zip_cd_key  (cost=0.00..32.91
> rows=4261 width=0) (actual time=0.250..0.250 rows=1 loops=1)
>                                 Index Cond: ((zip_cd)::text = '66105'::text)

The error of 4000x here contributes directly to the error in the
top-level row estimate; but this one is a simple scalar condition and I'd
expect our stats code to be able to deal with it.  Are the stats on zip
up-to-date?  Maybe you need to increase the stats target for it.

                        regards, tom lane

Looks like a vacuum analyze did the trick.  Performance is beautiful now.  I should have tried that earlier.

I thought I had autovacuum turned on (PostgreSQL 8.1), but I guess it doesn't do an analyze?

Anyway, I will schedule a nightly vacuum analyze - usage is low then, so it won't cause any problems.
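
(One way to schedule that, using the standard vacuumdb client tool; the
database name mydb is illustrative:)

    # crontab entry: vacuum and analyze the whole database nightly at 3am
    0 3 * * * vacuumdb --analyze --quiet mydb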

Thanks!

--
==================================================================
   Aaron Bono
   Aranya Software Technologies, Inc.
   http://www.aranya.com
   http://codeelixir.com
==================================================================

Re: Query Join Performance

From: Alvaro Herrera
Aaron Bono wrote:

> Looks like a vacuum analyze did the trick.  Performance is beautiful now.  I
> should have tried that earlier.
> 
> I thought I had autovacuum turned on (PostgreSQL 8.1), but I guess it
> doesn't do an analyze?
> 
> Anyway, I will schedule a nightly vacuum analyze - usage is low then, so it
> won't cause any problems.

It should have done an analyze at some point.  Unless this is Windows,
in which case there's a bug that precludes autovacuum from running at
all.
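
For reference, on 8.1 autovacuum only runs if the row-level statistics
collector is enabled as well; the relevant postgresql.conf settings are:

    # postgresql.conf (8.1): autovacuum requires row-level statistics
    stats_start_collector = on
    stats_row_level = on
    autovacuum = on

Changing stats_start_collector requires a server restart.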

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.