Re: Query Join Performance - Mailing list pgsql-sql

From Aaron Bono
Subject Re: Query Join Performance
Date
Msg-id bf05e51c0704251757o311dd849n2c6a52e1c68b933e@mail.gmail.com
In response to Re: Query Join Performance  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Query Join Performance  (Alvaro Herrera <alvherre@commandprompt.com>)
List pgsql-sql
On 4/25/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Aaron Bono" <postgresql@aranya.com> writes:
> The biggest problem I notice is when I add a join from a child table
> (zip_city) to a parent table (zip).  I have filtered the child table down to
> about 650 records but when I add the join to the parent which has over
> 800,000 records, performance tanks.

It shouldn't surprise you that joining a 900k row table to an 800k row table
isn't cheap.

It would certainly be better if the planner delayed the join to nearby_zip
until after it had done the restrictive join.  Your problem is that it
doesn't realize the join condition is restrictive:

> Nested Loop  (cost=45779.82..147990502.45 rows=451678770 width=40) (actual time=5404.943..20151.684 rows=653 loops=1)
                                                 ^^^^^^^^^
>   Join Filter: ((abs(("inner".longitude - "outer".longitude)) <= 0.833333333333333::double precision) AND (abs(("inner".latitude - "outer".latitude)) <= 0.833333333333333::double precision))

which is hardly surprising since the condition is phrased in a way that
isn't amenable to statistical analysis.  You might want to look into
using PostGIS for this sort of thing --- it provides operators that are
better suited to the problem domain, and also allow some modicum of
intelligence in the rowcount estimates.
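
For instance, a rough sketch of what that might look like (assuming
PostGIS is installed and a geometry column has been built from the
latitude/longitude pairs -- the column and index names here are only
illustrative):

    -- One-time setup: a spatial column plus a GiST index
    -- (geometry typmod syntax as in recent PostGIS; SRID 4326 = lon/lat)
    ALTER TABLE zip ADD COLUMN geom geometry(Point, 4326);
    UPDATE zip SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);
    CREATE INDEX zip_geom_idx ON zip USING gist (geom);

    -- The "nearby" test then becomes an indexable, estimable operator
    -- instead of abs() arithmetic the planner cannot see through.
    -- Note: ST_DWithin measures straight-line distance (in degrees here),
    -- not the per-axis box of the original query.
    SELECT near.zip_cd
      FROM zip AS home
      JOIN zip AS near
        ON ST_DWithin(home.geom, near.geom, 0.833333333333333)
     WHERE home.zip_cd = '66105';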

Another bad misestimation is here:

>                     ->  Bitmap Heap Scan on zip  (cost=32.91..8933.90 rows=4261 width=8) (actual time=0.272..0.275 rows=1 loops=1)
>                           Recheck Cond: ((zip_cd)::text = '66105'::text)
>                           ->  Bitmap Index Scan on zip_zip_cd_key  (cost=0.00..32.91 rows=4261 width=0) (actual time=0.250..0.250 rows=1 loops=1)
>                                 Index Cond: ((zip_cd)::text = '66105'::text)

The 4000x error here (4261 rows estimated vs. 1 actual) contributes
directly to the error in the top-level row estimate; but this one is a
simple scalar condition, and I'd expect our stats code to be able to deal
with it.  Are the stats on zip up-to-date?  Maybe you need to increase
the stats target for it.
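
If they aren't, something along these lines should do it (the SET
STATISTICS target of 1000 is only an example):

    -- Refresh the planner statistics on zip:
    ANALYZE zip;

    -- Or gather a finer histogram for zip_cd first, then re-analyze:
    ALTER TABLE zip ALTER COLUMN zip_cd SET STATISTICS 1000;
    ANALYZE zip;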

                        regards, tom lane

Looks like a VACUUM ANALYZE did the trick.  Performance is beautiful now.  I should have tried that earlier.

I thought I had autovacuum turned on (PostgreSQL 8.1), but I guess it doesn't do ANALYZE?
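
From what I can tell in the docs, 8.1's autovacuum does issue ANALYZE,
but it is off by default and needs the stats collector, so I may have
been missing settings like these in postgresql.conf:

    # postgresql.conf (PostgreSQL 8.1) -- autovacuum requires row-level stats
    stats_start_collector = on
    stats_row_level = on
    autovacuum = on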

Anyway, I will schedule a nightly VACUUM ANALYZE - the database sees little overnight usage, so it won't cause any problems.
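
Something like this crontab entry, say (the path and schedule are just
an example):

    # Nightly at 3 AM: VACUUM ANALYZE every database in the cluster
    0 3 * * * /usr/local/pgsql/bin/vacuumdb --all --analyze --quiet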

Thanks!

--
==================================================================
   Aaron Bono
   Aranya Software Technologies, Inc.
   http://www.aranya.com
   http://codeelixir.com
==================================================================
