Re: 8.4.4, 9.0, and 9.1 Planner Differences - Mailing list pgsql-performance

From Anthony Presley
Subject Re: 8.4.4, 9.0, and 9.1 Planner Differences
Date
Msg-id CAO2AxyrxQ=4sD7OnUuz8AADBs5176M=dQ-MhFbWLGJ3aDj=kJQ@mail.gmail.com
Whole thread Raw
In response to Re: 8.4.4, 9.0, and 9.1 Planner Differences  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: 8.4.4, 9.0, and 9.1 Planner Differences  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
On Sat, Oct 22, 2011 at 10:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Anthony Presley <anthony@resolution.com> writes:
> We have a dev machine running 9.0.1 (an i3 laptop, with a regular hard disk,
> with 4GB of RAM, and a mostly untuned postgresql.conf file).  The changed
> lines are:
>   shared_buffers = 512MB
>   temp_buffers = 48MB
>   work_mem = 32MB
>   maintenance_work_mem = 348MB
>   checkpoint_segments = 10
>   effective_cache_size = 512MB

> The same database is loaded onto a production server running 9.1.1 (dual QC
> processors, RAID-10 SAS drives, 36GB of RAM), which replicates to a backup
> server.  This has a lot of changed properties:
>   shared_buffers = 8500MB
>   work_mem = 35MB
>   maintenance_work_mem = 512MB
>   wal_level = hot_standby
>   checkpoint_segments = 50
>   max_wal_senders = 3
>   wal_keep_segments = 144
>   random_page_cost = 1.0
>   effective_cache_size = 16384MB
>   effective_io_concurrency = 6

That random_page_cost setting is going to have a huge effect on the
planner's choices, and the larger effective_cache_size setting will
likely affect plans too.  I don't find it surprising in the least
that you're getting different plan choices ... and even less so when
your "dev" and "production" DBs aren't even the same major version.
You might want to think about making your dev environment more like
your production.

Tom - thanks for your input.

Upgrading to 9.1.1 on the dev box is certainly next on our list ... I like to make sure that the dev team uses a MUCH slower box than the production server, making sure that if the developers are making things fast for the machines, it's really fast on the production box.  For all of our queries except this one, this strategy is "working".

> The same DB is loaded on both the production and the dev environment, and in
> all cases (about 5000 distinct different queries), the production
> environment is about 500x faster, except for one type of query (both
> databases were loaded from the same pg_dump on an 8.4.4 database):

>     On the dev box, we have:  http://explain.depesz.com/s/rwU   - about 131
> seconds
>     On the production box, we have:  http://explain.depesz.com/s/3dt  -
> about .25 seconds

Did you mislabel these?  Because if you didn't, the numbers are right
in line with what you say above.  But anyway, the problem with the
slower query appears to be poor rowcount estimates, leading the planner
to use a nestloop join when it shouldn't.  You haven't provided nearly
enough context to let anyone guess why the estimates are off, other
than boilerplate suggestions like making sure the tables have been
ANALYZEd recently, and maybe increasing the statistics targets.

I *did* mis-label them.  The 131 seconds is actually the production box.  IE:
  production is ... http://explain.depesz.com/s/rwU

The .25 seconds is the development box.  IE:
  development is ... http://explain.depesz.com/s/3dt

I wasn't surprised that the plans are different.  I was surprised that the development box *spanked* the production system.

Here's the actual query:
select 
    preference0_.*
from
    preference preference0_, location location1_, employee employee2_
where
    preference0_.employee_id=employee2_.id 
    and preference0_.location_id=location1_.id 
    and location1_.corporation_id=41197 
    and employee2_.deleted='N' 
    and preference0_.deleted='N' 
    and
    (preference0_.id not in (
    select preference3_.id from preference preference3_, location location4_, employee employee5_ 
where preference3_.employee_id=employee5_.id and preference3_.location_id=location4_.id 
and location4_.corporation_id=41197 and employee5_.deleted='N' and preference3_.deleted='N' 
and (preference3_.startDate>'2011-11-03 00:00:00' or preference3_.endDate<'2011-11-02 00:00:00'))
    ) and
    (preference0_.employee_id in
        (select employee6_.id from employee employee6_ inner join app_user user7_ on employee6_.user_id=user7_.id 
inner join user_location userlocati8_ on user7_.id=userlocati8_.user_id, location location9_ 
where userlocati8_.location_id=location9_.id and userlocati8_.location_id=6800 and userlocati8_.deleted='N' 
and location9_.deleted='N' and employee6_.deleted='N')
    ) order by preference0_.date_created;

I have tried setting the statistics on employee.user_id to be 100 and 1000, and the rest are the default (100).

I've run both an "ANALYZE" and a "VACUUM ANALYZE" on the production system - both "generally", and on each of the above tables (employee, app_user, location, preference).

Here's an updated explain of the most recent attempt.  About 5 minutes after I analyzed them:

What else would I need to provide?


--
Anthony Presley

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: 8.4.4, 9.0, and 9.1 Planner Differences
Next
From: Stefan Keller
Date:
Subject: hstore query: Any better idea than adding more memory?