Thread: poor performing plan from analyze vs. fast default plan pre-analyze on new database

Hi all --

In attempting to perform a particular query in postgres, I ran into
what appeared to be a pretty severe performance bottleneck.  I didn't
know whether I'd constructed my query poorly, misconfigured postgres,
or something else was going on.
Though I very much wanted to know the underlying cause of this
performance issue on my query, I was also facing the constraint that I
was working with a system in production and the tables my query
operated on could not be taken offline.  So I did the sensible thing
and captured a snapshot of the current production system via simple
pg_dump and played it back into a different postgres installation.
The snapshot copy was installed onto a "development system" with a
Pentium D class processor running postgres 8.1 on CentOS 4 32-bit
whereas the production copy was on a "production system" with a Core 2
Duo class processor running postgres 8.3 on OpenSUSE 10.2 64-bit.
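
For reference, the snapshot-and-replay was essentially a plain dump
piped into the other server, something like this (a sketch -- the
hostname is hypothetical, and since the dump came from an 8.3 server
and was restored into 8.1, some newer dump output might need touching
up):

$ createdb smirkfp
$ pg_dump -h prod.example.com smirkfp | psql smirkfp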

I resubmitted a modified version of the query that'd given me problems
before, this time giving the same query to both systems at the same
time.  To my surprise, the query finished successfully on the
development system in under a minute.  I let the query continue to run
on the production system while I verified the results from the query
on the development system -- everything checked out.  It took more
than a day and a half for the query to complete on the production
system.  Granted, the production system continued to see inserts to
the relevant table while my query ran, and its copy of the table had
grown since the snapshot (though by no more than 50%), but I found it
very difficult to accept that these conditions alone explained such a
massive difference in performance.  Here's the query and the response
to console, including \timing info, as submitted on the development
system:
smirkfp=# insert into dupids ( id ) select id from content where id
not in (select min(id) from content group by hash);
INSERT 0 437391
Time: 25733.394 ms
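
Since exactly one min(id) survives per hash, the number of rows landing
in dupids should equal total rows minus distinct hashes.  A quick
consistency check along those lines (a sketch, assuming hash has no
NULLs):

smirkfp=# select count(*) - count(distinct hash) from content;
smirkfp=# select count(*) from dupids;
-- the two counts should agree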

Here is the same query with the very different timing results seen on
the production system (142622702 ms is roughly 39.6 hours):
smirkfp=# insert into dupids ( id ) select id from content where id
not in (select min(id) from content group by hash);
INSERT 0 441592
Time: 142622702.430 ms

A little more background:  The table of interest, content, has around
1.5M rows on the production system and around 1.1M rows on the
development system at the time this query was run.  On both systems,
the smirkfp databases are centered around this table, content, and
have no other large tables or activities of interest outside this
table.  The database was new enough that nothing had yet been vacuumed
or analyzed.  Neither the development nor
production system had noteworthy processor load or disk activity
outside postgres.  On the production system, the above long-running
query was pegging one of the processor cores at ~100% for almost the
whole time of the query's execution.

On the development system, I asked postgres to show me the execution
plan for my query:
smirkfp=# explain insert into dupids ( id ) select id from content
where id not in (select min(id) from content group by hash);
                                  QUERY PLAN
---------------------------------------------------------------------------------
 Seq Scan on content  (cost=204439.55..406047.33 rows=565752 width=4)
  Filter: (NOT (hashed subplan))
  SubPlan
    ->  HashAggregate  (cost=204436.55..204439.05 rows=200 width=36)
          ->  Seq Scan on content  (cost=0.00..198779.03 rows=1131503 width=36)
(5 rows)


For comparison, I asked the same thing of the production system:
smirkfp=# explain insert into dupids ( id ) select id from content
where id not in (select min(id) from content group by hash);
                                        QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on content  (cost=493401.85..9980416861.63 rows=760071 width=4)
  Filter: (NOT (subplan))
  SubPlan
    ->  Materialize  (cost=493401.85..504915.85 rows=646400 width=37)
          ->  GroupAggregate  (cost=468224.39..487705.45 rows=646400 width=37)
                ->  Sort  (cost=468224.39..472024.74 rows=1520142 width=37)
                      Sort Key: public.content.hash
                      ->  Seq Scan on content  (cost=0.00..187429.42 rows=1520142 width=37)
(8 rows)


Looks pretty different.  Next, I thought I'd try asking the
development system to analyze the table, content, and see if that
changed anything:
smirkfp=# analyze content;
ANALYZE
smirkfp=# explain insert into dupids ( id ) select id from content
where id not in (select min(id) from content group by hash);
                                        QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on content  (cost=480291.35..7955136280.55 rows=582888 width=4)
  Filter: (NOT (subplan))
  SubPlan
    ->  Materialize  (cost=480291.35..492297.85 rows=656050 width=40)
          ->  GroupAggregate  (cost=457245.36..474189.30 rows=656050 width=40)
                ->  Sort  (cost=457245.36..460159.80 rows=1165776 width=40)
                      Sort Key: hash
                      ->  Seq Scan on content  (cost=0.00..199121.76 rows=1165776 width=40)
(8 rows)


That looks a bit discouraging.  As a consequence of the analyze, the
development system has switched to the same execution plan as the
production system.  Let's drop the table, dupids, re-create it, and
re-run the query on the development system with its new execution
plan:
smirkfp=# drop table dupids;
DROP TABLE
Time: 227.843 ms
smirkfp=# create table dupids ( id integer );
CREATE TABLE
Time: 30.980 ms
smirkfp=# insert into dupids ( id ) select id from content where id
not in (select min(id) from content group by hash);
Cancel request sent
ERROR:  canceling statement due to user request
smirkfp=#

I grew impatient and canceled the query after letting it run for an
hour.  I don't know precisely how long it would've taken with this
execution plan, but it was clearly far worse than before.
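
In hindsight, a session-level statement_timeout would have spared the
manual cancel for experiments like this (a sketch; the value is plain
milliseconds, which works on both 8.1 and 8.3):

smirkfp=# set statement_timeout = 600000;
SET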

How should I approach steering the planner back to the more efficient
plan?  What characteristics of the table could have led analyze to
produce the much slower query plan?


Thanks in advance,

Davin

Re: poor performing plan from analyze vs. fast default plan pre-analyze on new database

From: Grzegorz Jaśkiewicz

PostgreSQL isn't very efficient with subselects like that; try:
explain select c.id from content c LEFT JOIN (select min(id) AS id
from content group by hash) cg ON cg.id=c.id WHERE cg.id is null;
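
Applied to the original statement, that rewrite would look something
like:

insert into dupids ( id )
select c.id from content c
LEFT JOIN (select min(id) AS id from content group by hash) cg
ON cg.id = c.id
WHERE cg.id is null;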

Re: poor performing plan from analyze vs. fast default plan pre-analyze on new database

From: Tom Lane

Davin Potts <davin@appliomics.com> writes:
> How should I approach steering the planner back to the more efficient
> plan?  What characteristics of the table could have led analyze to
> produce the much slower query plan?

What's evidently happening is that the planner is backing off from using
a hashed subplan because it thinks the hashtable will require more than
work_mem.  Is 646400 a reasonably good estimate of the number of rows
that the sub-select will produce?  If it's a large overestimate, then
perhaps increasing the stats target for content.hash will help.  If
it's good, then what you want to do is increase work_mem to allow the
planner to use the better plan.
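
Both knobs can be tried from a psql session -- a sketch (the statistics
target of 100 is just an illustrative starting point; note that 8.1
predates unit suffixes, so on the development box work_mem takes plain
kilobytes):

smirkfp=# alter table content alter column hash set statistics 100;
ALTER TABLE
smirkfp=# analyze content;
ANALYZE
smirkfp=# set work_mem = '64MB';  -- on 8.1: set work_mem = 65536;
SET
smirkfp=# explain insert into dupids ( id ) select id from content
where id not in (select min(id) from content group by hash);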

            regards, tom lane

Re: poor performing plan from analyze vs. fast default plan pre-analyze on new database

On 6/3/09 8:42 AM, "Davin Potts" <davin@appliomics.com> wrote:

> [snip -- same background, query, and explain plans as above]
>
> How should I approach steering the planner back to the more efficient
> plan?  What characteristics of the table could have led analyze to
> produce the much slower query plan?
>

When the table was analyzed, the planner's estimate of the number of
distinct hash values jumped from the pre-analyze default guess (the
rows=200 in the HashAggregate) to several hundred thousand.  If
Postgres thinks the hash aggregate won't fit in work_mem, it will fall
back to a sort -> group aggregate plan even if it estimates the sort
plan to take thousands of times more effort.

Solutions:

1.  If this estimate is wrong, try increasing the statistics target on the
column(s) in question and re-analyzing.  Or, for a test, change the global
default_statistics_target and experiment.  In the explain output you
posted, it appears as though this approach may have only a moderate
effect.  Is the estimate of 650,000 distinct values for the hash column
accurate?
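
A sketch of that session-level experiment (100 is an arbitrary value;
the default in 8.3 is 10, and the new target only takes effect for
columns re-analyzed afterward):

smirkfp=# set default_statistics_target = 100;
SET
smirkfp=# analyze content;
ANALYZE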

2.  Increase work_mem so that the hash fits in memory and the planner can
use that plan rather than sorting the whole table.  The explain from your
production db estimates it needs to hash about 650,000 distinct rows of
width 37.  That should fit in 32MB of RAM or so.
Try work_mem of 16MB, 32MB, and 64MB (and perhaps even 128MB or larger on
the test box) and see if the explain changes.

  set work_mem = '32MB';
  explain <your query>;

To see your current work_mem setting:
  show work_mem;

If this approach works, you can either set this work_mem before running the
query, or globally change it.  It is set low by default because if all your
connections are busy doing work that requires work_mem, you can end up using
RAM at up to about (2 * work_mem * active connections).
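For example, 100 busy connections each using 64MB of work_mem twice
over could consume up to about 2 * 64MB * 100 = 12800MB, or roughly
12.5GB of RAM.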

On my larger db, the back-end aggregate-processing connections use
work_mem = 750MB, which allows queries with 150M+ rows to aggregate in
minutes rather than sort for half a day.  However, we carefully cap the
number of processes that can concurrently run such queries.
