Thread: Bad plan by Planner (Already resolved?)

Bad plan by Planner (Already resolved?)

From: Robins Tharakan
Hi,

I stumbled upon a situation where the planner comes up with a bad query
plan, but I want to mention upfront that I'm using a dated PG version
and I already see an update that mentions improved planner performance.
I just wanted to check whether this issue is already resolved, and if
so, which version I should be eyeing.

My PG Version: 8.4.7
Probably solved in: 8.4.8 / 9.0.4 ?

Issue: It seems that the planner is unable to flatten the IN sub-query,
so it settles on a bad plan that takes ages (>2500 seconds) and expects
to produce a 100-million-row output, where in fact it should return six
rows. When the same IN query is flattened by hand, PG gives the correct
result in a fraction of a second.

Do let me know if this is a new case. I could try to give you the
EXPLAIN ANALYSE outputs / approximate table sizes if required.

EXISTING QUERY:
     SELECT field_b FROM large_table_a
         JOIN large_table_b USING (field_b)
     WHERE field_a IN (SELECT large_table_b.field_a
     FROM large_table_b WHERE field_b = 2673056)

RECOMMENDED QUERY:
     SELECT s1.field_b FROM large_table_a
         JOIN large_table_b s1 USING (field_b)
         JOIN large_table_b s2 ON s1.field_a = s2.field_a
     WHERE s2.field_b = 2673056

--
Robins Tharakan



Re: Bad plan by Planner (Already resolved?)

From: Kevin Grittner
Robins Tharakan <robins.tharakan@comodo.com> wrote:

> I stumbled upon a situation where the planner comes up with a bad
> query plan, but I want to mention upfront that I'm using a dated
> PG version and I already see an update that mentions improved
> planner performance. I just wanted to check whether this issue is
> already resolved, and if so, which version I should be eyeing.
>
> My PG Version: 8.4.7
> Probably solved in: 8.4.8 / 9.0.4 ?

First off, did you use pg_upgrade from an earlier major release?  If
so, be sure you've dealt with this issue:

http://wiki.postgresql.org/wiki/20110408pg_upgrade_fix

Second, the releases you should be considering are on this page:

http://www.postgresql.org/

So 8.4.9, 9.0.5, or 9.1.1.

If anybody recognized the issue from your description, they probably
would have posted by now.  The fact that there has been no such post
doesn't necessarily mean it's not fixed -- the description is a
little vague without table definitions and EXPLAIN ANALYZE output,
so people might just not be sure.  Since it's arguably in your best
interest to update at least to 8.4.9 anyway, the easiest way to get
your answer might be to do so and test it.

http://www.postgresql.org/support/versioning

> Do let me know if this is a new case. I could try to give you the
> EXPLAIN ANALYSE outputs / approximate table sizes if required.

If you establish that the latest versions of the software still show
the issue, please post with more information, as described here:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

-Kevin

Re: Bad plan by Planner (Already resolved?)

From: Robins Tharakan
Hi,

I'll try to answer in-line.

On 10/17/2011 09:32 PM, Kevin Grittner wrote:
> First off, did you use pg_upgrade from an earlier major release?  If
> so, be sure you've dealt with this issue:
Although I joined recently, I doubt pg_upgrade was used here. And this
doesn't look like the issue either: there are no data-loss problems, and
this seems to be primarily a planner-specific bug.


> the description is a
> little vague without table definitions and EXPLAIN ANALYZE output,
> so people might just not be sure.
Makes sense. It's just that I didn't want to drop in a large mail in
case the issue was a well-known one. Please find the EXPLAIN ANALYSE
output below. I've changed the table names / field names and provided
other details as well.

large_table_a: ~20 million rows
n_dead_tuples / reltuples: ~7%
last analysed: <2 weeks ago

large_table_b: ~140 million rows
n_dead_tuples / reltuples: ~0%
last analysed: <2 days ago

default_statistics_target: 1000

field_a: int (indexed)
field_b: int (indexed)
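
For reference, figures like the ones above can be pulled from
pg_stat_user_tables; roughly something along these lines (just a
sketch, using the masked table names):

     -- sketch only: dead-tuple ratio and last analyse times per table
     SELECT relname,
            n_live_tup,
            n_dead_tup,
            round(100.0 * n_dead_tup / greatest(n_live_tup, 1), 1) AS pct_dead,
            last_analyze,
            last_autoanalyze
     FROM pg_stat_user_tables
     WHERE relname IN ('large_table_a', 'large_table_b');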


> Since it's arguably in your best
> interest to update at least to 8.4.9 anyway, the easiest way to get
> your answer might be to do so and test it.
Frankly, it's slightly difficult to just try out versions. The DB is
>1 TB, and getting that kind of resources just to try out versions for
one query is not that simple; I hope you understand. I have the
workaround implemented, but I just wanted to be sure that this is
addressed in a newer version.


===============
EXISTING QUERY:
       SELECT field_a FROM large_table_a
           JOIN large_table_b USING (field_a)
       WHERE field_b IN (SELECT large_table_b.field_b
       FROM large_table_b WHERE field_a = 2673056)

ANALYSE:
Hash Join  (cost=273247.23..6460088.89 rows=142564896 width=4)
     Hash Cond: (public.large_table_b.field_b = public.large_table_b.field_b)
     ->  Merge Join  (cost=273112.62..5925331.24 rows=142564896 width=8)
           Merge Cond: (large_table_a.field_a = public.large_table_b.field_a)
           ->  Index Scan using "PK_large_table_a" on large_table_a  (cost=0.00..570804.30 rows=22935395 width=4)
           ->  Index Scan using "IX_large_table_b_field_a" on large_table_b  (cost=0.00..4381499.54 rows=142564896 width=8)
     ->  Hash  (cost=133.32..133.32 rows=103 width=4)
           ->  HashAggregate  (cost=132.29..133.32 rows=103 width=4)
                 ->  Index Scan using "IX_large_table_b_field_a" on large_table_b  (cost=0.00..131.87 rows=165 width=4)
                       Index Cond: (field_a = 2673056)

=====================

ALTERNATE QUERY:
       SELECT s1.field_a FROM large_table_a
           JOIN large_table_b s1 USING (field_a)
           JOIN large_table_b s2 ON s1.field_b = s2.field_b
       WHERE s2.field_a = 2673056

ANALYSE:
Nested Loop  (cost=0.00..2368.74 rows=469 width=4) (actual time=0.090..0.549 rows=6 loops=1)
     ->  Nested Loop  (cost=0.00..1784.06 rows=469 width=4) (actual time=0.057..0.350 rows=16 loops=1)
           ->  Index Scan using "IX_large_table_b_field_a" on large_table_b s2  (cost=0.00..131.87 rows=165 width=4) (actual time=0.033..0.046 rows=6 loops=1)
                 Index Cond: (field_a = 2673056)
           ->  Index Scan using "IX_large_table_b_SampleId" on large_table_b s1  (cost=0.00..9.99 rows=2 width=8) (actual time=0.037..0.047 rows=3 loops=6)
                 Index Cond: (s1.field_b = s2.field_b)
     ->  Index Scan using "PK_large_table_a" on large_table_a  (cost=0.00..1.23 rows=1 width=4) (actual time=0.011..0.011 rows=0 loops=16)
           Index Cond: (large_table_a.field_a = s1.field_a)
Total runtime: 0.620 ms


--
Robins Tharakan



Re: Bad plan by Planner (Already resolved?)

From: Mark Kirkwood
On 17/10/11 19:28, Robins Tharakan wrote:
> Hi,
>
> I stumbled upon a situation where the planner comes up with a bad query
> plan, but I want to mention upfront that I'm using a dated PG
> version and I already see an update that mentions improved planner
> performance. I just wanted to check whether this issue is already
> resolved, and if so, which version I should be eyeing.
>
> My PG Version: 8.4.7
> Probably solved in: 8.4.8 / 9.0.4 ?
>
> Issue: It seems that the planner is unable to flatten the IN sub-query,
> so it settles on a bad plan that takes ages (>2500 seconds) and
> expects to produce a 100-million-row output, where in fact it should
> return six rows. When the same IN query is flattened by hand, PG
> gives the correct result in a fraction of a second.
>
> Do let me know if this is a new case. I could try to give you the
> EXPLAIN ANALYSE outputs / approximate table sizes if required.
>
> EXISTING QUERY:
>     SELECT field_b FROM large_table_a
>         JOIN large_table_b USING (field_b)
>     WHERE field_a IN (SELECT large_table_b.field_a
>     FROM large_table_b WHERE field_b = 2673056)
>
> RECOMMENDED QUERY:
>     SELECT s1.field_b FROM large_table_a
>         JOIN large_table_b s1 USING (field_b)
>         JOIN large_table_b s2 ON s1.field_a = s2.field_a
>     WHERE s2.field_b = 2673056
>
>

Poor plans being generated for the subquery variant above were
specifically targeted in 8.4.9. It may be that you don't need the
workaround in that version (or the corresponding later ones - 9.0.5, 9.1.0).

Regards

Mark

Re: Bad plan by Planner (Already resolved?)

From: Kevin Grittner
Robins Tharakan  wrote:

> I'll try to answer in-line.

Thanks; that's the preferred style on PostgreSQL lists.

> On 10/17/2011 09:32 PM, Kevin Grittner wrote:
>> First off, did you use pg_upgrade from an earlier major release?
>> If so, be sure you've dealt with this issue:
> Although I joined recently, I doubt pg_upgrade was used here. And
> this doesn't look like the issue either: there are no data-loss
> problems, and this seems to be primarily a planner-specific bug.

The data loss doesn't happen until transaction ID wraparound -- so if
you had used pg_upgrade to get to where you are, and not used the
recovery techniques I pointed to, you could suddenly start losing
data at a time long after the conversion.  Since you're on a version
which came out before that was discovered I thought it would be
friendly to try to save you that trouble; but if you're sure you're
not in a vulnerable state, that's great.

>> Since it's arguably in your best interest to update at least to
>> 8.4.9 anyway, the easiest way to get your answer might be to do so
>> and test it.

> Frankly, it's slightly difficult to just try out versions. The DB
> is >1 TB, and getting that kind of resources just to try out
> versions for one query is not that simple; I hope you understand.

That I don't understand.  We have found that it takes no longer to
upgrade to a new minor release on a 2.5 TB database cluster than on a
tiny 300 MB cluster.  (With pg_upgrade, it only takes five or ten
minutes of down time to upgrade to a new *major* release on a multi-TB
database, but that's not what we're talking about to get to 8.4.9.)

We build from source, and we include the minor release number in the
prefix for the build, so we can have both old and new software
installed side-by-side.  The path for the client-side executables we
do through a symbolic link, so we can switch that painlessly.  And we
assign the prefix used for the server to an environment variable in
our services script.  So here is our process:

 - Build and install the new minor release.
 - Change the symlink to use it for clients (like pg_dump and psql).
 - Change the service script line that sets the prefix to point to
   the new minor release.
 - Run the service script with "stop" and then run the service script
   with "start".  (Unless your service script does a restart by using
   stop and start, don't run it with "restart", because a PostgreSQL
   restart won't pick up the new executables.)
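
A quick sanity check right after the restart is to connect with psql
and confirm the running server picked up the new build, e.g.:

     -- should report the minor release you just installed (8.4.9 here)
     SELECT version();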

There is literally no more down time than it takes to stop and start
the database service.  Our client software retries on a broken
connection, so we can even do this while users are running and they
just get a clock for a few seconds; but we usually prefer not to
cause even that much disruption, at least during normal business
hours.  We have enough hardware to load balance off of one machine at
a time to do this without interruption of service.

There are sometimes bugs fixed in a minor release which require
cleanup of possibly damaged data, like what I mentioned above.  You
may need to vacuum or reindex something to recover from the damage
caused by the now-fixed bug, but the alternative is to continue to
run with the damage.  I don't understand why someone would knowingly
choose that.

Really, it is worthwhile to keep up on minor releases.

http://www.postgresql.org/support/versioning

Perhaps the difference is that you feel I'm suggesting that you
upgrade in order to see if performance improves.  I'm not.  I'm
suggesting that you upgrade to get the bug fixes and security fixes.
After the upgrade, it would make sense to see if it also fixed your
performance problem.

> I have the workaround implemented, but just wanted to be sure that
> this is accommodated in a newer version.

You've already gotten feedback on that; I don't have anything to add
there.

-Kevin

Re: Bad plan by Planner (Already resolved?)

From: Robins Tharakan
Thanks Kevin,

That's a pretty neat way of managing (at least) minor upgrades. Like I
said, I'm new to this place, so although I'm quite positive about
upgrading to the latest, I should probably take things one at a time and
bring in this idea of implementing regular updates sometime in the future.

As for the query, I tried the same query on an alternate machine, and
this is how EXPLAIN ANALYZE came up. It's much faster than the earlier
slow query, but nowhere near the performance of the second query shown
earlier. Do I have to live with that until this is addressed (if I am
only doing a minor version upgrade), or am I missing something else here?

I've provided the EXPLAIN ANALYZE output below, as well as a web link to
a prettier rendering of it, for your review.


ORIGINAL QUERY (on PostgreSQL 8.4.9):
http://explain.depesz.com/s/bTm

EXPLAIN ANALYZE SELECT field_a FROM large_table_a JOIN large_table_b
USING (field_a) WHERE field_b IN (SELECT large_table_b.field_b FROM
large_table_b WHERE field_a = 2673056) ;


------------------------------------------------------------------------------------------------------------------------------------------------------------
  Nested Loop  (cost=132.97..194243.54 rows=156031 width=4) (actual time=6.612..43179.524 rows=2120 loops=1)
    ->  Nested Loop  (cost=132.97..1107.63 rows=156031 width=4) (actual time=6.576..29122.017 rows=6938 loops=1)
          ->  HashAggregate  (cost=132.97..133.96 rows=99 width=4) (actual time=6.543..12.726 rows=2173 loops=1)
                ->  Index Scan using "IX_large_table_b_SigId" on large_table_b  (cost=0.00..132.56 rows=164 width=4) (actual time=0.029..3.425 rows=2173 loops=1)
                      Index Cond: (field_a = 2673056)
          ->  Index Scan using "IX_large_table_b_field_b" on large_table_b  (cost=0.00..9.81 rows=2 width=8) (actual time=6.732..13.384 rows=3 loops=2173)
                Index Cond: (public.large_table_b.field_b = public.large_table_b.field_b)
    ->  Index Scan using "PK_large_table_a" on large_table_a  (cost=0.00..1.23 rows=1 width=4) (actual time=2.021..2.021 rows=0 loops=6938)
          Index Cond: (large_table_a.field_a = public.large_table_b.field_a)
  Total runtime: 43182.975 ms




OPTIMIZED QUERY (on PostgreSQL 8.4.7):
http://explain.depesz.com/s/emO

EXPLAIN ANALYZE SELECT s1.field_a FROM large_table_a JOIN large_table_b
s1 USING (field_a) JOIN large_table_b s2 ON s1.field_b = s2.field_b
WHERE s2.field_a = 2673056;


------------------------------------------------------------------------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.00..2356.98 rows=494 width=4) (actual time=0.086..96.056 rows=2120 loops=1)
    ->  Nested Loop  (cost=0.00..1745.51 rows=494 width=4) (actual time=0.051..48.900 rows=6938 loops=1)
          ->  Index Scan using "IX_large_table_b_SigId" on large_table_b s2  (cost=0.00..132.56 rows=164 width=4) (actual time=0.028..3.411 rows=2173 loops=1)
                Index Cond: (field_a = 2673056)
          ->  Index Scan using "IX_large_table_b_field_b" on large_table_b s1  (cost=0.00..9.81 rows=2 width=8) (actual time=0.007..0.012 rows=3 loops=2173)
                Index Cond: (s1.field_b = s2.field_b)
    ->  Index Scan using "PK_large_table_a" on large_table_a  (cost=0.00..1.23 rows=1 width=4) (actual time=0.004..0.004 rows=0 loops=6938)
          Index Cond: (large_table_a.field_a = s1.field_a)
  Total runtime: 98.165 ms


--
Robins Tharakan

On 10/18/2011 06:16 PM, Kevin Grittner wrote:
> We build from source, and we include the minor release number in the
> prefix for the build, so we can have both old and new software
> installed side-by-side.  The path for the client-side executables we
> do through a symbolic link, so we can switch that painlessly.  And we
> assign the prefix used for the server to an environment variable in
> our services script.  So here is our process:
>
>   - Build and install the new minor release.
>   - Change the symlink to use it for clients (like pg_dump and psql).
>   - Change the service script line that sets the prefix to point to
>     the new minor release.
>   - Run the service script with "stop" and then run the service script
>     with "start".  (Unless your service script does a restart by using
>     stop and start, don't run it with "restart", because a PostgreSQL
>     restart won't pick up the new executables.)
>
> There is literally no more down time than it takes to stop and start
> the database service.  Our client software retries on a broken
> connection, so we can even do this while users are running and they
> just get a clock for a few seconds; but we usually prefer not to
> cause even that much disruption, at least during normal business
> hours.  We have enough hardware to load balance off of one machine at
> a time to do this without interruption of service.
>
> There are sometimes bugs fixed in a minor release which require
> cleanup of possibly damaged data, like what I mentioned above.  You
> may need to vacuum or reindex something to recover from the damage
> caused by the now-fixed bug, but the alternative is to continue to
> run with the damage.  I don't understand why someone would knowingly
> choose that.
>
> Really, it is worthwhile to keep up on minor releases.
> -Kevin



Re: Bad plan by Planner (Already resolved?)

From: Tom Lane
Robins Tharakan <robins.tharakan@comodo.com> writes:
> ORIGINAL QUERY (on PostgreSQL 8.4.9):
> http://explain.depesz.com/s/bTm

> EXPLAIN ANALYZE SELECT field_a FROM large_table_a JOIN large_table_b
> USING (field_a) WHERE field_b IN (SELECT large_table_b.field_b FROM
> large_table_b WHERE field_a = 2673056) ;

> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>   Nested Loop  (cost=132.97..194243.54 rows=156031 width=4) (actual time=6.612..43179.524 rows=2120 loops=1)
>     ->  Nested Loop  (cost=132.97..1107.63 rows=156031 width=4) (actual time=6.576..29122.017 rows=6938 loops=1)
>           ->  HashAggregate  (cost=132.97..133.96 rows=99 width=4) (actual time=6.543..12.726 rows=2173 loops=1)
>                 ->  Index Scan using "IX_large_table_b_SigId" on large_table_b  (cost=0.00..132.56 rows=164 width=4) (actual time=0.029..3.425 rows=2173 loops=1)
>                       Index Cond: (field_a = 2673056)
>           ->  Index Scan using "IX_large_table_b_field_b" on large_table_b  (cost=0.00..9.81 rows=2 width=8) (actual time=6.732..13.384 rows=3 loops=2173)
>                 Index Cond: (public.large_table_b.field_b = public.large_table_b.field_b)
>     ->  Index Scan using "PK_large_table_a" on large_table_a  (cost=0.00..1.23 rows=1 width=4) (actual time=2.021..2.021 rows=0 loops=6938)
>           Index Cond: (large_table_a.field_a = public.large_table_b.field_a)
>   Total runtime: 43182.975 ms


> OPTIMIZED QUERY (on PostgreSQL 8.4.7):
> http://explain.depesz.com/s/emO

> EXPLAIN ANALYZE SELECT s1.field_a FROM large_table_a JOIN large_table_b
> s1 USING (field_a) JOIN large_table_b s2 ON s1.field_b = s2.field_b
> WHERE s2.field_a = 2673056;

> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>   Nested Loop  (cost=0.00..2356.98 rows=494 width=4) (actual time=0.086..96.056 rows=2120 loops=1)
>     ->  Nested Loop  (cost=0.00..1745.51 rows=494 width=4) (actual time=0.051..48.900 rows=6938 loops=1)
>           ->  Index Scan using "IX_large_table_b_SigId" on large_table_b s2  (cost=0.00..132.56 rows=164 width=4) (actual time=0.028..3.411 rows=2173 loops=1)
>                 Index Cond: (field_a = 2673056)
>           ->  Index Scan using "IX_large_table_b_field_b" on large_table_b s1  (cost=0.00..9.81 rows=2 width=8) (actual time=0.007..0.012 rows=3 loops=2173)
>                 Index Cond: (s1.field_b = s2.field_b)
>     ->  Index Scan using "PK_large_table_a" on large_table_a  (cost=0.00..1.23 rows=1 width=4) (actual time=0.004..0.004 rows=0 loops=6938)
>           Index Cond: (large_table_a.field_a = s1.field_a)
>   Total runtime: 98.165 ms


I suspect that you're just fooling yourself here, and the "optimized"
query is no such thing.  Those plans are identical except for the
insertion of the HashAggregate step, which in itself adds less than
10msec to the runtime, and we can see it's not eliminating any rows
either.  So why does the second one run so much faster?  I can think
of three theories:

1. The tables are horrendously bloated on the first database, so that
many more pages have to be touched to get the same number of tuples.
This would likely indicate an improper autovacuum configuration.

2. You failed to account for caching effects, ie the first example
is being run "cold" and has to actually read everything from disk,
whereas the second example has everything it needs already in RAM.
In that case the speed differential is quite illusory.

3. The HashAggregate would likely spit out the rows in a completely
different order than it received them.  If scanning large_table_b in
the order of IX_large_table_b_SigId happens to yield field_b values
that are very well ordered, it's possible that locality of access in
the other indexscans would be enough better in the second plan to
account for the speedup.  This seems the least likely theory, though.
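
A rough way to get a feel for theory 1, for what it's worth, is to
compare on-disk size against the planner's row estimate, along these
lines:

     -- sketch only: tables grossly oversized for their row count
     -- suggest bloat
     SELECT relname,
            pg_size_pretty(pg_relation_size(oid)) AS on_disk_size,
            reltuples::bigint AS estimated_rows
     FROM pg_class
     WHERE relname IN ('large_table_a', 'large_table_b');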

BTW, how is it that "SELECT large_table_b.field_b FROM
large_table_b WHERE field_a = 2673056" produces no duplicate field_b
values?  Is that just luck?  Is there a unique constraint on the table
that implies it will happen?
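
For instance, a check along these lines (against the same constant)
should show whether any duplicates actually occur:

     -- any rows returned here mean duplicate field_b values
     SELECT field_b, count(*)
     FROM large_table_b
     WHERE field_a = 2673056
     GROUP BY field_b
     HAVING count(*) > 1;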

            regards, tom lane

Re: Bad plan by Planner (Already resolved?)

From: Robins Tharakan
Thanks Tom!

I regret the delay in replying, but two of the three guesses were
spot-on and resolved the doubt: 8.4.9 does take care of this case very well.

On 10/27/2011 01:27 AM, Tom Lane wrote:
> I suspect that you're just fooling yourself here, and the "optimized"
> query is no such thing.
:) I actually meant 'faster' query, but well...


> 1. The tables are horrendously bloated on the first database, so that
> many more pages have to be touched to get the same number of tuples.
> This would likely indicate an improper autovacuum configuration.
I believe you've nailed it pretty accurately. The tables are
horrendously bloated and I may need to tune autovacuum to be much more
aggressive than it is. I did see that the HashAggregate makes only a
minor difference, but what didn't strike me was that the slowness could
be bloat.
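
As a first step I'll probably make autovacuum more aggressive on just
these tables; something along these lines (the numbers are only a first
guess and will need testing):

     -- illustrative per-table settings only (8.4 supports these reloptions)
     ALTER TABLE large_table_b SET (autovacuum_vacuum_scale_factor = 0.02,
                                    autovacuum_analyze_scale_factor = 0.01);
     -- plus a one-off CLUSTER or VACUUM FULL to reclaim the existing bloat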


> 2. You failed to account for caching effects, ie the first example
> is being run "cold" and has to actually read everything from disk,
> whereas the second example has everything it needs already in RAM.
> In that case the speed differential is quite illusory.
In hindsight, this was a miss; I should have warmed the caches before
posting. Re-running this query multiple times brought the result down to
~100 ms.

> BTW, how is it that "SELECT large_table_b.field_b FROM
> large_table_b WHERE field_a = 2673056" produces no duplicate field_b
> values?  Is that just luck?  Is there a unique constraint on the table
> that implies it will happen?
It's just luck. Sometimes the corresponding values genuinely don't exist
in the other table, so that's OK.

--
Robins Tharakan

