Thread: Forcing the use of particular execution plans

Forcing the use of particular execution plans

From: "Tim Truman"
Hi,

I have the following query which has been running very slowly and after a
lot of testing/trial and error I found an execution plan that ran the query
in a fraction of the time (and then lost the statistics that produced it).
What I wish to know is how to force the query to use the faster execution
plan.

Query:
SELECT count(*) as count FROM
(
    SELECT *
        FROM transaction t, merchant m
        WHERE t.merchant_id = m.id
            AND m.id = 198
            AND t.transaction_date >= '20050101'
            AND t.transaction_date <= '20060925'
            AND credit_card_no LIKE '1111%111'

    UNION ALL
    SELECT *
        FROM transaction t, merchant m
        WHERE t.merchant_id = m.id
            AND m.parent_merchant_id = 198
            AND t.transaction_date >= '20050101'
            AND t.transaction_date <= '20060925'
            AND credit_card_no LIKE '1111%111'
) AS foobar

Desired Execution Plan:
Aggregate (cost=97377.90..97377.90 rows=1 width=0)
  -> Subquery Scan foobar (cost=0.00..97377.86 rows=16 width=0)
        -> Append (cost=0.00..97377.70 rows=16 width=636)
              -> Subquery Scan "*SELECT* 1" (cost=0.00..10304.81 rows=3 width=636)
                    -> Nested Loop (cost=0.00..10304.78 rows=3 width=636)
                          -> Index Scan using pk_merchant on merchant m (cost=0.00..5.11 rows=1 width=282)
                                Index Cond: (id = 198)
                          -> Index Scan using ix_transaction_merchant_id on "transaction" t (cost=0.00..10299.64 rows=3 width=354)
                                Index Cond: (198 = merchant_id)
                                Filter: ((transaction_date >= '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND ((credit_card_no)::text ~~ '4564%549'::text))
              -> Subquery Scan "*SELECT* 2" (cost=13.86..87072.89 rows=13 width=636)
                    -> Hash Join (cost=13.86..87072.76 rows=13 width=636)
                          Hash Cond: ("outer".merchant_id = "inner".id)
                          -> Seq Scan on "transaction" t (cost=0.00..87052.65 rows=1223 width=354)
                                Filter: ((transaction_date >= '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND ((credit_card_no)::text ~~ '4564%549'::text))
                          -> Hash (cost=13.85..13.85 rows=4 width=282)
                                -> Index Scan using ix_merchant_parent_merchant_id on merchant m (cost=0.00..13.85 rows=4 width=282)
                                      Index Cond: (parent_merchant_id = 198)

Undesired Execution Plan:
Aggregate  (cost=88228.82..88228.82 rows=1 width=0)
  ->  Subquery Scan foobar  (cost=0.00..88228.73 rows=35 width=0)
        ->  Append  (cost=0.00..88228.38 rows=35 width=631)
              ->  Subquery Scan "*SELECT* 1"  (cost=0.00..1137.61 rows=1 width=631)
                    ->  Nested Loop  (cost=0.00..1137.60 rows=1 width=631)
                          ->  Index Scan using ix_transaction_merchant_id on "transaction" t  (cost=0.00..1132.47 rows=1 width=349)
                                Index Cond: (198 = merchant_id)
                                Filter: ((transaction_date >= '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND ((credit_card_no)::text ~~ '4564%549'::text))
                          ->  Index Scan using pk_merchant on merchant m  (cost=0.00..5.11 rows=1 width=282)
                                Index Cond: (id = 198)
              ->  Subquery Scan "*SELECT* 2"  (cost=20.90..87090.77 rows=34 width=631)
                    ->  Hash Join  (cost=20.90..87090.43 rows=34 width=631)
                          Hash Cond: ("outer".merchant_id = "inner".id)
                          ->  Seq Scan on "transaction" t  (cost=0.00..87061.04 rows=1632 width=349)
                                Filter: ((transaction_date >= '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND ((credit_card_no)::text ~~ '4564%549'::text))
                          ->  Hash  (cost=20.88..20.88 rows=8 width=282)
                                ->  Seq Scan on merchant m  (cost=0.00..20.88 rows=8 width=282)
                                      Filter: (parent_merchant_id = 198)



Thanks for any help/ideas


Tim


Re: Forcing the use of particular execution plans

From: Jochem van Dieten
Tim Truman wrote:
> Query:
> SELECT count(*) as count FROM
> (
>     SELECT *
>         FROM transaction t, merchant m
>         WHERE t.merchant_id = m.id
>             AND m.id = 198
>             AND t.transaction_date >= '20050101'
>             AND t.transaction_date <= '20060925'
>             AND credit_card_no LIKE '1111%111'
>
>     UNION ALL
>     SELECT *
>         FROM transaction t, merchant m
>         WHERE t.merchant_id = m.id
>             AND m.parent_merchant_id = 198
>             AND t.transaction_date >= '20050101'
>             AND t.transaction_date <= '20060925'
>             AND credit_card_no LIKE '1111%111'
> ) AS foobar
>

Actually, I think the best course of action is to rewrite the query to a
faster alternative. What you can try is:
SELECT SUM(count) AS count FROM
(
    SELECT count(*) AS count
        FROM transaction t, merchant m
        WHERE t.merchant_id = m.id
            AND m.id = 198
            AND t.transaction_date >= '20050101'
            AND t.transaction_date <= '20060925'
            AND credit_card_no LIKE '1111%111'

    UNION ALL
    SELECT count(*) AS count
        FROM transaction t, merchant m
        WHERE t.merchant_id = m.id
            AND m.parent_merchant_id = 198
            AND t.transaction_date >= '20050101'
            AND t.transaction_date <= '20060925'
            AND credit_card_no LIKE '1111%111'
) AS foobar;


The next optimization is to merge the 2 subqueries into one. If your
schema is such that m.id cannot be the same as m.parent_merchant_id, I
think your query can be reduced to:
SELECT count(*) AS count
    FROM transaction t, merchant m
    WHERE t.merchant_id = m.id
        AND
        (
            m.id = 198
            OR
            m.parent_merchant_id = 198
        )
        AND t.transaction_date >= '20050101'
        AND t.transaction_date <= '20060925'
        AND credit_card_no LIKE '1111%111'


If m.id can be the same as m.parent_merchant_id you need something like:
SELECT SUM(
    CASE WHEN m.id = m.parent_merchant_id THEN 2 ELSE 1 END
    ) AS count
    FROM transaction t, merchant m
    WHERE t.merchant_id = m.id
        AND
        (
            m.id = 198
            OR
            m.parent_merchant_id = 198
        )
        AND t.transaction_date >= '20050101'
        AND t.transaction_date <= '20060925'
        AND credit_card_no LIKE '1111%111'

Jochem

Re: Forcing the use of particular execution plans

From: "Dave Dutcher"
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of  Tim Truman
>
> Hi,
>
> I have the following query which has been running very slowly
> and after a
> lot of testing/trial and error I found an execution plan that
> ran the query
> in a fraction of the time (and then lost the statistics that
> produced it).
> What I wish to know is how to force the query to use the
> faster execution
> plan.

It would be a bit easier to diagnose the problem if you posted EXPLAIN
ANALYZE rather than just EXPLAIN.  The two plans you posted looked very
similar except for the order of the nested loop in subquery 1 and an index
scan rather than a seq scan in subquery 2.

My guess would be that the order of the nested loop is determined mostly by
estimates of matching rows.  If you ran an EXPLAIN ANALYZE you could tell if
the planner is estimating correctly.  If it is not, you could try increasing
your statistics target and running ANALYZE.

To make the planner prefer an index scan over a seq scan, I would first
check the statistics again, and then you can try setting enable_seqscan to
false (enable_seqscan is meant more for testing than production) or, you
could try reducing random_page_cost, but you should test that against a
range of queries before putting it in production.
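For reference, the knobs mentioned above can be tried in a session along these lines (a sketch only; the statistics target of 100 and the random_page_cost of 2 are illustrative values, not recommendations, and the table/column names follow the query in this thread):

```sql
-- Raise the per-column statistics target and refresh the stats:
ALTER TABLE "transaction" ALTER COLUMN transaction_date SET STATISTICS 100;
ANALYZE "transaction";

-- For testing only: see what plan the planner picks without seqscans.
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT ...;  -- the query from this thread
SET enable_seqscan = on;

-- Alternatively, make random I/O look cheaper than the default of 4:
SET random_page_cost = 2;
```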

Dave


Re: Forcing the use of particular execution plans

From: "Jim C. Nasby"
On Wed, Sep 27, 2006 at 10:51:26AM -0500, Dave Dutcher wrote:
> To make the planner prefer an index scan over a seq scan, I would first
> check the statistics again, and then you can try setting enable_seqscan to
> false (enable_seqscan is meant more for testing than production) or, you
> could try reducing random_page_cost, but you should test that against a
> range of queries before putting it in production.

Index scans are also pretty picky about correlation. If you have really
low correlation you don't want to index scan, but I think our current
estimates make it too eager to switch to a seqscan.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Forcing the use of particular execution plans

From: "Tim Truman"
Here is an "explain analyze" for the query that performs slowly; I hope this
helps. Unfortunately I can't reproduce the version of the query that ran
quickly and therefore can't provide an "explain analyze" for it.

"Aggregate  (cost=88256.32..88256.32 rows=1 width=0) (actual time=55829.000..55829.000 rows=1 loops=1)"
"  ->  Subquery Scan foobar  (cost=0.00..88256.23 rows=35 width=0) (actual time=19235.000..55829.000 rows=24 loops=1)"
"        ->  Append  (cost=0.00..88255.88 rows=35 width=631) (actual time=19235.000..55829.000 rows=24 loops=1)"
"              ->  Subquery Scan "*SELECT* 1"  (cost=0.00..1165.12 rows=1 width=631) (actual time=16.000..16.000 rows=0 loops=1)"
"                    ->  Nested Loop  (cost=0.00..1165.11 rows=1 width=631) (actual time=16.000..16.000 rows=0 loops=1)"
"                          ->  Index Scan using ix_transaction_merchant_id on "transaction" t  (cost=0.00..1159.98 rows=1 width=349) (actual time=16.000..16.000 rows=0 loops=1)"
"                                Index Cond: (198 = merchant_id)"
"                                Filter: ((transaction_date >= '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND ((credit_card_no)::text ~~ '4564%549'::text))"
"                          ->  Index Scan using pk_merchant on merchant m  (cost=0.00..5.11 rows=1 width=282) (never executed)"
"                                Index Cond: (id = 198)"
"              ->  Subquery Scan "*SELECT* 2"  (cost=20.90..87090.76 rows=34 width=631) (actual time=19219.000..55813.000 rows=24 loops=1)"
"                    ->  Hash Join  (cost=20.90..87090.42 rows=34 width=631) (actual time=19219.000..55813.000 rows=24 loops=1)"
"                          Hash Cond: ("outer".merchant_id = "inner".id)"
"                          ->  Seq Scan on "transaction" t  (cost=0.00..87061.04 rows=1630 width=349) (actual time=234.000..55797.000 rows=200 loops=1)"
"                                Filter: ((transaction_date >= '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND ((credit_card_no)::text ~~ '4564%549'::text))"
"                          ->  Hash  (cost=20.88..20.88 rows=8 width=282) (actual time=16.000..16.000 rows=0 loops=1)"
"                                ->  Seq Scan on merchant m  (cost=0.00..20.88 rows=8 width=282) (actual time=0.000..16.000 rows=7 loops=1)"
"                                      Filter: (parent_merchant_id = 198)"
"Total runtime: 55829.000 ms"

Once again any help much appreciated.

Tim

-----Original Message-----
From: Dave Dutcher [mailto:dave@tridecap.com]
Sent: Thursday, 28 September 2006 1:21 AM
To: 'Tim Truman'; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] Forcing the use of particular execution plans


Re: Forcing the use of particular execution plans

From: Tom Lane
"Tim Truman" <tim@advam.com> writes:
> Here is an "explain analyze" for the query that performs slowly,

This shows that the planner is exactly correct in thinking that all
the runtime is going into the seqscan on transaction:

> "Aggregate  (cost=88256.32..88256.32 rows=1 width=0) (actual time=55829.000..55829.000 rows=1 loops=1)"
> ...
> "                          ->  Seq Scan on "transaction" t  (cost=0.00..87061.04 rows=1630 width=349) (actual time=234.000..55797.000 rows=200 loops=1)"
> "                                Filter: ((transaction_date >= '2005-01-01'::date) AND (transaction_date <= '2006-09-25'::date) AND ((credit_card_no)::text ~~ '4564%549'::text))"

Since that component of the plan was identical in your two original
plans ("desired" and "undesired") it seems pretty clear that you have
not correctly identified what your problem is.

            regards, tom lane

Re: Forcing the use of particular execution plans

From: "Tim Truman"
Thanks Tom
The time difference did distract me from the issue. Setting enable_seqscan
to off reduced the runtime greatly, so I am now adjusting the
effective_cache_size and random_page_cost settings to favor index scans
over seq scans.

Regards,
Tim


-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Tuesday, 3 October 2006 1:50 PM
To: Tim Truman
Cc: 'Dave Dutcher'; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Forcing the use of particular execution plans


Re: Forcing the use of particular execution plans

From: Ron Mayer
Jim C. Nasby wrote:
>
> Index scans are also pretty picky about correlation. If you have really
> low correlation you don't want to index scan,

I still don't think "correlation" is the right metric
at all for making this decision.

If you have a list of addresses clustered by "zip"
the "correlation" of State, City, County, etc will all be zero (since
the zip codes don't match the alphabetical order of state or city names)
but index scans are still big wins because the data for any given
state or city will be packed on the same few pages - and in fact
the pages could be read mostly sequentially.
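The zip-code example above can be put in toy numbers (a sketch; the page size, state ordering, and row counts are all made up for illustration):

```python
import random

ROWS_PER_PAGE = 100  # toy page size

# Heap laid out in zip order.  Zip prefixes don't follow the states'
# alphabetical order, so "correlation" between heap position and state
# is poor -- yet each state's rows are still physically packed together.
states_in_zip_order = ["NY", "FL", "CA", "WA", "TX"]  # hypothetical ordering
clustered = [s for s in states_in_zip_order for _ in range(1000)]

shuffled = clustered[:]
random.shuffle(shuffled)  # same rows, no physical locality

def pages_touched(heap, state):
    """Distinct heap pages an index scan for `state` would have to read."""
    return len({i // ROWS_PER_PAGE for i, s in enumerate(heap) if s == state})

print(pages_touched(clustered, "TX"))  # 10 pages: 1000 rows packed together
print(pages_touched(shuffled, "TX"))   # close to all 50 pages of the table
```

Despite near-zero ordering correlation, the clustered layout reads a tenth of the pages, which is Ron's point.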

> but I think our current
> estimates make it too eager to switch to a seqscan.

Re: Forcing the use of particular execution plans

From: "Jim C. Nasby"
Adding -performance back in.

On Tue, Oct 03, 2006 at 05:10:04PM -0700, Ron Mayer wrote:
> Jim C. Nasby wrote:
> >
> > Index scans are also pretty picky about correlation. If you have really
> > low correlation you don't want to index scan,
>
> I still don't think "correlation" is the right metric
> at all for making this decision.
>
> If you have a list of addresses clustered by "zip"
> the "correlation" of State, City, County, etc will all be zero (since
> the zip codes don't match the alphabetical order of state or city names)
> but index scans are still big wins because the data for any given
> state or city will be packed on the same few pages - and in fact
> the pages could be read mostly sequentially.

That's a good point that I don't think has been considered before. I
think correlation is still somewhat important, but what's probably far
more important is data localization.

One possible way to calculate this would be to note the location of
every tuple with a given value in the heap. Calculate the geometric mean
of those locations (I think you could essentially average all the
ctids), and divide that by the average distance of each tuple from that
mean (or maybe the reciprocal of that would be more logical).
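A toy sketch of that metric (positions stand in for ctid block numbers; purely illustrative hand-waving, not a proposal for actual planner code):

```python
def localization(positions):
    """Score how tightly packed the matching tuples are: the reciprocal
    of the average distance from their mean heap position.  Packed rows
    score high (an index scan touches few pages); scattered rows score
    low (a seqscan may win)."""
    center = sum(positions) / len(positions)
    spread = sum(abs(p - center) for p in positions) / len(positions)
    return 1.0 / spread if spread else float("inf")

# Block numbers of tuples matching some value, in two hypothetical tables:
packed = [100, 101, 102, 103]          # clustered: high score
scattered = [10, 5000, 90000, 250000]  # spread out: low score
assert localization(packed) > localization(scattered)
```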

Obviously we don't want to scan the whole table to do that, but there
should be some way to do it via sampling as well.

Or perhaps someone knows of a research paper with real data on how to do
this instead of hand-waving. :)

> > but I think our current
> > estimates make it too eager to switch to a seqscan.
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net