Thread: How to influence the planner

How to influence the planner

From
Richard Ray
Date:
I have a table, t1,  with about 12 million rows
The column foo is unique and indexed
The column bar is not unique and is indexed
"select * from t1 where length(bar) = 0;" runs about 2 minutes
"select * from t1 where length(bar) = 0 order by foo ;" ran until I
stopped it after about 20 minutes
My simple solution is "select * into t2 from t1 where length(bar) = 0;"
and "select * from t2 order by foo ;"
Is there a way to make "select * from t1 where length(bar) = 0 order by foo ;"
or something similar work

Thanks
Richard Ray



Re: How to influence the planner

From
Michael Glaesemann
Date:
On Aug 31, 2007, at 13:32 , Richard Ray wrote:

> "select * from t1 where length(bar) = 0;" runs about 2 minutes
> "select * from t1 where length(bar) = 0 order by foo ;" ran until I
> stopped it after about 20 minutes


EXPLAIN ANALYZE will help you see what the planner is doing to  
produce the results. Have you recently ANALYZEd t1? If length(bar) =  
0 is a common operation on this table, you might consider using an  
expression index on t1:

create index t1_length_bar_idx on t1 (length(bar));

You might want to ask on the performance list as well, as this is  
right up their alley.

Hope this gets you started on the right track.

Michael Glaesemann
grzm seespotcode net




Re: How to influence the planner

From
Richard Ray
Date:
On Fri, 31 Aug 2007, Michael Glaesemann wrote:

>
> On Aug 31, 2007, at 13:32 , Richard Ray wrote:
>
>> "select * from t1 where length(bar) = 0;" runs about 2 minutes
>> "select * from t1 where length(bar) = 0 order by foo ;" ran until I
>> stopped it after about 20 minutes
>
>
> EXPLAIN ANALYZE will help you see what the planner is doing to produce the 
> results.
mda=# EXPLAIN ANALYZE select * from t1 where length(bar) = 0 order by foo;
    QUERY PLAN
 
--------------------------------------------------------------- Index Scan using t1_pkey on t1  (cost=0.00..46698478.18

rows=60038 width=334) (actual time=4612.740..2349582.520 rows=32705 
loops=1)   Filter: (length(bar) = 0) Total runtime: 2349614.258 ms
(3 rows)

mda=# EXPLAIN ANALYZE select * from t1 where length(bar) = 0;                                                   QUERY
PLAN
--------------------------------------------------------------- Seq Scan on t1  (cost=100000000.00..102020349.17
rows=60038
 
width=334) (actual time=39.065..108645.233 rows=32705 loops=1)   Filter: (length(bar) = 0) Total runtime: 108677.759
ms
(3 rows)

mda=#

The index for foo on t1 is the primary index t1_pkey
Why is it slower using the index

> Have you recently ANALYZEd t1?

I run vacuum analyze nightly

> If length(bar) = 0 is a common 
> operation on this table, you might consider using an expression index on t1:
>
> create index t1_length_bar_idx on t1 (length(bar));

This is a one time procedure to fix some data but I've had this problem 
before
I'm running PostgreSQL 8.1.0 on Fedora Core 6


>
> You might want to ask on the performance list as well, as this is right up 
> their alley.
>
> Hope this gets you started on the right track.
>
> Michael Glaesemann
> grzm seespotcode net
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings


Re: How to influence the planner

From
Michael Glaesemann
Date:
On Aug 31, 2007, at 16:07 , Richard Ray wrote:

>  Total runtime: 2349614.258 ms
> (3 rows)

Wow. Nearly 40 minutes! What are your work_mem set at? You may want  
to increase work_mem, as it might help with the sort.

> The index for foo on t1 is the primary index t1_pkey
> Why is it slower using the index

Using an index requires first scanning the index and then looking up  
the value in the table, so depending on the number of rows that need  
to be returned, using an index might have more overhead than just  
reading every row of the table (i.e., a sequential scan).

>> Have you recently ANALYZEd t1?
>
> I run vacuum analyze nightly

That might not be often enough. Looking at the number of rows in the  
cost estimate (60K rows) and the actual number of rows (~30K rows),  
it looks like there's a factor of two difference.

>> If length(bar) = 0 is a common operation on this table, you might  
>> consider using an expression index on t1:
>>
>> create index t1_length_bar_idx on t1 (length(bar));
>
> This is a one time procedure to fix some data but I've had this  
> problem before

Depending on the time it takes to build the index, it might prove  
worthwhile even for a one-off query. You're pretty much doing this by  
using a temporary table though.

> I'm running PostgreSQL 8.1.0 on Fedora Core 6

You should upgrade 8.1.9, the latest in the 8.1.x series. This may  
not help your performance issues, but there have been 9 point  
releases since the version you're running which include bug and  
security fixes. Even better, upgrade to 8.2.4, as there may very well  
be performance improvements in 8.2 which help you. You could look  
through the 8.2 release notes to see if any might apply.

Hope this helps.

Michael Glaesemann
grzm seespotcode net




Re: How to influence the planner

From
Tom Lane
Date:
Richard Ray <rray@mstc.state.ms.us> writes:
> On Fri, 31 Aug 2007, Michael Glaesemann wrote:
>> EXPLAIN ANALYZE will help you see what the planner is doing to produce the 
>> results.

> mda=# EXPLAIN ANALYZE select * from t1 where length(bar) = 0 order by foo;
>                                                     QUERY PLAN
> ---------------------------------------------------------------
>   Index Scan using t1_pkey on t1  (cost=0.00..46698478.18 
> rows=60038 width=334) (actual time=4612.740..2349582.520 rows=32705 
> loops=1)
>     Filter: (length(bar) = 0)
>   Total runtime: 2349614.258 ms
> (3 rows)

> mda=# EXPLAIN ANALYZE select * from t1 where length(bar) = 0;
>                                                     QUERY PLAN
> ---------------------------------------------------------------
>   Seq Scan on t1  (cost=100000000.00..102020349.17 rows=60038 
> width=334) (actual time=39.065..108645.233 rows=32705 loops=1)
>     Filter: (length(bar) = 0)
>   Total runtime: 108677.759 ms
> (3 rows)

The problem here is you've got enable_seqscan = off.  Don't do that.
That will make it use an index if it possibly can, whether using one
is a good idea or not.  In this case, since the useful condition on
length(bar) is not indexable, the best available index-using scan
uses the index to implement order by foo ... which is pointless here
in terms of saving runtime.

> I'm running PostgreSQL 8.1.0 on Fedora Core 6

Please update.  There are a *lot* of bugs fixed in the 8.1.x series
since then.
        regards, tom lane


Re: How to influence the planner

From
Tom Lane
Date:
Michael Glaesemann <grzm@seespotcode.net> writes:
> On Aug 31, 2007, at 16:07 , Richard Ray wrote:
>>> If length(bar) = 0 is a common operation on this table, you might  
>>> consider using an expression index on t1:
>> 
>>> create index t1_length_bar_idx on t1 (length(bar));
>> 
>> This is a one time procedure to fix some data but I've had this  
>> problem before

Actually, I just noticed that the OP does have an index on bar,
which means (assuming it's a string data type) that this query is
equivalent toselect * from t1 where bar = ''
which would be a far preferable way to do it because that condition
can use the index.  The Postgres planner is fairly data-type-agnostic
and does not have the knowledge that these are equivalent queries,
so you can't expect it to make that substitution for you.
        regards, tom lane


Re: How to influence the planner

From
Richard Ray
Date:
On Fri, 31 Aug 2007, Tom Lane wrote:

> Richard Ray <rray@mstc.state.ms.us> writes:
>> On Fri, 31 Aug 2007, Michael Glaesemann wrote:
>>> EXPLAIN ANALYZE will help you see what the planner is doing to produce the
>>> results.
>
>> mda=# EXPLAIN ANALYZE select * from t1 where length(bar) = 0 order by foo;
>>                                                     QUERY PLAN
>> ---------------------------------------------------------------
>>   Index Scan using t1_pkey on t1  (cost=0.00..46698478.18
>> rows=60038 width=334) (actual time=4612.740..2349582.520 rows=32705
>> loops=1)
>>     Filter: (length(bar) = 0)
>>   Total runtime: 2349614.258 ms
>> (3 rows)
>
>> mda=# EXPLAIN ANALYZE select * from t1 where length(bar) = 0;
>>                                                     QUERY PLAN
>> ---------------------------------------------------------------
>>   Seq Scan on t1  (cost=100000000.00..102020349.17 rows=60038
>> width=334) (actual time=39.065..108645.233 rows=32705 loops=1)
>>     Filter: (length(bar) = 0)
>>   Total runtime: 108677.759 ms
>> (3 rows)
>
> The problem here is you've got enable_seqscan = off.  Don't do that.
> That will make it use an index if it possibly can, whether using one
> is a good idea or not.  In this case, since the useful condition on
> length(bar) is not indexable, the best available index-using scan
> uses the index to implement order by foo ... which is pointless here
> in terms of saving runtime.
>
>> I'm running PostgreSQL 8.1.0 on Fedora Core 6
>
> Please update.  There are a *lot* of bugs fixed in the 8.1.x series
> since then.

Changing to enable_seqscan = on does solve this problem, thanks
Is there some method of crafting a query that will assert my wishes to the planner

mda=# EXPLAIN ANALYZE select * from (select * from t1 where length(bar) = 0) a order by foo;
                                 QUERY PLAN
 
------------------------------------------------------------------------ Index Scan using t1_pkey on t1
(cost=0.00..46698482.18
 
rows=60038 width=334) (actual time=4784.869..2317363.298 rows=32705 
loops=1)   Filter: (length(bar) = 0) Total runtime: 2317395.137 ms
(3 rows)


When is enable_seqscan = off appropriate

>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
>                http://www.postgresql.org/about/donate
>


Re: How to influence the planner

From
Richard Huxton
Date:
Richard Ray wrote:
> 
> When is enable_seqscan = off appropriate

Never(*)

It's not for normal usage, the various enable_xxx settings do let you 
experiment with different options if you think the planner is making a 
mistake.

(*) OK, sooner or later, if you have enough systems and enough 
performance-critical queries then you might want to use it in 
production. But only if you really know what you're doing and you've 
exhausted you're other options.

--   Richard Huxton  Archonet Ltd


Re: How to influence the planner

From
"Scott Marlowe"
Date:
On 8/31/07, Richard Ray <rray@mstc.state.ms.us> wrote:

> Changing to enable_seqscan = on does solve this problem, thanks
> Is there some method of crafting a query that will assert my wishes to the planner
> When is enable_seqscan = off appropriate

enable_xxx = off are troubleshooting tools.  They override the query planner.

For instance, I had a query that was running slow, and using set
enable_nestloop=off allowed the query to run fast.  However, using
explain analyze I could see that the estimated number of rows was off.Analyze didn't fix it, so I increased the stats
targetfor the column
 
I was working with, reanalyzed, and voila, the query ran fine with
nestloop=on.

So, enable_xxx=off is normally only appropriate when troubleshooting
an issue, not as a fix all.  That's doubly true for
enable_seqscan=off.

If you do have a query that nothing else seems to work on it, you can
set one of the enable_xxx settings off for that connection only and
not worry about messing up all the other sessions connecting to your
db.


Re: How to influence the planner

From
Richard Ray
Date:
Thanks guys
Lesson learned

On Tue, 4 Sep 2007, Scott Marlowe wrote:

> On 8/31/07, Richard Ray <rray@mstc.state.ms.us> wrote:
>
>> Changing to enable_seqscan = on does solve this problem, thanks
>> Is there some method of crafting a query that will assert my wishes to the planner
>> When is enable_seqscan = off appropriate
>
> enable_xxx = off are troubleshooting tools.  They override the query planner.
>
> For instance, I had a query that was running slow, and using set
> enable_nestloop=off allowed the query to run fast.  However, using
> explain analyze I could see that the estimated number of rows was off.
> Analyze didn't fix it, so I increased the stats target for the column
> I was working with, reanalyzed, and voila, the query ran fine with
> nestloop=on.
>
> So, enable_xxx=off is normally only appropriate when troubleshooting
> an issue, not as a fix all.  That's doubly true for
> enable_seqscan=off.
>
> If you do have a query that nothing else seems to work on it, you can
> set one of the enable_xxx settings off for that connection only and
> not worry about messing up all the other sessions connecting to your
> db.
>