Thread: adding 'limit' leads to very slow query

adding 'limit' leads to very slow query

From: "Michael McFarland"
   I'm trying to understand why a particular query is slow, and it seems
like the optimizer is choosing a strange plan.  See this summary:


* I have a large table 'foo', with an index on the primary key 'id' and
an index on a field 'barId'.
> select count(*) from foo;
1,000,000
> select count(*) from foo where barId = 412;
7

* This query happens very quickly.
> explain select * from foo where barId = 412 order by id desc;
Sort ()
  Sort Key: id
  ->  Index Scan using bar_index on foo ()
        Index Cond: (barId = 412)

* But this query takes forever:

> explain select * from foo where barId = 412 order by id desc limit 25;
Limit ()
  ->  Index Scan Backward using primarykey_index on foo ()
        Filter: (barId = 412)


Could anyone shed some light on what might be happening here?

  - Michael


--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

Re: adding 'limit' leads to very slow query

From: John A Meinel
Michael McFarland wrote:

>   I'm trying to understand why a particular query is slow, and it
> seems  like the optimizer is choosing a strange plan.  See this summary:
>
...

>> explain select * from foo where barId = 412 order by id desc limit 25;
>
> Limit ()
>   ->  Index Scan Backward using primarykey_index on foo ()
>         Filter: (barId = 412)
>
>
> Could anyone shed some light on what might be happening here?
>
>  - Michael

It is using the wrong index. The problem is that ORDER BY + LIMIT
generally means the planner can walk an index on the ORDER BY column and
stop as soon as it has LIMIT rows. In this case, however, it needs to
find all of the barId = 412 rows first, since that condition is far more
selective than the limit: with only 7 matching rows in the table, the
backward scan on id walks nearly the whole index and still never
collects 25 matches.

It really sounds like the postgres statistics are out of date. Either
you haven't run VACUUM ANALYZE recently, or you need to keep a higher
statistics target on one or both of barId and id.
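
An untested sketch using the names from your mail (adjust the target to
taste; unquoted identifiers get case-folded, so write barId however the
column was actually created):

    -- raise the per-column statistics target, then re-gather stats
    ALTER TABLE foo ALTER COLUMN barId SET STATISTICS 100;
    VACUUM ANALYZE foo;

    -- then compare the planner's estimated row count against reality
    EXPLAIN ANALYZE
    SELECT * FROM foo WHERE barId = 412 ORDER BY id DESC LIMIT 25;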

John
=:->



Re: adding 'limit' leads to very slow query

From: Stephan Szabo
On Mon, 7 Mar 2005, Michael McFarland wrote:

>    I'm trying to understand why a particular query is slow, and it seems
> like the optimizer is choosing a strange plan.  See this summary:
>
>
> * I have a large table 'foo', with an index on the primary key 'id' and
> an index on a field 'barId'.
> > select count(*) from foo;
> 1,000,000
> > select count(*) from foo where barId = 412;
> 7
>
> * This query happens very quickly.
> > explain select * from foo where barId = 412 order by id desc;
> Sort ()
>   Sort Key: id
>   ->  Index Scan using bar_index on foo ()
>         Index Cond: (barId = 412)
>
> * But this query takes forever:
>
> > explain select * from foo where barId = 412 order by id desc limit 25;
> Limit ()
>   ->  Index Scan Backward using primarykey_index on foo ()
>         Filter: (barId = 412)

You didn't show the row estimates, but I'd guess one of two things.
Either it's expecting more rows matching barId = 412 than there actually
are (which may be solvable by raising the statistics target on the
column and re-analyzing), so that scanning backward on id until it finds
25 matching rows doesn't look like a bad plan; or barId and id are
correlated, which unfortunately is not going to be recognized right now.
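
If you want to see what the planner thinks it knows about the column,
something like this sketch against the pg_stats view shows the gathered
estimates (note that attname is stored case-folded unless the column was
created quoted):

    SELECT n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'foo' AND attname = 'barid';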

Re: adding 'limit' leads to very slow query

From: "Michael McFarland"
    I continue to be stumped by this.  You are right that I should have
listed the estimates provided by explain... basically for the select
where barId = 412, it's estimating there will be 40,000 rows instead of
7, out of what's actually 5 million records in the table.

    So far I've tried increasing the statistics target for the barId
column from the default 10 to 100 (vacuum analyzing afterward), and the
explain plan hasn't changed.  I also notice that afterward, the pg_stats
record for the barId column still only lists the top 5 values of barId
(out of 68 unique values in the table).  Are there any other settings I
could try to improve the detail of the statistics?

    By the way, I think I do have a workaround for this particular query:
         select * from (select * from foo where barId = 412 order by id
desc) as tempview limit 25;
This query uses the bar index and completes instantly.  However, I feel
like I should find the heart of the problem, since bad statistics could
end up affecting other plans, right?
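
    (I believe the subselect works because the planner won't flatten a
subquery that has its own ORDER BY, so the LIMIT can't be pushed down
into it.  A belt-and-braces variant I've seen suggested, assuming OFFSET
0 still acts as an optimization fence, adds it to the subquery:
         select * from (select * from foo where barId = 412 order by id
desc offset 0) as tempview limit 25;
That's speculation on my part, though.)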

  - Mike


On Mon, 7 Mar 2005 23:03:43 -0800 (PST), Stephan Szabo
<sszabo@megazone.bigpanda.com> wrote:

> On Mon, 7 Mar 2005, Michael McFarland wrote:
>
>>    I'm trying to understand why a particular query is slow, and it seems
>> like the optimizer is choosing a strange plan.  See this summary:
>>
>>
>> * I have a large table 'foo', with an index on the primary key 'id' and
>> an index on a field 'barId'.
>> > select count(*) from foo;
>> 1,000,000
>> > select count(*) from foo where barId = 412;
>> 7
>>
>> * This query happens very quickly.
>> > explain select * from foo where barId = 412 order by id desc;
>> Sort ()
>>   Sort Key: id
>>   ->  Index Scan using bar_index on foo ()
>>         Index Cond: (barId = 412)
>>
>> * But this query takes forever:
>>
>> > explain select * from foo where barId = 412 order by id desc limit 25;
>> Limit ()
>>   ->  Index Scan Backward using primarykey_index on foo ()
>>         Filter: (barId = 412)
>
> You didn't show the row estimates, but I'd guess one of two things.
> Either it's expecting more rows matching barId = 412 than there actually
> are (which may be solvable by raising the statistics target on the
> column and re-analyzing), so that scanning backward on id until it finds
> 25 matching rows doesn't look like a bad plan; or barId and id are
> correlated, which unfortunately is not going to be recognized right now.
>

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

Re: adding 'limit' leads to very slow query

From: Stephan Szabo
On Wed, 9 Mar 2005, Michael McFarland wrote:

>     I continue to be stumped by this.  You are right that I should have
> listed the estimates provided by explain... basically for the select
> where barId = 412, it's estimating there will be 40,000 rows instead of
> 7, out of what's actually 5 million records in the table.
>
>     So far I've tried increasing the statistics target for the barId
> column from the default 10 to 100 (vacuum analyzing afterward), and the
> explain plan hasn't changed.  I also notice that afterward, the pg_stats

Did the estimates change at all?

> record for the barId column still only lists the top 5 values of barId
> (out of 68 unique values in the table).  Are there any other settings I
> could try to improve the detail of the statistics?

Well, I'd first try moving up to a statistics target of 1000, in order
to sample a greater number of rows.  I wonder whether the frequencies
differ enough that the sample simply never visits rows with the other
values.  I'm not sure it'll help that much, though; hopefully someone
else will have an idea.
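
Concretely, that would look something like this sketch (names as in your
mail; 1000 is, I believe, the largest target the server accepts):

    ALTER TABLE foo ALTER COLUMN barId SET STATISTICS 1000;
    ANALYZE foo;

    -- check whether more of the 68 distinct values are now sampled
    SELECT most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'foo' AND attname = 'barid';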

>     By the way, I think I do have a workaround for this particular query:
>          select * from (select * from foo where barId = 412 order by id
> desc) as tempview limit 25;
> This query uses the bar index and completes instantly.  However, I feel
> like I should find the heart of the problem, since bad statistics could
> end up affecting other plans, right?

Yeah, it's best to get it to estimate somewhat reasonably before looking
for workarounds.