Thread: Sequential Scan with LIMIT

From: John Meinel
I was looking into another problem, and I found something that surprised
me. Suppose I'm doing "SELECT * FROM mytable WHERE col = 'myval' LIMIT 1".
Now "col" is indexed, but mytable has 500,000 rows, and 'myval' occurs
maybe 100,000 times. Without the LIMIT, this query should definitely do
a sequential scan.

But with the LIMIT, doesn't it know that it will return at most one row,
and thus be able to use the index?

It seems to be doing the LIMIT too late.

The real purpose of this query is to check to see if a value exists in
the column, so there might be a better way of doing it. Here is the demo
info:

# select count(*) from finst_t;
542315

# select count(*) from finst_t where store_id = 539960;
85076

# explain analyze select id from finst_t where store_id = 539960 limit 1;
                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.13 rows=1 width=4) (actual time=860.000..860.000 rows=1 loops=1)
   ->  Seq Scan on finst_t  (cost=0.00..11884.94 rows=88217 width=4) (actual time=860.000..860.000 rows=1 loops=1)
         Filter: (store_id = 539960)
 Total runtime: 860.000 ms

Notice the "actual rows=1", meaning it is aware of the limit as it goes
through the table. But for some reason the planner thinks it is going to
return 88,217 rows. (That is close to the actual 85,076 it would return
if it really had to find all of the rows.)

Now, if I do a select on a value that *does* only have 1 value, it works
fine:

# explain analyze select id from finst_t where store_id = 9605 limit 1;
                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.96 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=1)
   ->  Index Scan using finst_t_store_id_idx on finst_t  (cost=0.00..3.96 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=1)
         Index Cond: (store_id = 9605)
 Total runtime: 0.000 ms

One further thing: I *can* force it to do a fast index scan if I
disable sequential scanning.

# set enable_seqscan to off;
# explain analyze select id from finst_t where store_id = 539960 limit 1;
                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1.59 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=1)
   ->  Index Scan using finst_t_store_id_idx on finst_t  (cost=0.00..140417.22 rows=88217 width=4) (actual time=0.000..0.000 rows=1 loops=1)
         Index Cond: (store_id = 539960)
 Total runtime: 0.000 ms

Could awareness of the LIMIT be added to the planner? Is there a better
way to check for existence?

John
=:->

PS> I'm using postgres 8.0-beta3 on win32 (the latest installer).



Re: Sequential Scan with LIMIT

From: Tom Lane
John Meinel <john@johnmeinel.com> writes:
> I was looking into another problem, and I found something that surprised
> me. Suppose I'm doing "SELECT * FROM mytable WHERE col = 'myval' LIMIT 1".
> Now "col" is indexed, but mytable has 500,000 rows, and 'myval' occurs
> maybe 100,000 times. Without the LIMIT, this query should definitely do
> a sequential scan.

> But with the LIMIT, doesn't it know that it will return at max 1 value,
> and thus be able to use the index?

But the LIMIT will cut the cost of the seqscan case too.  Given the
numbers you posit above, about one row in five will have 'myval', so a
seqscan can reasonably expect to hit the first matching row in the first
page of the table.  This is still cheaper than doing an index scan
(which must require reading at least one index page plus at least one
table page).

The test case you are showing is probably suffering from nonrandom
placement of this particular data value, which is something that the
statistics we keep are too crude to detect.

            regards, tom lane

Re: Sequential Scan with LIMIT

From: John Meinel
Tom Lane wrote:
> John Meinel <john@johnmeinel.com> writes:
>
>>I was looking into another problem, and I found something that surprised
>>me. Suppose I'm doing "SELECT * FROM mytable WHERE col = 'myval' LIMIT 1".
>>Now "col" is indexed, but mytable has 500,000 rows, and 'myval' occurs
>>maybe 100,000 times. Without the LIMIT, this query should definitely do
>>a sequential scan.
>
>
>>But with the LIMIT, doesn't it know that it will return at max 1 value,
>>and thus be able to use the index?
>
>
> But the LIMIT will cut the cost of the seqscan case too.  Given the
> numbers you posit above, about one row in five will have 'myval', so a
> seqscan can reasonably expect to hit the first matching row in the first
> page of the table.  This is still cheaper than doing an index scan
> (which must require reading at least one index page plus at least one
> table page).
>
> The test case you are showing is probably suffering from nonrandom
> placement of this particular data value; which is something that the
> statistics we keep are too crude to detect.
>
>             regards, tom lane

You are correct about non-random placement. I'm a little surprised the
plan doesn't change with other values, then. For instance,

# select count(*) from finst_t where store_id = 52;
13967

Still does a sequential scan for the "select id from..." query.

The only value it uses an index scan for is 9605, which has only 1 row.

It estimates ~18,000 rows, but that is still < 3% of the total data.

This column corresponds to the disk location where files can be found.
So when a storage location fills up, a new one is generally created.
This means that *generally* the numbers increase as you go further into
the table (though that's not guaranteed, as there are multiple locations
open at any one time).

Am I better off in this case just wrapping my query with:

set enable_seqscan to off;
query
set enable_seqscan to on;
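
If I do wrap it, a transaction-scoped setting is probably safer, since
the override then can't leak into later queries if something fails in
between. A sketch:

```sql
BEGIN;
-- SET LOCAL lasts only until the end of the current transaction,
-- so enable_seqscan is restored automatically on COMMIT or ROLLBACK.
SET LOCAL enable_seqscan = off;
SELECT id FROM finst_t WHERE store_id = 539960 LIMIT 1;
COMMIT;
```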


There is still the possibility that there is a better way to determine
existence of a value in a column. I was wondering about something like:

SELECT 1 WHERE EXISTS
    (SELECT id FROM finst_t WHERE store_id=52 LIMIT 1);

Though the second part is the same, so it still does the sequential scan.
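
Another spelling I suppose I could try (just a sketch, and presumably
planned the same way) is selecting EXISTS directly, so the query always
returns exactly one boolean row:

```sql
-- Returns one row: true if any matching row exists, false otherwise.
-- EXISTS stops at the first matching row, so no LIMIT is needed inside.
SELECT EXISTS (SELECT 1 FROM finst_t WHERE store_id = 52);
```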

This isn't critical, I was just trying to understand what's going on.
Thanks for your help.

John
=:->


Re: Sequential Scan with LIMIT

From: Curt Sampson
On Sun, 24 Oct 2004, John Meinel wrote:

> I was looking into another problem, and I found something that surprised
> me. Suppose I'm doing "SELECT * FROM mytable WHERE col = 'myval' LIMIT 1".
> Now "col" is indexed...
> The real purpose of this query is to check to see if a value exists in
> the column,...

When you select all the columns, you're going to force it to go to the
table. If you select only the indexed column, it ought to be able to use
just the index, and never read the table at all. You could also use more
standard and more set-oriented SQL while you're at it:

    SELECT DISTINCT(col) FROM mytable WHERE col = 'myval'

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.NetBSD.org
     Make up enjoying your city life...produced by BIC CAMERA

Re: Sequential Scan with LIMIT

From: Neil Conway
On Mon, 2004-10-25 at 17:17, Curt Sampson wrote:
> When you select all the columns, you're going to force it to go to the
> table. If you select only the indexed column, it ought to be able to use
> just the index, and never read the table at all.

Perhaps in other database systems, but not in PostgreSQL. MVCC
information is only stored in the heap, not in indexes: therefore,
PostgreSQL cannot determine whether a given index entry refers to a
visible tuple, and so it must check the heap even if the index contains
all the columns referenced by the query.

While it would be nice to be able to do index-only scans, there is good
reason for this design decision. Check the archives for past discussion
about the tradeoffs involved.

> You could also use more
> standard and more set-oriented SQL while you're at it:
>
>     SELECT DISTINCT(col) FROM mytable WHERE col = 'myval'

This is likely to be less efficient though.

-Neil



Re: Sequential Scan with LIMIT

From: John Meinel
Curt Sampson wrote:
> On Sun, 24 Oct 2004, John Meinel wrote:
>
>
>>I was looking into another problem, and I found something that surprised
>>me. Suppose I'm doing "SELECT * FROM mytable WHERE col = 'myval' LIMIT 1".
>>Now "col" is indexed...
>>The real purpose of this query is to check to see if a value exists in
>>the column,...
>
>
> When you select all the columns, you're going to force it to go to the
> table. If you select only the indexed column, it ought to be able to use
> just the index, and never read the table at all. You could also use more
> standard and more set-oriented SQL while you're at it:
>
>     SELECT DISTINCT(col) FROM mytable WHERE col = 'myval'
>
> cjs

Well, what you wrote was actually much slower, as it had to scan the
whole table, grab all the matching rows, and then run the DISTINCT over
them at the end.

However, this query worked:


    SELECT DISTINCT(col) FROM mytable WHERE col = 'myval' LIMIT 1;


Now, *why* that works differently from:

SELECT col FROM mytable WHERE col = 'myval' LIMIT 1;
or
SELECT DISTINCT(col) FROM mytable WHERE col = 'myval';

I'm not sure. They all return the same information.

What's also weird is stuff like:

SELECT DISTINCT(NULL) FROM mytable WHERE col = 'myval' LIMIT 1;

which also searches the entire table, running the DISTINCT over NULL for
every row where col = 'myval'. That is as expensive as the non-limited
case. (I'm guessing the LIMIT is applied after the DISTINCT, which is
what causes the problem. SELECT NULL FROM ... still uses a sequential
scan, but it stops after finding the first match.)


Actually, in doing a little more testing, the working query only works
for some of the values. Probably it just increases the estimated cost
enough that the planner switches over. It also has the downside that
when it does switch to a seq scan, it is much more expensive, since it
has to do a sort and a unique over all the entries.

John
=:->


Re: Sequential Scan with LIMIT

From: Jaime Casanova
 --- John Meinel <john@johnmeinel.com> wrote:
> Curt Sampson wrote:
> > On Sun, 24 Oct 2004, John Meinel wrote:
> >
> >> I was looking into another problem, and I found something that
> >> surprised me. Suppose I'm doing "SELECT * FROM mytable WHERE col =
> >> 'myval' LIMIT 1". Now "col" is indexed...
> >> The real purpose of this query is to check to see if a value exists
> >> in the column,...
> >
> > When you select all the columns, you're going to force it to go to
> > the table. If you select only the indexed column, it ought to be able
> > to use just the index, and never read the table at all. You could
> > also use more standard and more set-oriented SQL while you're at it:
> >
> >     SELECT DISTINCT(col) FROM mytable WHERE col = 'myval'
> >
> > cjs
>
> Well, what you wrote was actually much slower, as it had to scan the
> whole table, grab all the rows, and then distinct them in the end.
>
> However, this query worked:
>
>     SELECT DISTINCT(col) FROM mytable WHERE col = 'myval' LIMIT 1;
>
> Now, *why* that works differently from:
>
> SELECT col FROM mytable WHERE col = 'myval' LIMIT 1;
> or
> SELECT DISTINCT(col) FROM mytable WHERE col = 'myval';
>
> I'm not sure. They all return the same information.

of course, both queries will return the same thing, but that's just
because you forced it.

LIMIT and DISTINCT are different things, so they behave and are planned
differently.


>
> What's also weird is stuff like:
> SELECT DISTINCT(NULL) FROM mytable WHERE col = 'myval' LIMIT 1;

why do you want to do such a thing?

regards,
Jaime Casanova

_________________________________________________________
Do You Yahoo!?
News from the United States and Latin America, on Yahoo! Noticias.
Visit us at http://noticias.espanol.yahoo.com

Re: Sequential Scan with LIMIT

From: John Meinel
Jaime Casanova wrote:
[...]
>> I'm not sure. They all return the same information.
>
> of course, both queries will return the same thing, but that's just
> because you forced it.
>
> LIMIT and DISTINCT are different things, so they behave and are
> planned differently.
>
>> What's also weird is stuff like:
>> SELECT DISTINCT(NULL) FROM mytable WHERE col = 'myval' LIMIT 1;
>
> why do you want to do such a thing?
>
> regards,
> Jaime Casanova

I was trying to see if selecting a constant would change things.
I could have done SELECT DISTINCT(1) or just SELECT 1 FROM ...
The idea of the query is that if 'myval' exists in the table, return
something different than if 'myval' does not exist. If you are writing a
function, you can use:

SELECT something...
IF FOUND THEN
   do a
ELSE
   do b
END IF;
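
Spelled out as a complete function, that pattern might look like the
following sketch (store_exists is just a hypothetical name, written
against the finst_t/store_id example from earlier; untested):

```sql
-- Hypothetical helper sketch: returns true if any row matches.
-- PERFORM runs the query for its side effects and sets FOUND.
CREATE OR REPLACE FUNCTION store_exists(p_store_id integer)
RETURNS boolean AS $$
BEGIN
    PERFORM 1 FROM finst_t WHERE store_id = p_store_id LIMIT 1;
    RETURN FOUND;
END;
$$ LANGUAGE plpgsql;
```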

The whole point of this exercise was just to find the cheapest query
for testing whether a value exists in a column. The only thing I've
found for my column is:

SET enable_seqscan TO off;
SELECT col FROM mytable WHERE col = 'myval' LIMIT 1;
SET enable_seqscan TO on;

My column is not distributed well (larger numbers occur later in the
dataset, but may occur many times). In total there are something like
500,000 rows; the number 555647 occurs 100,000 times, but not until row
300,000 or so.

The analyzer looks at the data and says "1/5th of the time it is 555647,
so I can just do a sequential scan, since the odds are I won't have to
look for very long, and then I don't have to load the index". It turns
out this is very bad here: with an index you only have to do a couple of
page reads, instead of reading through 300,000 rows.

Obviously this isn't a general-case solution. But if you have a
situation similar to mine, it might be useful.

(That's one thing with DB tuning. It seems to be very situation
dependent, and it's hard to plan without a real dataset.)

John
=:->


Re: Sequential Scan with LIMIT

From: Jaime Casanova
 --- John Meinel <john@johnmeinel.com> wrote:
> Jaime Casanova wrote:
> [...]
> >> I'm not sure. They all return the same information.
> >
> > of course, both queries will return the same but that's just because
> > you forced it.
> >
> > LIMIT and DISTINCT are different things so they behave and are
> > planned differently.
> >
> >> What's also weird is stuff like:
> >> SELECT DISTINCT(NULL) FROM mytable WHERE col = 'myval' LIMIT 1;
> >
> > why do you want to do such a thing?
> >
> > regards,
> > Jaime Casanova
>
> I was trying to see if selecting a constant would change things.
> I could have done SELECT DISTINCT(1) or just SELECT 1 FROM ...
> The idea of the query is that if 'myval' exists in the table, return
> something different than if 'myval' does not exist. If you are writing
> a function, you can use:
>
> SELECT something...
> IF FOUND THEN
>    do a
> ELSE
>    do b
> END IF;
>
> The whole point of this exercise was just to find the cheapest query
> for testing whether a value exists in a column. The only thing I've
> found for my column is:
>
> SET enable_seqscan TO off;
> SELECT col FROM mytable WHERE col = 'myval' LIMIT 1;
> SET enable_seqscan TO on;
>
> My column is not distributed well (larger numbers occur later in the
> dataset, but may occur many times). In total there are something like
> 500,000 rows; the number 555647 occurs 100,000 times, but not until
> row 300,000 or so.
>
> The analyzer looks at the data and says "1/5th of the time it is
> 555647, so I can just do a sequential scan, since the odds are I won't
> have to look for very long, and then I don't have to load the index".
> It turns out this is very bad here: with an index you only have to do
> a couple of page reads, instead of reading through 300,000 rows.
>
> Obviously this isn't a general-case solution. But if you have a
> situation similar to mine, it might be useful.
>
> (That's one thing with DB tuning. It seems to be very situation
> dependent, and it's hard to plan without a real dataset.)
>
> John
> =:->

In http://www.postgresql.org/docs/faqs/FAQ.html, under "4.8) My queries
are slow or don't make use of the indexes. Why?", it says:

"However, LIMIT combined with ORDER BY often will use
an index because only a small portion of the table is
returned. In fact, though MAX() and MIN() don't use
indexes, it is possible to retrieve such values using
an index with ORDER BY and LIMIT:
    SELECT col
    FROM tab
    ORDER BY col [ DESC ]
    LIMIT 1;"

So, maybe you can try your query as

SELECT col FROM mytable
WHERE col = 'myval'
ORDER BY col
LIMIT 1;

regards,
Jaime Casanova



Re: Sequential Scan with LIMIT

From: John Meinel
Jaime Casanova wrote:
[...]
>
> In http://www.postgresql.org/docs/faqs/FAQ.html under
> "4.8) My queries are slow or don't make use of the
> indexes. Why?" says:
>
> "However, LIMIT combined with ORDER BY often will use
> an index because only a small portion of the table is
> returned. In fact, though MAX() and MIN() don't use
> indexes, it is possible to retrieve such values using
> an index with ORDER BY and LIMIT:
>     SELECT col
>     FROM tab
>     ORDER BY col [ DESC ]
>     LIMIT 1;"
>
> So, maybe you can try your query as
>
> SELECT col FROM mytable
> WHERE col = 'myval'
> ORDER BY col
> LIMIT 1;
>
> regards,
> Jaime Casanova

Thanks for the heads up. This actually worked. All queries against that
table have turned into index scans instead of sequential scans.
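
For the archives, the pattern applied to my table looks like this
sketch; the ORDER BY on the same column the WHERE clause filters on is
what makes the index path attractive to the planner:

```sql
-- Existence check that now uses finst_t_store_id_idx:
SELECT id
FROM finst_t
WHERE store_id = 539960
ORDER BY store_id
LIMIT 1;
```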

John
=:->


Re: Sequential Scan with LIMIT

From: "Jim C. Nasby"
On Sun, Oct 24, 2004 at 04:11:53PM -0400, Tom Lane wrote:
> But the LIMIT will cut the cost of the seqscan case too.  Given the
> numbers you posit above, about one row in five will have 'myval', so a
> seqscan can reasonably expect to hit the first matching row in the first
> page of the table.  This is still cheaper than doing an index scan
> (which must require reading at least one index page plus at least one
> table page).
>
> The test case you are showing is probably suffering from nonrandom
> placement of this particular data value; which is something that the
> statistics we keep are too crude to detect.

Isn't that exactly what pg_stats.correlation is?
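
For reference, that statistic is visible per column in the pg_stats
view; something like the following (against the table from upthread)
shows it, where values near +1 or -1 mean the column's values are laid
out in nearly sorted order on disk:

```sql
-- correlation ranges from -1 to +1; near 0 means this column's values
-- are physically scattered at random through the table.
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'finst_t'
  AND attname = 'store_id';
```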
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Sequential Scan with LIMIT

From: Tom Lane
"Jim C. Nasby" <decibel@decibel.org> writes:
> On Sun, Oct 24, 2004 at 04:11:53PM -0400, Tom Lane wrote:
>> The test case you are showing is probably suffering from nonrandom
>> placement of this particular data value; which is something that the
>> statistics we keep are too crude to detect.

> Isn't that exactly what pg_stats.correlation is?

No.  A far-from-zero correlation gives you a clue that on average, *all*
the data values are placed nonrandomly ... but it doesn't really tell
you much one way or the other about a single data value.

            regards, tom lane

Re: Sequential Scan with LIMIT

From: "Jim C. Nasby"
On Thu, Oct 28, 2004 at 07:49:28PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <decibel@decibel.org> writes:
> > On Sun, Oct 24, 2004 at 04:11:53PM -0400, Tom Lane wrote:
> >> The test case you are showing is probably suffering from nonrandom
> >> placement of this particular data value; which is something that the
> >> statistics we keep are too crude to detect.
>
> > Isn't that exactly what pg_stats.correlation is?
>
> No.  A far-from-zero correlation gives you a clue that on average, *all*
> the data values are placed nonrandomly ... but it doesn't really tell
> you much one way or the other about a single data value.

Maybe I'm confused about what the original issue was, then... it
appeared that you were suggesting PostgreSQL was doing a seq scan
instead of an index scan because it thought it would find the value on
the first page if the data were randomly distributed. If the correlation
is highly non-zero, though, shouldn't it 'play it safe' and assume that,
unless it's picking the min or max value stored in the statistics, it
will be better to do an index scan, since the value it's looking for is
probably in the middle of the table somewhere? I.e., if the values in
the field are between 1 and 5 and the table is clustered on that field,
then clearly an index scan would be better for finding a row with
field=3 than a seq scan.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
