Thread: Query planner, 7.2b1 select ... order by

Query planner, 7.2b1 select ... order by

From
mlw
Date:
I got an interesting question, and I can probably see both sides of any debate,
but.....

Say you have a fairly large table, several million records. In this table you
have a key that has a fairly good number of duplicate rows. It is a users
favorites table, each user will have a number of entries.

My problem is, if you do a select by the user name, it does an index scan. If
you do a select from the whole table, ordered by the user name, it does a
sequential scan not an index scan. It is arguable that this may be a faster
query, but at the cost of many more resources and a very long delay before any
results are returned. Is this the best behavior?

Anyone have any opinions?


cdinfo=# select count(*) from favorites ; count
---------4626568
(1 row)

cdinfo=# explain select * from favorites where scene_name_full = 'someone' ;
NOTICE:  QUERY PLAN:

Index Scan using fav_snf on favorites  (cost=0.00..258.12 rows=64 width=73)

EXPLAIN
cdinfo=# explain select * from favorites order by scene_name_full ;
NOTICE:  QUERY PLAN:

Sort  (cost=1782827.92..1782827.92 rows=4724736 width=73) ->  Seq Scan on favorites  (cost=0.00..113548.36 rows=4724736
width=73)

EXPLAIN
cdinfo=# set enable_seqscan=FALSE ;
SET VARIABLE
cdinfo=# explain select * from favorites order by scene_name_full ;
NOTICE:  QUERY PLAN:

Index Scan using fav_snf on favorites  (cost=0.00..18682771.23 rows=4724736
width=73)

EXPLAIN


Re: Query planner, 7.2b1 select ... order by

From
Tom Lane
Date:
mlw <markw@mohawksoft.com> writes:
> My problem is, if you do a select by the user name, it does an index
> scan. If you do a select from the whole table, ordered by the user
> name, it does a sequential scan not an index scan. It is arguable that
> this may be a faster query, but at the cost of many more resources and
> a very long delay before any results are returned. Is this the best
> behavior?

Unless you use a LIMIT, preferring the cheapest total cost still seems
like a win to me.  (libpq, at least, doesn't give back any results till
the whole query has been retrieved, so the claims of "higher cost" and
"long delay till first result" are both specious.)

If you do use a LIMIT, that affects the plan choice.

If you use a cursor, things get more interesting, since the planner
has no way to know how much of the query you intend to retrieve,
nor whether you'd be willing to sacrifice total time for fast initial
response on the first few rows.  Currently it's set to optimize
plans for cursors on the basis of assuming that 10% of the total rows
will be fetched.  Given the more-than-10X discrepancy between seqscan
and indexscan costs in your example, that'll probably still give you
the seqscan choice.  Hiroshi suggested making this fraction be a
user-settable parameter, which seems like a good idea to me but we
haven't gotten around to it yet.
        regards, tom lane


Re: Query planner, 7.2b1 select ... order by

From
mlw
Date:
Tom Lane wrote:
> 
> mlw <markw@mohawksoft.com> writes:
> > My problem is, if you do a select by the user name, it does an index
> > scan. If you do a select from the whole table, ordered by the user
> > name, it does a sequential scan not an index scan. It is arguable that
> > this may be a faster query, but at the cost of many more resources and
> > a very long delay before any results are returned. Is this the best
> > behavior?
> 
> Unless you use a LIMIT, preferring the cheapest total cost still seems
> like a win to me.  (libpq, at least, doesn't give back any results till
> the whole query has been retrieved, so the claims of "higher cost" and
> "long delay till first result" are both specious.)

The table is pretty big, and I was performing the query with a binary cursor.
It really did take a little while to get results. I was using the query to
perform data analysis on a table. (If you are familiar with NetPerceptions,
think something like that)

The application framework, without any extra processing, executed the entire
query with a sequential scan in about 4 minutes, it performed the index scan in
about 34 minutes. The analysis app, takes about two hours to run with the
sequential scan.

So you are very right, it is much more efficient to run the sequential scan for
the whole table.