Thread: Seq scan vs. Index scan with different query conditions

Seq scan vs. Index scan with different query conditions

From
eleven@ludojad.itpp.pl
Date:
Hello,

Can anybody suggest any hint on this:

temp=> EXPLAIN SELECT DISTINCT "number" FROM "tablex" WHERE "Date" BETWEEN '2004-06-28'::date AND '2004-07-04'::date
AND"Time" BETWEEN '00:00:00'::time AND '18:01:00'::time; 

Unique  (cost=305669.92..306119.43 rows=89 width=8)
    ->  Sort  (cost=305669.92..305894.67 rows=89903 width=8)
        Sort Key: "number"
            ->  Index Scan using "DateTimeIndex" on "tablex"  (cost=0.00..298272.66 rows=89903 width=8)
                Index Cond: (("Date" >= '2004-06-28'::date) AND ("Date" <= '2004-07-04'::date) AND ("Time" >=
'00:00:00'::timewithout time zone) AND ("Time" <= '18:01:00'::time without time zone)) 


temp=> EXPLAIN SELECT DISTINCT "number" FROM "tablex" WHERE "Date" BETWEEN '2004-06-28'::date AND '2004-07-04'::date
AND"Time" BETWEEN '00:00:00'::time AND '19:01:00'::time; 

Unique  (cost=315252.77..315742.27 rows=97 width=8)
    ->  Sort  (cost=315252.77..315497.52 rows=97900 width=8)
        Sort Key: "number"
            ->  Seq Scan on "tablex"  (cost=0.00..307137.34 rows=97900 width=8)
            Filter: (("Date" >= '2004-06-28'::date) AND ("Date" <= '2004-07-04'::date) AND ("Time" >= '00:00:00'::time
withouttime zone) AND ("Time" <= '19:01:00'::time without time zone)) 

Basically, the difference is in upper "Time" value (as you can see, it's
18:01:00 in the first query and 19:01:00 in the other one).
The question is - why does it use index in first case and
it tries to do full sequential scan when the upper "Time" value
is different?

DateTimeIndex was created on both columns (Date/Time):
CREATE INDEX "DateTimeIndex" ON "tablex" USING btree ("Date", "Time");

--
wr

Re: Seq scan vs. Index scan with different query conditions

From
Richard Huxton
Date:
eleven@ludojad.itpp.pl wrote:

> ->  Index Scan using "DateTimeIndex" on "tablex"  (cost=0.00..298272.66 rows=89903 width=8)

> ->  Seq Scan on "tablex"  (cost=0.00..307137.34 rows=97900 width=8)

> Basically, the difference is in upper "Time" value (as you can see, it's
> 18:01:00 in the first query and 19:01:00 in the other one).
> The question is - why does it use index in first case and
> it tries to do full sequential scan when the upper "Time" value
> is different?

Look at the rows, and more importantly the cost. PG thinks the cost in
the second case (seq scan) is only slightly more than in the first case
(index), so presumably the index scan worked out more expensive.

You can test this by issuing "SET ENABLE_SEQSCAN=OFF;" and re-running
the second explain.

Now, the question is whether PG is right in these cost estimates. You'll
need to run "EXPLAIN ANALYSE" rather than just EXPLAIN to see what it
actually costs.

PS - all the usual questions: make sure you've vacuumed, have you read
the tuning document on varlena.com?

--
   Richard Huxton
   Archonet Ltd

Re: Seq scan vs. Index scan with different query

From
Andrew McMillan
Date:
On Mon, 2004-07-05 at 12:15 +0200, eleven@ludojad.itpp.pl wrote:
> Hello,
>
> Can anybody suggest any hint on this:
>
> temp=> EXPLAIN SELECT DISTINCT "number" FROM "tablex" WHERE "Date" BETWEEN '2004-06-28'::date AND '2004-07-04'::date
AND"Time" BETWEEN '00:00:00'::time AND '18:01:00'::time; 
>
> Unique  (cost=305669.92..306119.43 rows=89 width=8)
>     ->  Sort  (cost=305669.92..305894.67 rows=89903 width=8)
>         Sort Key: "number"
>             ->  Index Scan using "DateTimeIndex" on "tablex"  (cost=0.00..298272.66 rows=89903 width=8)
>                 Index Cond: (("Date" >= '2004-06-28'::date) AND ("Date" <= '2004-07-04'::date) AND ("Time" >=
'00:00:00'::timewithout time zone) AND ("Time" <= '18:01:00'::time without time zone)) 
>
>
> temp=> EXPLAIN SELECT DISTINCT "number" FROM "tablex" WHERE "Date" BETWEEN '2004-06-28'::date AND '2004-07-04'::date
AND"Time" BETWEEN '00:00:00'::time AND '19:01:00'::time; 
>
> Unique  (cost=315252.77..315742.27 rows=97 width=8)
>     ->  Sort  (cost=315252.77..315497.52 rows=97900 width=8)
>         Sort Key: "number"
>             ->  Seq Scan on "tablex"  (cost=0.00..307137.34 rows=97900 width=8)
>             Filter: (("Date" >= '2004-06-28'::date) AND ("Date" <= '2004-07-04'::date) AND ("Time" >=
'00:00:00'::timewithout time zone) AND ("Time" <= '19:01:00'::time without time zone)) 
>
> Basically, the difference is in upper "Time" value (as you can see, it's
> 18:01:00 in the first query and 19:01:00 in the other one).
> The question is - why does it use index in first case and
> it tries to do full sequential scan when the upper "Time" value
> is different?
>
> DateTimeIndex was created on both columns (Date/Time):
> CREATE INDEX "DateTimeIndex" ON "tablex" USING btree ("Date", "Time");

PostgreSQL is always going to switch at some point, where the number of
rows that have to be read from the table exceed some percentage of the
total rows in the table.

We can possibly be more helpful if you send EXPLAIN ANALYZE, rather than
just EXPLAIN.

A few things to be careful of:

- Is this supposed to be a slice of midnight to 6pm, for each day
between 28 June and 4 July?  If you want a continuous period from
Midnight 28 June -> 6pm 4 July you're better to have a single timestamp
field.

- It is unlikely that the , "Time" on your index is adding much to your
selectivity, and it may be that you would be better off without it.

- the DISTINCT can screw up your results, and it usually means that the
SQL is not really the best it could be.  A _real_ need for DISTINCT is
quite rare in my experience, and from what I have seen it adds overhead
and tends to encourage bad query plans when used unnecessarily.

Hope this is some help.

Regards,
                    Andrew McMillan

-------------------------------------------------------------------------
Andrew @ Catalyst .Net .NZ  Ltd,  PO Box 11-053, Manners St,  Wellington
WEB: http://catalyst.net.nz/            PHYS: Level 2, 150-154 Willis St
DDI: +64(4)803-2201      MOB: +64(272)DEBIAN      OFFICE: +64(4)499-2267
      Make things as simple as possible, but no simpler -- Einstein
-------------------------------------------------------------------------

Attachment

Re: Seq scan vs. Index scan with different query conditions

From
eleven@ludojad.itpp.pl
Date:
On Mon, Jul 05, 2004 at 11:44:13PM +1200, Andrew McMillan wrote:

> > DateTimeIndex was created on both columns (Date/Time):
> > CREATE INDEX "DateTimeIndex" ON "tablex" USING btree ("Date", "Time");
> PostgreSQL is always going to switch at some point, where the number of
> rows that have to be read from the table exceed some percentage of the
> total rows in the table.
> We can possibly be more helpful if you send EXPLAIN ANALYZE, rather than
> just EXPLAIN.

Unfortunately that seq scan vs. index scan
heuristic was wrong - full scan kills the machine
in no time due to large amount of INSERTs happening
in the background (I/O bottleneck).

> - Is this supposed to be a slice of midnight to 6pm, for each day
> between 28 June and 4 July?  If you want a continuous period from
> Midnight 28 June -> 6pm 4 July you're better to have a single timestamp
> field.
> - It is unlikely that the , "Time" on your index is adding much to your
> selectivity, and it may be that you would be better off without it.

Yes, we've figured out that index on Date + Time is rather useless.
Thanks for the tip, we've created index upon Date column instead and
it should be enough.

> - the DISTINCT can screw up your results, and it usually means that the
> SQL is not really the best it could be.  A _real_ need for DISTINCT is
> quite rare in my experience, and from what I have seen it adds overhead
> and tends to encourage bad query plans when used unnecessarily.

What do you mean? The reason for which there's DISTINCT in that query is
because I want to know how many unique rows is in the table.
Do you suggest selecting all rows and doing "DISTINCT"/counting
on the application level?

--
11.

Re: Seq scan vs. Index scan with different query

From
Andrew McMillan
Date:
On Mon, 2004-07-05 at 15:46 +0200, eleven@ludojad.itpp.pl wrote:
> On Mon, Jul 05, 2004 at 11:44:13PM +1200, Andrew McMillan wrote:
>
> > > DateTimeIndex was created on both columns (Date/Time):
> > > CREATE INDEX "DateTimeIndex" ON "tablex" USING btree ("Date", "Time");
> > PostgreSQL is always going to switch at some point, where the number of
> > rows that have to be read from the table exceed some percentage of the
> > total rows in the table.
> > We can possibly be more helpful if you send EXPLAIN ANALYZE, rather than
> > just EXPLAIN.
>
> Unfortunately that seq scan vs. index scan
> heuristic was wrong - full scan kills the machine
> in no time due to large amount of INSERTs happening
> in the background (I/O bottleneck).

In that case you could perhaps consider tweaking various parameters in
your postgresql.conf - with an ideal setup the switch should happen when
the costs are roughly equal.

Have you gone through the information here:
http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html
http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html

Also, if table rows are regularly DELETEd or UPDATEd then you will need
to ensure it is regularly vacuumed.  Does a "VACUUM VERBOSE tablex" show
a large number of dead tuples?  Are you running pg_autovacuum?  Do you
get similar results immediately after a "VACUUM FULL ANALYZE tablex"?

Possibly there is an uneven distribution of rows in the table.  You
could consider increasing the statistics target:
ALTER TABLE tablex ALTER COLUMN "Date" SET STATISTICS;
ANALYZE tablex;


> > - Is this supposed to be a slice of midnight to 6pm, for each day
> > between 28 June and 4 July?  If you want a continuous period from
> > Midnight 28 June -> 6pm 4 July you're better to have a single timestamp
> > field.
> > - It is unlikely that the , "Time" on your index is adding much to your
> > selectivity, and it may be that you would be better off without it.
>
> Yes, we've figured out that index on Date + Time is rather useless.
> Thanks for the tip, we've created index upon Date column instead and
> it should be enough.

It may be that you are better with a single timestamp column with an
index on it in any case, if you want the data sorted in timestamp order.
Then you can ORDER BY <timestamp> as well, which will encourage the
index use further (although this advantage tends to get lost with the
DISTINCT).  You can still access the time part for a separate comparison
just with a cast.


> > - the DISTINCT can screw up your results, and it usually means that the
> > SQL is not really the best it could be.  A _real_ need for DISTINCT is
> > quite rare in my experience, and from what I have seen it adds overhead
> > and tends to encourage bad query plans when used unnecessarily.
>
> What do you mean? The reason for which there's DISTINCT in that query is
> because I want to know how many unique rows is in the table.
> Do you suggest selecting all rows and doing "DISTINCT"/counting
> on the application level?

That's fine, I've just seen it used far too many times as a substitute
for having an extra join, or an application that should only be
inserting unique rows in the first place.  Things like that.  It's just
one of those things that always sets off alarm bells when I'm reviewing
someone else's work, and on most of these occasions it has not been
justified when reexamined.

Cheers,
                    Andrew.

-------------------------------------------------------------------------
Andrew @ Catalyst .Net .NZ  Ltd,  PO Box 11-053, Manners St,  Wellington
WEB: http://catalyst.net.nz/            PHYS: Level 2, 150-154 Willis St
DDI: +64(4)803-2201      MOB: +64(272)DEBIAN      OFFICE: +64(4)499-2267
It is truth which you cannot contradict; you can without any difficulty
                      contradict Socrates. - Plato
-------------------------------------------------------------------------

Attachment