Thread: Avoiding a seq scan on a table.

Avoiding a seq scan on a table.

From
LWATCDR
Date:
I am trying to create an index that will prevent a seq scan on a table.
The query that is causing the seq scan is this: SELECT COUNT(*) FROM
issuetracking where (issue_target='david' OR manager='david') AND
date_closed=null;
Any suggestions on what index I can add that will make this not a seq scan?

Re: Avoiding a seq scan on a table.

From
LWATCDR
Date:
Thanks. Would you suggest a btree or a hash? My guess would be a hash,
since it uses an =.

On Jan 14, 2008 11:41 AM, Brian Hurt <bhurt@janestcapital.com> wrote:
>
> LWATCDR wrote:
>
> >I am trying to create an index that will prevent a seq scan on a table.
> >The query that is causing the seq scan is this: SELECT COUNT(*) FROM
> >issuetracking where (issue_target='david' OR manager='david') AND
> >date_closed=null;
> >Any suggestions on what index I can add that will make this not a seq scan?
> >
> >
> I would recommend making three indexes: one on issue_target, one on
> manager, and one on date_closed.  Postgres can then do a trick where it
> uses bitmap index scans (with one "bit" per page as to
> whether that page might have a row of interest or not), which it can
> then combine together with bitwise AND and OR.
>
> Don't forget to analyze the table after making the indexes.
>
> Brian
>
>
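The three single-column indexes suggested above might look like this (index names are illustrative; table and column names come from the original query):

```sql
-- One btree index per column used in the WHERE clause; the planner
-- can combine them with a bitmap AND/OR when that looks cheaper.
CREATE INDEX issuetracking_issue_target_idx ON issuetracking (issue_target);
CREATE INDEX issuetracking_manager_idx      ON issuetracking (manager);
CREATE INDEX issuetracking_date_closed_idx  ON issuetracking (date_closed);

-- Refresh planner statistics so the new indexes get considered.
ANALYZE issuetracking;
```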

Re: Avoiding a seq scan on a table.

From
"Sean Davis"
Date:


On Jan 14, 2008 11:45 AM, LWATCDR <lwatcdr@gmail.com> wrote:
Thanks. Would you suggest a btree or a hash? My guess would be a hash,
since it uses an =.

You can pretty much ignore hash indexes in Postgres.  They are, in nearly every case (every case that I know of), slower than btree.  Just make the indexes using the default indexing scheme.  Again, do not forget to analyze the table after creating the indexes.
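For reference, CREATE INDEX builds a btree unless an access method is named explicitly; a hash index has to be requested with USING hash (index names here are illustrative):

```sql
CREATE INDEX issuetracking_manager_btree ON issuetracking (manager);             -- default: btree
CREATE INDEX issuetracking_manager_hash  ON issuetracking USING hash (manager);  -- explicit hash
```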

Sean


Re: Avoiding a seq scan on a table.

From
Brian Hurt
Date:
LWATCDR wrote:

>Thanks. Would you suggest a btree or a hash? My guess would be a hash,
>since it uses an =.
>
>
I only use btrees, so I can't speak to that.

Brian


Re: Avoiding a seq scan on a table.

From
LWATCDR
Date:
Really? From what I have done writing my own code, I have found
hashing to be faster than a btree, but then my own hashing was built
for a specific type of key.
Anyway, I put in the btree indexes and I am still getting a seq scan.

Aggregate  (cost=12.12..12.13 rows=1 width=0)
  ->  Result  (cost=0.00..12.12 rows=1 width=0)
        One-Time Filter: NULL::boolean
        ->  Seq Scan on issuetracking  (cost=0.00..12.12 rows=1 width=0)
              Filter: (((issue_target)::text = 'david'::text) OR
((manager)::text = 'david'::text))


On Jan 14, 2008 11:54 AM, Sean Davis <sdavis2@mail.nih.gov> wrote:
>
>
>
> On Jan 14, 2008 11:45 AM, LWATCDR <lwatcdr@gmail.com> wrote:
> > Thanks. Would you suggest a btree or a hash? My guess would be a hash,
> > since it uses an =.
> >
>
> You can pretty much ignore hash indexes in Postgres.  They are, in nearly
> every case (every case that I know of), slower than btree.  Just make the
> indexes using the default indexing scheme.  Again, do not forget to analyze
> the table after creating the indexes.
>
> Sean
>
>
>

Re: Avoiding a seq scan on a table.

From
"Daniel T. Staal"
Date:
On Mon, January 14, 2008 12:14 pm, LWATCDR wrote:
> Really? From what I have done writing my own code, I have found
> hashing to be faster than a btree, but then my own hashing was built
> for a specific type of key.
> Anyway, I put in the btree indexes and I am still getting a seq scan.
>
> Aggregate  (cost=12.12..12.13 rows=1 width=0)
>   ->  Result  (cost=0.00..12.12 rows=1 width=0)
>         One-Time Filter: NULL::boolean
>         ->  Seq Scan on issuetracking  (cost=0.00..12.12 rows=1 width=0)
>               Filter: (((issue_target)::text = 'david'::text) OR
> ((manager)::text = 'david'::text))

Based on that cost, a sequential scan is probably the fastest option
here: it's such a small dataset that fetching the index and working with
it before going back and fetching the data is just overkill.

When you add a few dozen more rows or so, it'll switch to using the index.

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------


Re: Avoiding a seq scan on a table.

From
"Sean Davis"
Date:


On Jan 14, 2008 12:14 PM, LWATCDR <lwatcdr@gmail.com> wrote:
Really? From what I have done writing my own code, I have found
hashing to be faster than a btree, but then my own hashing was built
for a specific type of key.
Anyway, I put in the btree indexes and I am still getting a seq scan.

Aggregate  (cost=12.12..12.13 rows=1 width=0)
 ->  Result  (cost=0.00..12.12 rows=1 width=0)
       One-Time Filter: NULL::boolean
       ->  Seq Scan on issuetracking  (cost=0.00..12.12 rows=1 width=0)
             Filter: (((issue_target)::text = 'david'::text) OR
((manager)::text = 'david'::text))

The Postgres planner will choose what it thinks is the fastest plan.  In this case, your table has only 1 row (?), so sequential scan will be fastest.  You will want to load data into your table before doing benchmarking.

Sean


Re: Avoiding a seq scan on a table.

From
Alan Hodgson
Date:
On Monday 14 January 2008, LWATCDR <lwatcdr@gmail.com> wrote:
> Really? From what I have done writing my own code, I have found
> hashing to be faster than a btree, but then my own hashing was built
> for a specific type of key.
> Anyway, I put in the btree indexes and I am still getting a seq scan.
>

> Seq Scan on issuetracking  (cost=0.00..12.12 rows=1 width=0)

The planner will always choose a seq scan when your table size is very
small.

Re: Avoiding a seq scan on a table.

From
Brian Hurt
Date:
LWATCDR wrote:

>Really? From what I have done writing my own code, I have found
>hashing to be faster than a btree, but then my own hashing was built
>for a specific type of key.
>Anyway, I put in the btree indexes and I am still getting a seq scan.
>
>Aggregate  (cost=12.12..12.13 rows=1 width=0)
>  ->  Result  (cost=0.00..12.12 rows=1 width=0)
>        One-Time Filter: NULL::boolean
>        ->  Seq Scan on issuetracking  (cost=0.00..12.12 rows=1 width=0)
>              Filter: (((issue_target)::text = 'david'::text) OR
>((manager)::text = 'david'::text))
>
>
>
>
For very small tables, Postgres will skip reading the indexes, because
it's not worth it.  Postgres thinks it's only going to have to read 12
pages or so, at which point it'll likely have to read all the pages
anyway, so why also read the index?
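One way to check whether the indexes *can* be used at all, regardless of what the planner prefers on a tiny table, is to temporarily discourage sequential scans for the session (a diagnostic trick, not a production setting):

```sql
SET enable_seqscan = off;  -- planner avoids seq scans if any other plan exists
EXPLAIN SELECT COUNT(*) FROM issuetracking
WHERE issue_target = 'david' OR manager = 'david';
RESET enable_seqscan;      -- restore the default afterwards
```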

Brian


Re: Avoiding a seq scan on a table.

From
LWATCDR
Date:
That is very odd, since that table has 141 records in it.

Here is a different query that I ran:
SELECT COUNT(*) FROM rma where terminatedate is NULL;
This returns a value of 254 for the count, but this is what I get from explain.

Aggregate  (cost=219.77..219.78 rows=1 width=0)
  ->  Seq Scan on rma  (cost=0.00..219.11 rows=264 width=0)
        Filter: (terminatedate IS NULL)
This says that rows =1 but returns 254 rows of data?
The table contains over 7000 rows.


On Jan 14, 2008 12:22 PM, Daniel T. Staal <DStaal@usa.net> wrote:
>
> On Mon, January 14, 2008 12:14 pm, LWATCDR wrote:
> > Really? From what I have done writing my own code, I have found
> > hashing to be faster than a btree, but then my own hashing was built
> > for a specific type of key.
> > Anyway, I put in the btree indexes and I am still getting a seq scan.
> >
> > Aggregate  (cost=12.12..12.13 rows=1 width=0)
> >   ->  Result  (cost=0.00..12.12 rows=1 width=0)
> >         One-Time Filter: NULL::boolean
> >         ->  Seq Scan on issuetracking  (cost=0.00..12.12 rows=1 width=0)
> >               Filter: (((issue_target)::text = 'david'::text) OR
> > ((manager)::text = 'david'::text))
>
> Based on that cost, a sequential scan is probably the fastest option
> here: it's such a small dataset that fetching the index and working with
> it before going back and fetching the data is just overkill.
>
> When you add a few dozen more rows or so, it'll switch to using the index.
>
> Daniel T. Staal
>

Re: Avoiding a seq scan on a table.

From
"Daniel T. Staal"
Date:
On Mon, January 14, 2008 12:35 pm, LWATCDR wrote:
> that is very odd since that table has 141 records in it.
>
> here is a different query that I ran.
> SELECT COUNT(*) FROM rma where terminatedate is NULL;
> This returns a value of 254 for the count but this is what I get from
> explain.
>
> Aggregate  (cost=219.77..219.78 rows=1 width=0)
>   ->  Seq Scan on rma  (cost=0.00..219.11 rows=264 width=0)
>         Filter: (terminatedate IS NULL)
> This says that rows =1 but returns 254 rows of data?
> The table contains over 7000 rows.

Spend some time reading this page:
http://www.postgresql.org/docs/8.2/interactive/using-explain.html

This query returns one row: the _count_ of the number of rows.  That
count it gets from a sequential scan, which returns approximately 264
rows.  (EXPLAIN isn't running the query itself; it is only looking at
the statistics it has, telling you what it _would_ do, and estimating
costs based on that.)  That scan has a filter on it: your initial
condition.

In reality, when you run the query, the scan will return 254 rows, and
the 'count' aggregate operation will then be applied to those.

In your original, the cost was ~12.  That's a very low cost, really.  It
is unlikely any index plan would beat that, regardless of the database.
The above query would likely be sped up somewhat by an index, but by how
much is an open question.

A sequential scan is computationally cheap, and of predictable disk cost.
An index scan is computationally much more expensive, has a somewhat
unpredictable disk cost, and has a non-zero startup cost.  The planner
will prefer the first unless it is sure the second will do better.
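To see actual row counts next to the planner's estimates, EXPLAIN ANALYZE runs the query for real and reports both:

```sql
-- "rows=" in the cost parentheses is the planner's estimate;
-- "actual ... rows=" shows what the scan really returned.
EXPLAIN ANALYZE SELECT COUNT(*) FROM rma WHERE terminatedate IS NULL;
```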

Daniel T. Staal



Re: Avoiding a seq scan on a table.

From
Tom Lane
Date:
LWATCDR <lwatcdr@gmail.com> writes:
> Anyway I put in the tree indexes and I am still getting a seq scan.

> Aggregate  (cost=12.12..12.13 rows=1 width=0)
>   ->  Result  (cost=0.00..12.12 rows=1 width=0)
>         One-Time Filter: NULL::boolean
>         ->  Seq Scan on issuetracking  (cost=0.00..12.12 rows=1 width=0)
>               Filter: (((issue_target)::text = 'david'::text) OR
> ((manager)::text = 'david'::text))

You should worry about making the query correct before you worry about
making it fast.  That constant false one-time filter is a red flag...
looks like you forgot the difference between "IS NULL" and "= NULL".
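A corrected version of the original query, with the NULL test rewritten (nothing compares equal to NULL with =, so date_closed = NULL is always unknown and filters out every row):

```sql
SELECT COUNT(*) FROM issuetracking
WHERE (issue_target = 'david' OR manager = 'david')
  AND date_closed IS NULL;  -- IS NULL, not = NULL
```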

            regards, tom lane

Re: Avoiding a seq scan on a table.

From
LWATCDR
Date:
Are you saying that it should be = NULL and not IS NULL?

SELECT COUNT(*) FROM rma where terminatedate is NULL; which is what is
in my code, returns a count of 254, which is the correct value.

When I changed it to SELECT COUNT(*) FROM rma where terminatedate = NULL,
the select returns no rows.
So which one is correct?
I thought IS NULL was correct; if not, then what is the correct syntax?


On Jan 14, 2008 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> LWATCDR <lwatcdr@gmail.com> writes:
> > Anyway I put in the tree indexes and I am still getting a seq scan.
>
> > Aggregate  (cost=12.12..12.13 rows=1 width=0)
> >   ->  Result  (cost=0.00..12.12 rows=1 width=0)
> >         One-Time Filter: NULL::boolean
> >         ->  Seq Scan on issuetracking  (cost=0.00..12.12 rows=1 width=0)
> >               Filter: (((issue_target)::text = 'david'::text) OR
> > ((manager)::text = 'david'::text))
>
> You should worry about making the query correct before you worry about
> making it fast.  That constant false one-time filter is a red flag...
> looks like you forgot the difference between "IS NULL" and "= NULL".
>
>                         regards, tom lane
>