Thread: Using index for IS NULL query

Using index for IS NULL query

From
"Andrus"
Date:
Index is not used for

 is null

condition:

create index  makse_dokumnr_idx on makse(dokumnr);
explain select
     sum( summa)
   from MAKSE
   where  dokumnr is null

"Aggregate  (cost=131927.95..131927.96 rows=1 width=10)"
"  ->  Seq Scan on makse  (cost=0.00..131927.94 rows=1 width=10)"
"        Filter: (dokumnr IS NULL)"



Table makse contains 1200000 rows and only about 800 rows with dokumnr is
null, so using the index is much faster than a seq scan.
How to fix?

Andrus.

"PostgreSQL 8.1.4 on i686-pc-linux-gnu, compiled by GCC
i686-pc-linux-gnu-gcc (GCC) 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0,
pie-8.7.9)"


Re: Using index for IS NULL query

From
Tom Lane
Date:
"Andrus" <kobruleht2@hot.ee> writes:
> Index is not used for
>  is null

> How to fix ?

Update to something newer than 8.1 (specifically, you'll need 8.3).

            regards, tom lane

Re: Using index for IS NULL query

From
Andreas Kretschmer
Date:
Andrus <kobruleht2@hot.ee> schrieb:

> Index is not used for
>
> is null
>
> condition:
>
> create index  makse_dokumnr_idx on makse(dokumnr);
> explain select
>     sum( summa)
>   from MAKSE
>   where  dokumnr is null
>
> "Aggregate  (cost=131927.95..131927.96 rows=1 width=10)"
> "  ->  Seq Scan on makse  (cost=0.00..131927.94 rows=1 width=10)"
> "        Filter: (dokumnr IS NULL)"
>
>
>
> Table makse contains 1200000 rows and only about 800 rows with dokumnr is
> null, so using the index is much faster than a seq scan.
> How to fix?

Create a partial index like below:

test=# create table foo ( i float);
CREATE TABLE
Time: 1.138 ms
test=*# insert into foo select random() from generate_series(1,1000000);
INSERT 0 1000000
test=*# insert into foo values (NULL);
INSERT 0 1
test=*# create index idx_foo on foo(i) where i is null;
CREATE INDEX
test=*# explain analyse select * from foo where i is null;
                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=5.51..4690.89 rows=5000 width=8) (actual time=0.037..0.038 rows=1 loops=1)
   Recheck Cond: (i IS NULL)
   ->  Bitmap Index Scan on idx_foo  (cost=0.00..4.26 rows=5000 width=0) (actual time=0.033..0.033 rows=1 loops=1)
         Index Cond: (i IS NULL)
 Total runtime: 0.068 ms
(5 rows)


Maybe there are other solutions...


Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect.                              (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly."   (unknown)
Kaufbach, Saxony, Germany, Europe.              N 51.05082°, E 13.56889°

Re: Using index for IS NULL query

From
Andreas Kretschmer
Date:
Tom Lane <tgl@sss.pgh.pa.us> schrieb:

> "Andrus" <kobruleht2@hot.ee> writes:
> > Index is not used for
> >  is null
>
> > How to fix ?
>
> Update to something newer than 8.1 (specifically, you'll need 8.3).

Right. For my example in the other mail:

test=*# create index idx_foo on foo(i);
CREATE INDEX
test=*# explain analyse select * from foo where i is null;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=95.11..4780.49 rows=5000 width=8) (actual time=0.052..0.053 rows=1 loops=1)
   Recheck Cond: (i IS NULL)
   ->  Bitmap Index Scan on idx_foo  (cost=0.00..93.86 rows=5000 width=0) (actual time=0.047..0.047 rows=1 loops=1)
         Index Cond: (i IS NULL)
 Total runtime: 0.076 ms



Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect.                              (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly."   (unknown)
Kaufbach, Saxony, Germany, Europe.              N 51.05082°, E 13.56889°

Re: Using index for IS NULL query

From
Tomas Vondra
Date:
> Index is not used for
>
> is null
>
> condition:
>
> create index  makse_dokumnr_idx on makse(dokumnr);
> explain select
>     sum( summa)
>   from MAKSE
>   where  dokumnr is null
>
> "Aggregate  (cost=131927.95..131927.96 rows=1 width=10)"
> "  ->  Seq Scan on makse  (cost=0.00..131927.94 rows=1 width=10)"
> "        Filter: (dokumnr IS NULL)"
>
>
> Table makse contains 1200000 rows and only about 800 rows with dokumnr is
> null, so using the index is much faster than a seq scan.
> How to fix?

Yes, NULL values are not stored in the index, but you may create
functional index on

(CASE WHEN dokumnr IS NULL THEN -1 ELSE dokumnr END)

and then use the same expression in the WHERE clause. You may replace
the '-1' value by something that's not used in the dokumnr column.
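
For illustration, the workaround Tomas describes might look like this against
the table from the original post (the index name is made up, and -1 is assumed
to be a value that never occurs in dokumnr):

```sql
-- Expression index mapping NULL to a sentinel value (-1 assumed unused):
create index makse_dokumnr_case_idx
    on makse ((case when dokumnr is null then -1 else dokumnr end));

-- The query must repeat the exact same expression for the index to match:
select sum(summa)
  from makse
 where (case when dokumnr is null then -1 else dokumnr end) = -1;
```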

regards
Tomas

Re: Using index for IS NULL query

From
"Vladimir Sitnikov"
Date:

> Yes, NULL values are not stored in the index, but you may create
> functional index on

Are you sure NULL values are not stored? btree, gist and bitmap can both
index and search for NULL values:

select amname, amindexnulls, amsearchnulls from pg_am;

 amname | amindexnulls | amsearchnulls
--------+--------------+---------------
 btree  | t            | t
 hash   | f            | f
 gist   | t            | t
 gin    | f            | f
 bitmap | t            | t
(5 rows)


Sincerely yours,
Vladimir Sitnikov

Re: Using index for IS NULL query

From
"Scott Marlowe"
Date:
On Tue, Nov 11, 2008 at 4:00 PM, Vladimir Sitnikov
<sitnikov.vladimir@gmail.com> wrote:
>
>> Yes, NULL values are not stored in the index, but you may create
>> functional index on
>
> Are you sure NULL values are not stored? btree, gist and bitmap can both
> index and search for NULL values.

It's not that they're not stored, it's that before 8.3 pg didn't know
how to compare to them I believe.  The standard trick was to create a
partial index with "where x is null" on the table / column.  8.3 knows
how to compare them and doesn't need the partial index.
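
Applied to the table from the original post, the pre-8.3 trick Scott mentions
would be a sketch along these lines (index name made up):

```sql
-- Partial index covering only the ~800 NULL rows, so it stays tiny
-- and is cheap to maintain:
create index makse_dokumnr_null_idx
    on makse (dokumnr)
 where dokumnr is null;

-- The planner can satisfy the predicate via the index's WHERE clause:
select sum(summa) from makse where dokumnr is null;
```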

Re: Using index for IS NULL query

From
Matthew Wakeling
Date:
On Tue, 11 Nov 2008, Tom Lane wrote:
>> Index is not used for
>>  is null
>> How to fix ?
>
> Update to something newer than 8.1 (specifically, you'll need 8.3).

Oooh, that's useful to know. We can get rid of all our extra nulls
indexes. Thanks.

Matthew

--
As you approach the airport, you see a sign saying "Beware - low
flying airplanes". There's not a lot you can do about that. Take
your hat off?                                  -- Michael Flanders

Index usage with sub select or inner joins

From
"Julien Theulier"
Date:
Hello,

I am doing some performances testing on Postgres & I discovered the
following behavior, when using 2 different ways of writing selects (but
doing the same aggregations at the end):
1. test case 1, using outer join:
create table test2 as
select
soj_session_log_id, pv_timestamp, vi_pv_id,a.item_id,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '3 day') then 1 else 0 end)) as recent_sales_3d1,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '7 day') then 1 else 0 end)) as recent_sales_7d1,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '14 day') then 1 else 0 end)) as recent_sales_14d1,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '30 day') then 1 else 0 end)) as recent_sales_30d1,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '60 day') then 1 else 0 end)) as recent_sales_60d1
from bm_us_views_main_1609 a
left outer join bm_us_bids b on (b.item_id=a.item_id and
b.bid_date<a.pv_timestamp and (b.bid_date>=a.pv_timestamp - INTERVAL '60
day'))
where a.item_type in (7,9) and qty>1
group by soj_session_log_id, pv_timestamp, vi_pv_id, a.item_id;

This query doesn't use any index according to the explain plan:
"HashAggregate  (cost=672109.07..683054.81 rows=182429 width=49)"
"  ->  Merge Left Join  (cost=646489.83..668004.42 rows=182429 width=49)"
"        Merge Cond: (a.item_id = b.item_id)"
"        Join Filter: ((b.bid_date < a.pv_timestamp) AND (b.bid_date >=
(a.pv_timestamp - '60 days'::interval)))"
"        ->  Sort  (cost=331768.62..332224.69 rows=182429 width=41)"
"              Sort Key: a.item_id"
"              ->  Seq Scan on bm_us_views_main_1609 a
(cost=0.00..315827.08 rows=182429 width=41)"
"                    Filter: ((item_type = ANY ('{7,9}'::numeric[])) AND
(qty > 1))"
"        ->  Sort  (cost=314669.01..320949.52 rows=2512205 width=19)"
"              Sort Key: b.item_id"
"              ->  Seq Scan on bm_us_bids b  (cost=0.00..47615.05
rows=2512205 width=19)"

2. Test case 2, using sub queries:
create table test2 as
select
soj_session_log_id, pv_timestamp, vi_pv_id,item_id,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '3 day' group by
item_id ),0) as recent_sales_3d,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '7 day' group by
item_id ),0) as recent_sales_7d,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '14 day' group
by item_id ),0) as recent_sales_14d,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '30 day' group
by item_id ),0) as recent_sales_30d,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '60 day' group
by item_id ),0) as recent_sales_60d
from bm_us_views_main_1609 a
where item_type in (7,9) and qty>1;

This query uses indexes according to the explain plan:
"Seq Scan on bm_us_views_main_1609 a  (cost=0.00..8720230.77 rows=182429
width=41)"
"  Filter: ((item_type = ANY ('{7,9}'::numeric[])) AND (qty > 1))"
"  SubPlan"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '60 days'::interval)))"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '30 days'::interval)))"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '14 days'::interval)))"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '7 days'::interval)))"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '3 days'::interval)))"

The index bm_us_bids_item_ix is on columns item_id, bidder_id, bid_date


QUESTION: Why does the planner choose a seq scan in the first case and index
scans in the second case? More generally, I have observed that the planner has
difficulty selecting index scans and does a seq scan in almost all cases
when doing join queries. After investigation, it looks like when you join
table a with table b on columns x and y and you have an index on column x
only, the planner is not able to choose the index scan. You have to build
the index corresponding exactly to the join condition between the 2 tables.
For example, by creating a new index on item_id and bid_date, the planner
was able to choose this last index in both cases. Would it be possible for
the planner to choose the closest available index in all cases for queries
having outer joins?
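
For reference, the two index definitions being compared might be sketched as
follows (reconstructed from the descriptions above, so treat the exact names
and column lists as assumptions):

```sql
-- Existing index: bid_date is only the third column, behind bidder_id,
-- which does not appear in the join condition at all:
create index bm_us_bids_item_ix
    on bm_us_bids (item_id, bidder_id, bid_date);

-- New index matching the join condition exactly
-- (item_id equality first, then the bid_date range):
create index bm_us_bids_item_date_ix
    on bm_us_bids (item_id, bid_date);
```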

Last thing: I am running Postgres 8.3.4 on a Windows laptop with 3.5GB
RAM, a 161GB disk and a dual-core 2.5GHz processor.

Regards,
Julien Theulier


Re: Index usage with sub select or inner joins

From
Joshua Tolley
Date:
On Wed, Nov 12, 2008 at 02:22:47PM +0100, Julien Theulier wrote:
> QUESTION: Why the planner choose seq scan in the first case & indexes scan
> in the second case? In a more general way, I observed that the planner has
> difficulties to select index scans & does in almost all the cases seq scan,
> when doing join queries. After investigations, it looks like when you join
> table a with table b on a column x and y and you have an index on column x
> only, the planner is not able to choose the index scan. You have to build
> the index corresponding exactly to the join statement btw the 2 tables

Short, general answer: index scans aren't always faster than sequential
scans, and the planner is smart enough to know that. Googling "Why isn't
postgresql using my index" provides more detailed results, but in short,
if it scans an index, it has to read pages from the index, and for all
the tuples it finds in the index, it has to read once again from the
heap, whereas a sequential scan requires reading once from the heap. If
your query will visit most of the rows of the table, pgsql will choose a
sequential scan over an index scan.
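
A quick, if blunt, way to see the plan the planner is rejecting is to disable
seq scans for one session and compare the EXPLAIN ANALYZE output. This is a
diagnostic setting only, not something to leave on in production:

```sql
-- Session-local experiment: force consideration of the index plan.
set enable_seqscan = off;

explain analyze
select ...;  -- the query under test

-- Restore the default before doing anything else in this session:
set enable_seqscan = on;
```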

- Josh / eggyknap


Re: Index usage with sub select or outer joins

From
"Julien Theulier"
Date:
Hello, Joshua,

I did different test cases and here are the results (times in seconds), with
the index used (sub queries case) or not (outer join case):

Rows (main table)    Outer join    Sub queries    Setting
1396163 rows         39.2          19.6           work_mem=256MB
3347443 rows         72.2          203.1          work_mem=256MB
3347443 rows         70.3          31.1           work_mem=1024MB
4321072 rows         115           554.9          work_mem=256MB
4321072 rows         111           583            work_mem=1024MB

All outer joins were done without index use.

To force the use of the index in the first case (outer join), I have changed
the seq scan cost (from 1 to 2.5); it now takes only 6.1s for the outer join
on 1.4M rows. New explain plan below:
"HashAggregate  (cost=457881.84..460248.84 rows=39450 width=49)"
"  ->  Nested Loop Left Join  (cost=0.00..456994.22 rows=39450 width=49)"
"        ->  Seq Scan on bm_us_views_main_2608 a  (cost=0.00..223677.45
rows=39450 width=41)"
"              Filter: ((item_type = ANY ('{7,9}'::numeric[])) AND (qty >
1))"
"        ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..5.65 rows=13 width=19)"
"              Index Cond: ((b.item_id = a.item_id) AND (b.bid_date <
a.pv_timestamp) AND (b.bid_date >= (a.pv_timestamp - '60 days'::interval)))"

Index bm_us_bids_item_ix is on item_id, bidder_id (not used in the
condition) & bid_date

What are the recommendations for tuning the different cost settings so that
the planner better estimates seq scan and index scan costs? I think the issue
is there, but I didn't find any figures to help choose the correct parameters
according to CPU and disk speed.
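
A few planner settings usually drive this trade-off; as a hedged starting
point (the parameter names are real 8.3 settings, the values here are only
illustrative):

```sql
-- Defaults: seq_page_cost = 1.0, random_page_cost = 4.0.
-- When most of the working set is cached, random I/O is not really 4x
-- dearer than sequential, so lowering random_page_cost is the usual
-- first experiment (per session, before changing postgresql.conf):
set random_page_cost = 2.0;

-- Tell the planner how much of the database the OS cache can hold;
-- a larger value makes index scans look cheaper:
set effective_cache_size = '2GB';
```

Changing the actual seq scan cost, as done above, tends to distort every
other plan; adjusting random_page_cost and effective_cache_size is the more
targeted experiment.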

Regards,
Julien Theulier

-----Original Message-----
From: Joshua Tolley [mailto:eggyknap@gmail.com]
Sent: Wednesday, November 12, 2008 14:54
To: Julien Theulier
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Index usage with sub select or inner joins

On Wed, Nov 12, 2008 at 02:22:47PM +0100, Julien Theulier wrote:
> QUESTION: Why the planner choose seq scan in the first case & indexes
> scan in the second case? In a more general way, I observed that the
> planner has difficulties to select index scans & does in almost all
> the cases seq scan, when doing join queries. After investigations, it
> looks like when you join table a with table b on a column x and y and
> you have an index on column x only, the planner is not able to choose
> the index scan. You have to build the index corresponding exactly to
> the join statement btw the 2 tables

Short, general answer: index scans aren't always faster than sequential
scans, and the planner is smart enough to know that. Googling "Why isn't
postgresql using my index" provides more detailed results, but in short, if
it scans an index, it has to read pages from the index, and for all the
tuples it finds in the index, it has to read once again from the heap,
whereas a sequential scan requires reading once from the heap. If your query
will visit most of the rows of the table, pgsql will choose a sequential
scan over an index scan.

- Josh / eggyknap


Re: Using index for IS NULL query

From
Tomas Vondra
Date:
Well, you're obviously right - I didn't know this. I guess I've found
that the index is not used for null values, and deduced somehow that
NULL values are not stored in the index.

Thanks, it's nice to find out a 'bug' before it's too late :-)

regards
Tomas

> Are you sure NULL values are not stored? btree, gist and bitmap can both
> index and search for NULL values.
>
> select amname, amindexnulls, amsearchnulls from pg_am;
>
>  amname | amindexnulls | amsearchnulls
> --------+--------------+---------------
>  btree  | t            | t
>  hash   | f            | f
>  gist   | t            | t
>  gin    | f            | f
>  bitmap | t            | t
> (5 rows)
>
>
> Sincerely yours,
> Vladimir Sitnikov