Thread: Planner doesn't take indexes into account

Planner doesn't take indexes into account

From
Grzegorz Olszewski
Date:
Hi,

I wonder why planner uses Seq Scan instead of Index Scan.

Here is my table (partial):
content.contents
-------------------------+-----------------------------+-----------------------------------------------------------------
 id                      | bigint                      | niepusty domyślnie nextval('content.contents_id_seq'::regclass)
 version                 | integer                     | niepusty
 date_published          | timestamp without time zone |
 moderation_status       | character varying(50)       |
 publication_status      | character varying(30)       |

And indexes (there are some other indexes too):
    "contents_id_pkey" PRIMARY KEY, btree (id)
    "contents_date_published_idx" btree (date_published)
    "contents_moderation_status_idx" btree (moderation_status)
    "contents_publication_status_idx" btree (publication_status)

I tried also creating following indexes:
    "contents_date_published_publication_status_moderation_statu_idx" btree (date_published, publication_status, moderation_status)
    "contents_publication_status_idx1" btree ((publication_status::text))
    "contents_moderation_status_idx1" btree ((moderation_status::text))

Then for this query (genrated by Hibernate):
explain (analyze, buffers) select count(*) as y0_ from content.contents this_ inner join content.content_categories cat1_ on this_.CONTENT_CATEGORY_ID=cat1_.ID where cat1_.name in ([...])
and this_.date_published<='2014-05-26 12:23:31.557000 +02:00:00'
and (this_.PUBLICATION_STATUS is null or this_.PUBLICATION_STATUS<>'DRAFT')
and (this_.moderation_status is null or this_.moderation_status<>'DANGEROUS')
and exists(select * from content.content_visibilities cv where cv.content_id = this_.ID and cv.user_group_id in (1,2));

Planner creates such plan:
                                                                                                                                  QUERY PLAN                                                                                                                                 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=31706.84..106020.81 rows=21871 width=2076) (actual time=1197.658..6012.406 rows=430218 loops=1)
   Hash Cond: (this_.id = cv.content_id)
   Buffers: shared hit=5 read=59031 written=3, temp read=47611 written=47549
   ->  Hash Join  (cost=2.22..56618.11 rows=22881 width=2076) (actual time=0.163..1977.304 rows=430221 loops=1)
         Hash Cond: (this_.content_category_id = cat1_.id)
         Buffers: shared hit=1 read=46829 written=1
         ->  Seq Scan on contents this_  (cost=0.00..54713.92 rows=446176 width=2030) (actual time=0.048..915.724 rows=450517 loops=1)
               Filter: ((date_published <= '2014-05-26 12:23:31.557'::timestamp without time zone) AND ((publication_status IS NULL) OR ((publication_status)::text <> 'DRAFT'::text)) AND ((moderation_status IS NULL) OR ((moderation_status)::text <> 'DANGEROUS'::text)))
               Rows Removed by Filter: 50
               Buffers: shared read=46829 written=1
         ->  Hash  (cost=2.17..2.17 rows=4 width=46) (actual time=0.089..0.089 rows=4 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               Buffers: shared hit=1
               ->  Seq Scan on content_categories cat1_  (cost=0.00..2.17 rows=4 width=46) (actual time=0.053..0.076 rows=4 loops=1)
                     Filter: ((name)::text = ANY ('{przeglad-prasy/rp,przeglad-prasy/parkiet,komunikat-z-rynku-pap-emitent,komunikat-z-rynku-pap-depesze}'::text[]))
                     Rows Removed by Filter: 74
                     Buffers: shared hit=1
   ->  Hash  (cost=24435.09..24435.09 rows=443083 width=8) (actual time=1197.146..1197.146 rows=447624 loops=1)
         Buckets: 4096  Batches: 32  Memory Usage: 560kB
         Buffers: shared hit=4 read=12202 written=2, temp written=1467
         ->  Bitmap Heap Scan on content_visibilities cv  (cost=7614.55..24435.09 rows=443083 width=8) (actual time=61.034..647.729 rows=447624 loops=1)
               Recheck Cond: (user_group_id = ANY ('{1,2}'::bigint[]))
               Buffers: shared hit=4 read=12202 written=2
               ->  Bitmap Index Scan on content_visibilities_user_group_id_idx  (cost=0.00..7503.78 rows=443083 width=0) (actual time=58.680..58.680 rows=447626 loops=1)
                     Index Cond: (user_group_id = ANY ('{1,2}'::bigint[]))
                     Buffers: shared hit=3 read=1226
 Total runtime: 6364.689 ms
(27 wierszy)

The suspicious part is:
         ->  Seq Scan on contents this_  (cost=0.00..54713.92 rows=446176 width=2030) (actual time=0.048..915.724 rows=450517 loops=1)
               Filter: ((date_published <= '2014-05-26 12:23:31.557'::timestamp without time zone) AND ((publication_status IS NULL) OR ((publication_status)::text <> 'DRAFT'::text)) AND ((moderation_status IS NULL) OR ((moderation_status)::text <> 'DANGEROUS'::text)))

I don't understand why planner doesn't use indexes. The problem is there are about 0.5M rows satisfying condition (almost every row in the table). Could you please explain this behavior?

I'm using PostgreSQL 9.2.8 on Ubuntu 12.04 LTS x86_64

Best regards,
Grzegorz Olszewski

Re: Planner doesn't take indexes into account

From
AI Rumman
Date:
What is random_page_cost and seq_page_cost in your server?
And how many rows does the table have?


On Tue, May 27, 2014 at 2:09 PM, Grzegorz Olszewski <grzegorz.olszewski@outlook.com> wrote:
Hi,

I wonder why planner uses Seq Scan instead of Index Scan.

Here is my table (partial):
content.contents
-------------------------+-----------------------------+-----------------------------------------------------------------
 id                      | bigint                      | niepusty domyślnie nextval('content.contents_id_seq'::regclass)
 version                 | integer                     | niepusty
 date_published          | timestamp without time zone |
 moderation_status       | character varying(50)       |
 publication_status      | character varying(30)       |

And indexes (there are some other indexes too):
    "contents_id_pkey" PRIMARY KEY, btree (id)
    "contents_date_published_idx" btree (date_published)
    "contents_moderation_status_idx" btree (moderation_status)
    "contents_publication_status_idx" btree (publication_status)

I tried also creating following indexes:
    "contents_date_published_publication_status_moderation_statu_idx" btree (date_published, publication_status, moderation_status)
    "contents_publication_status_idx1" btree ((publication_status::text))
    "contents_moderation_status_idx1" btree ((moderation_status::text))

Then for this query (genrated by Hibernate):
explain (analyze, buffers) select count(*) as y0_ from content.contents this_ inner join content.content_categories cat1_ on this_.CONTENT_CATEGORY_ID=cat1_.ID where cat1_.name in ([...])
and this_.date_published<='2014-05-26 12:23:31.557000 +02:00:00'
and (this_.PUBLICATION_STATUS is null or this_.PUBLICATION_STATUS<>'DRAFT')
and (this_.moderation_status is null or this_.moderation_status<>'DANGEROUS')
and exists(select * from content.content_visibilities cv where cv.content_id = this_.ID and cv.user_group_id in (1,2));

Planner creates such plan:
                                                                                                                                  QUERY PLAN                                                                                                                                 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=31706.84..106020.81 rows=21871 width=2076) (actual time=1197.658..6012.406 rows=430218 loops=1)
   Hash Cond: (this_.id = cv.content_id)
   Buffers: shared hit=5 read=59031 written=3, temp read=47611 written=47549
   ->  Hash Join  (cost=2.22..56618.11 rows=22881 width=2076) (actual time=0.163..1977.304 rows=430221 loops=1)
         Hash Cond: (this_.content_category_id = cat1_.id)
         Buffers: shared hit=1 read=46829 written=1
         ->  Seq Scan on contents this_  (cost=0.00..54713.92 rows=446176 width=2030) (actual time=0.048..915.724 rows=450517 loops=1)
               Filter: ((date_published <= '2014-05-26 12:23:31.557'::timestamp without time zone) AND ((publication_status IS NULL) OR ((publication_status)::text <> 'DRAFT'::text)) AND ((moderation_status IS NULL) OR ((moderation_status)::text <> 'DANGEROUS'::text)))
               Rows Removed by Filter: 50
               Buffers: shared read=46829 written=1
         ->  Hash  (cost=2.17..2.17 rows=4 width=46) (actual time=0.089..0.089 rows=4 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               Buffers: shared hit=1
               ->  Seq Scan on content_categories cat1_  (cost=0.00..2.17 rows=4 width=46) (actual time=0.053..0.076 rows=4 loops=1)
                     Filter: ((name)::text = ANY ('{przeglad-prasy/rp,przeglad-prasy/parkiet,komunikat-z-rynku-pap-emitent,komunikat-z-rynku-pap-depesze}'::text[]))
                     Rows Removed by Filter: 74
                     Buffers: shared hit=1
   ->  Hash  (cost=24435.09..24435.09 rows=443083 width=8) (actual time=1197.146..1197.146 rows=447624 loops=1)
         Buckets: 4096  Batches: 32  Memory Usage: 560kB
         Buffers: shared hit=4 read=12202 written=2, temp written=1467
         ->  Bitmap Heap Scan on content_visibilities cv  (cost=7614.55..24435.09 rows=443083 width=8) (actual time=61.034..647.729 rows=447624 loops=1)
               Recheck Cond: (user_group_id = ANY ('{1,2}'::bigint[]))
               Buffers: shared hit=4 read=12202 written=2
               ->  Bitmap Index Scan on content_visibilities_user_group_id_idx  (cost=0.00..7503.78 rows=443083 width=0) (actual time=58.680..58.680 rows=447626 loops=1)
                     Index Cond: (user_group_id = ANY ('{1,2}'::bigint[]))
                     Buffers: shared hit=3 read=1226
 Total runtime: 6364.689 ms
(27 wierszy)

The suspicious part is:
         ->  Seq Scan on contents this_  (cost=0.00..54713.92 rows=446176 width=2030) (actual time=0.048..915.724 rows=450517 loops=1)
               Filter: ((date_published <= '2014-05-26 12:23:31.557'::timestamp without time zone) AND ((publication_status IS NULL) OR ((publication_status)::text <> 'DRAFT'::text)) AND ((moderation_status IS NULL) OR ((moderation_status)::text <> 'DANGEROUS'::text)))

I don't understand why planner doesn't use indexes. The problem is there are about 0.5M rows satisfying condition (almost every row in the table). Could you please explain this behavior?

I'm using PostgreSQL 9.2.8 on Ubuntu 12.04 LTS x86_64

Best regards,
Grzegorz Olszewski

Re: Planner doesn't take indexes into account

From
Grzegorz Olszewski
Date:
random_page_cost = 4.0
seq_page_cost = 1.0

There is about 500,000 rows and about 500 new rows each business day.

About 96% of rows meet given conditions, that is, count shoud be about 480,000.

BR,
Grzegorz Olszewski


Date: Tue, 27 May 2014 14:14:21 -0700
Subject: Re: [PERFORM] Planner doesn't take indexes into account
From: rummandba@gmail.com
To: grzegorz.olszewski@outlook.com
CC: pgsql-performance@postgresql.org

What is random_page_cost and seq_page_cost in your server?
And how many rows does the table have?


On Tue, May 27, 2014 at 2:09 PM, Grzegorz Olszewski <grzegorz.olszewski@outlook.com> wrote:
Hi,

I wonder why planner uses Seq Scan instead of Index Scan.

Here is my table (partial):
content.contents
-------------------------+-----------------------------+-----------------------------------------------------------------
 id                      | bigint                      | niepusty domyślnie nextval('content.contents_id_seq'::regclass)
 version                 | integer                     | niepusty
 date_published          | timestamp without time zone |
 moderation_status       | character varying(50)       |
 publication_status      | character varying(30)       |

And indexes (there are some other indexes too):
    "contents_id_pkey" PRIMARY KEY, btree (id)
    "contents_date_published_idx" btree (date_published)
    "contents_moderation_status_idx" btree (moderation_status)
    "contents_publication_status_idx" btree (publication_status)

I tried also creating following indexes:
    "contents_date_published_publication_status_moderation_statu_idx" btree (date_published, publication_status, moderation_status)
    "contents_publication_status_idx1" btree ((publication_status::text))
    "contents_moderation_status_idx1" btree ((moderation_status::text))

Then for this query (genrated by Hibernate):
explain (analyze, buffers) select count(*) as y0_ from content.contents this_ inner join content.content_categories cat1_ on this_.CONTENT_CATEGORY_ID=cat1_.ID where cat1_.name in ([...])
and this_.date_published<='2014-05-26 12:23:31.557000 +02:00:00'
and (this_.PUBLICATION_STATUS is null or this_.PUBLICATION_STATUS<>'DRAFT')
and (this_.moderation_status is null or this_.moderation_status<>'DANGEROUS')
and exists(select * from content.content_visibilities cv where cv.content_id = this_.ID and cv.user_group_id in (1,2));

Planner creates such plan:
                                                                                                                                  QUERY PLAN                                                                                                                                 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=31706.84..106020.81 rows=21871 width=2076) (actual time=1197.658..6012.406 rows=430218 loops=1)
   Hash Cond: (this_.id = cv.content_id)
   Buffers: shared hit=5 read=59031 written=3, temp read=47611 written=47549
   ->  Hash Join  (cost=2.22..56618.11 rows=22881 width=2076) (actual time=0.163..1977.304 rows=430221 loops=1)
         Hash Cond: (this_.content_category_id = cat1_.id)
         Buffers: shared hit=1 read=46829 written=1
         ->  Seq Scan on contents this_  (cost=0.00..54713.92 rows=446176 width=2030) (actual time=0.048..915.724 rows=450517 loops=1)
               Filter: ((date_published <= '2014-05-26 12:23:31.557'::timestamp without time zone) AND ((publication_status IS NULL) OR ((publication_status)::text <> 'DRAFT'::text)) AND ((moderation_status IS NULL) OR ((moderation_status)::text <> 'DANGEROUS'::text)))
               Rows Removed by Filter: 50
               Buffers: shared read=46829 written=1
         ->  Hash  (cost=2.17..2.17 rows=4 width=46) (actual time=0.089..0.089 rows=4 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               Buffers: shared hit=1
               ->  Seq Scan on content_categories cat1_  (cost=0.00..2.17 rows=4 width=46) (actual time=0.053..0.076 rows=4 loops=1)
                     Filter: ((name)::text = ANY ('{przeglad-prasy/rp,przeglad-prasy/parkiet,komunikat-z-rynku-pap-emitent,komunikat-z-rynku-pap-depesze}'::text[]))
                     Rows Removed by Filter: 74
                     Buffers: shared hit=1
   ->  Hash  (cost=24435.09..24435.09 rows=443083 width=8) (actual time=1197.146..1197.146 rows=447624 loops=1)
         Buckets: 4096  Batches: 32  Memory Usage: 560kB
         Buffers: shared hit=4 read=12202 written=2, temp written=1467
         ->  Bitmap Heap Scan on content_visibilities cv  (cost=7614.55..24435.09 rows=443083 width=8) (actual time=61.034..647.729 rows=447624 loops=1)
               Recheck Cond: (user_group_id = ANY ('{1,2}'::bigint[]))
               Buffers: shared hit=4 read=12202 written=2
               ->  Bitmap Index Scan on content_visibilities_user_group_id_idx  (cost=0.00..7503.78 rows=443083 width=0) (actual time=58.680..58.680 rows=447626 loops=1)
                     Index Cond: (user_group_id = ANY ('{1,2}'::bigint[]))
                     Buffers: shared hit=3 read=1226
 Total runtime: 6364.689 ms
(27 wierszy)

The suspicious part is:
         ->  Seq Scan on contents this_  (cost=0.00..54713.92 rows=446176 width=2030) (actual time=0.048..915.724 rows=450517 loops=1)
               Filter: ((date_published <= '2014-05-26 12:23:31.557'::timestamp without time zone) AND ((publication_status IS NULL) OR ((publication_status)::text <> 'DRAFT'::text)) AND ((moderation_status IS NULL) OR ((moderation_status)::text <> 'DANGEROUS'::text)))

I don't understand why planner doesn't use indexes. The problem is there are about 0.5M rows satisfying condition (almost every row in the table). Could you please explain this behavior?

I'm using PostgreSQL 9.2.8 on Ubuntu 12.04 LTS x86_64

Best regards,
Grzegorz Olszewski

Re: Planner doesn't take indexes into account

From
Heikki Linnakangas
Date:
On 05/28/2014 12:59 PM, Grzegorz Olszewski wrote:
> random_page_cost = 4.0
> seq_page_cost = 1.0
>
> There is about 500,000 rows and about 500 new rows each business day.
>
> About 96% of rows meet given conditions, that is, count shoud be about 480,000.

When such a large percentage of the rows match, a sequential scan is
indeed a better plan than an index scan. Sequential access is much
faster than random access.

- Heikki


Re: Planner doesn't take indexes into account

From
Shaun Thomas
Date:
On 05/28/2014 04:59 AM, Grzegorz Olszewski wrote:

> There is about 500,000 rows and about 500 new rows each business day.
>
> About 96% of rows meet given conditions, that is, count shoud be about
> 480,000.

Heikki is right on this. Indexes are not a magic secret sauce that are
always used simply because they exist. Think of it like this...

If the table really matches about 480,000 rows, by forcing it to use the
index, it has to perform *at least* 480,000 random seeks. Even if you
have a high-performance SSD array that can do 100,000 random reads per
second, you will need about five seconds just to read the data.

A sequence scan can perform that same operation in a fraction of a
second because it's faster to read the entire table and filter out the
*non* matching rows.

Indexes are really only used, or useful, when the number of matches is
much lower than the row count of the table. I highly recommend reading
up on cardinality and selectivity before creating more indexes. This
page in the documentation does a really good job:

http://www.postgresql.org/docs/9.3/static/row-estimation-examples.html

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: Planner doesn't take indexes into account

From
Grzegorz Olszewski
Date:
OK, thank you very much. I've tried similar query but with very few rows matching. In this case index was present in the plan.

BR,
Grzegorz Olszewski

> Date: Wed, 28 May 2014 08:31:38 -0500
> From: sthomas@optionshouse.com
> To: grzegorz.olszewski@outlook.com; rummandba@gmail.com
> CC: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Planner doesn't take indexes into account
>
> On 05/28/2014 04:59 AM, Grzegorz Olszewski wrote:
>
> > There is about 500,000 rows and about 500 new rows each business day.
> >
> > About 96% of rows meet given conditions, that is, count shoud be about
> > 480,000.
>
> Heikki is right on this. Indexes are not a magic secret sauce that are
> always used simply because they exist. Think of it like this...
>
> If the table really matches about 480,000 rows, by forcing it to use the
> index, it has to perform *at least* 480,000 random seeks. Even if you
> have a high-performance SSD array that can do 100,000 random reads per
> second, you will need about five seconds just to read the data.
>
> A sequence scan can perform that same operation in a fraction of a
> second because it's faster to read the entire table and filter out the
> *non* matching rows.
>
> Indexes are really only used, or useful, when the number of matches is
> much lower than the row count of the table. I highly recommend reading
> up on cardinality and selectivity before creating more indexes. This
> page in the documentation does a really good job:
>
> http://www.postgresql.org/docs/9.3/static/row-estimation-examples.html
>
> --
> Shaun Thomas
> OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
> 312-676-8870
> sthomas@optionshouse.com
>
> ______________________________________________
>
> See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance