Thread: : PostgreSQL Index behavior

: PostgreSQL Index behavior

From
Venkat Balaji
Date:
Hello Community,

I intend to understand further on PostgreSQL Index behavior on a "SELECT" statement.

We have a situation where-in Index on unique column is not being picked up as expected when used with-in the WHERE clause with other non-unique columns using AND operator.

explain  SELECT tv.short_code, tv.chn as pkg_subscription_chn,
               tv.vert as pkg_vert, ubs.campaign_id as campaign, 'none'::varchar as referer,
               CAST('CAMPAIGNWISE_SUBSCRIBER_BASE' AS VARCHAR) as vn, count(tv.msisdn) as n_count, '0'::numeric AS tot_revenue
               FROM campaign_base ubs
               JOIN tab_current_day_v2 tv
               ON ubs.ubs_seq_id = tv.ubs_seq_id
             AND tv.dt = CAST('2012-09-08' AS DATE)
             GROUP BY tv.short_code, tv.vert, tv.chn, ubs.campaign_id, vn;

                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 HashAggregate  (cost=77754.57..77754.58 rows=1 width=38)
   ->  Nested Loop  (cost=0.00..77754.56 rows=1 width=38)
         ->  Seq Scan on tab_current_day_v2 tv  (cost=0.00..77746.26 rows=1 width=39)
               Filter: (dt = '2012-09-08'::date)
         ->  Index Scan using cb_ubs_id_idx on campaign_base ubs  (cost=0.00..8.28 rows=1 width=15)
               Index Cond: (ubs.ubs_seq_id = tv.ubs_seq_id)
(6 rows)

The above plan shows "seq scan" on tab_current_day_v2 table, though there is an index on "ubs_seq_id" column which is an unique column. 

Can anyone please help us understand, why PostgreSQL optimizer is not prioritizing the unique column and hitting ubs_seq_id_idx Index here ?

Later -

We have created composite Index on "dt" (one distinct value) and "ubs_seq_id" (no duplicate values) and the index has been picked up.

Below is the scenario where-in the same query's plan picking up the composite Index.

prod-db=# create index concurrently tab_dt_ubs_seq_id_idx on tab_current_day_v2(dt,ubs_seq_id);
CREATE INDEX

prod-db=# explain  SELECT tv.short_code, tv.chn as pkg_subscription_chn,
               tv.vert as pkg_vert, ubs.campaign_id as campaign, 'none'::varchar as referer,
               CAST('CAMPAIGNWISE_SUBSCRIBER_BASE' AS VARCHAR) as vn, count(tv.msisdn) as n_count, '0'::numeric AS tot_revenue
            FROM campaign_base ubs
            JOIN tab_current_day_v2 tv
              ON ubs.ubs_seq_id = tv.ubs_seq_id
             AND tv.dt = CAST('2012-09-08' AS DATE)
             GROUP BY tv.short_code, tv.vert, tv.chn, ubs.campaign_id, vn;

                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=16.88..16.89 rows=1 width=38)
   ->  Nested Loop  (cost=0.00..16.86 rows=1 width=38)
         ->  Index Scan using tab_dt_ubs_seq_id_idx on tab_current_day_v2 tv  (cost=0.00..8.57 rows=1 width=39)
               Index Cond: (dt = '2012-09-08'::date)
         ->  Index Scan using cb_ubs_id_idx on campaign_base ubs  (cost=0.00..8.28 rows=1 width=15)
               Index Cond: (ubs.ubs_seq_id = tv.ubs_seq_id)
(6 rows)

I was expecting the above behavior without a composite Index. A column with most unique values must be picked up when multiple columns are used in WHERE clause using AND operator. Any thoughts ?

prod-db# \d tab_current_day_v2

         Table "public.tab_current_day_v2"
          Column          |           Type           | Modifiers
--------------------------+--------------------------+-----------
 dt                       | date                     |
 chn                      | character varying(10)    |
 vert                     | character varying(20)    |
 isdn            | character varying        |
 bc          | character varying(40)    |
 status                   | text                     |
 is_rene                  | boolean                  |
 age_in_sys               | integer                  |
 age_in_grace             | integer                  |
 has_prof                 | boolean                  |
 short_code               | character varying        |
 sub_vert                 | character varying(30)    |
 mode                     | character varying        |
 ubs_seq_id               | bigint                   |
 pkg_name                 | character varying(200)   |
 pkg_id                   | integer                  |
 subs_charge              | money                    |
 subs_time                | timestamp with time zone |
 ulq_seq_id               | bigint                   |
 valid_till_time          | timestamp with time zone |
 valid_from_time          | timestamp with time zone |
 latest_ube_seq_id        | bigint                   |
 latest_pkg_id            | integer                  |
 price                    | integer                  |
Indexes:
    "tab_dt_ubs_seq_id_idx" btree (dt, ubs_seq_id)
    "tab_isdn_idx" btree (msisdn)
    "tab_status_idx" btree (status)
    "ubs_seq_id_idx" btree (ubs_seq_id)

Below is the table structure and the uniqueness of each of the columns.

airtel_user_data_oltp=# select attname, n_distinct from pg_Stats where tablename='tab_current_day_v2';

         attname          | n_distinct
--------------------------+------------
 dt                       |          1
 chn                      |          7
 vert                     |         94
 isdn                     |  -0.727331
 bc                       |          4
 status                   |          3
 is_rene                  |          2
 age_in_sys               |       1018
 age_in_grac              |        369
 has_prof                 |          2
 short_code               |         23
 sub_vert                 |          5
 mode                     |          0
 ubs_seq_id               |         -1
 pkg_name                 |        461
 pkg_id                   |        461
 subs_charge              |          7
 subs_time                |         -1
 ulq_seq_id               |     122887
 valid_till_time          |  -0.966585
 valid_from_time          |  -0.962563
 latest_ube_seq_id        |         -1
 latest_pkg_id            |        475
 price                    |         18

(24 rows)

This is not an issue, but, would like to understand how PostgreSQL optimizer picks up Indexes in SELECT queries.

In an other scenario, we had used 4 columns in WHERE clause with AND operator with an Index on the column with most unique values -- The Index was picked up.

Looking forward for your help !

Regards,
VB

DISCLAIMER:

Please note that this message and any attachments may contain confidential and proprietary material and information and are intended only for the use of the intended recipient(s). If you are not the intended recipient, you are hereby notified that any review, use, disclosure, dissemination, distribution or copying of this message and any attachments is strictly prohibited. If you have received this email in error, please immediately notify the sender and delete this e-mail , whether electronic or printed. Please also note that any views, opinions, conclusions or commitments expressed in this message are those of the individual sender and do not necessarily reflect the views of Ver sé Innovation Pvt Ltd.

Re: : PostgreSQL Index behavior

From
Jeff Janes
Date:
On Mon, Sep 10, 2012 at 5:39 AM, Venkat Balaji <venkat.balaji@verse.in> wrote:
> Hello Community,
>
> I intend to understand further on PostgreSQL Index behavior on a "SELECT"
> statement.
>
> We have a situation where-in Index on unique column is not being picked up
> as expected when used with-in the WHERE clause with other non-unique columns
> using AND operator.
>
> explain  SELECT tv.short_code, tv.chn as pkg_subscription_chn,
>                tv.vert as pkg_vert, ubs.campaign_id as campaign,
> 'none'::varchar as referer,
>                CAST('CAMPAIGNWISE_SUBSCRIBER_BASE' AS VARCHAR) as vn,
> count(tv.msisdn) as n_count, '0'::numeric AS tot_revenue
>                FROM campaign_base ubs
>                JOIN tab_current_day_v2 tv
>                ON ubs.ubs_seq_id = tv.ubs_seq_id
>              AND tv.dt = CAST('2012-09-08' AS DATE)
>              GROUP BY tv.short_code, tv.vert, tv.chn, ubs.campaign_id, vn;
...
>
> The above plan shows "seq scan" on tab_current_day_v2 table, though there is
> an index on "ubs_seq_id" column which is an unique column.
>
> Can anyone please help us understand, why PostgreSQL optimizer is not
> prioritizing the unique column and hitting ubs_seq_id_idx Index here ?

The query where clause does not specify a constant value for
ubs_seq_id.  So it is likely that the only way to use that index would
be to reverse the order of the nested loop and seq scan the other
table.  Is there any reason to think that doing that would be faster?


>
> Later -
>
> We have created composite Index on "dt" (one distinct value) and
> "ubs_seq_id" (no duplicate values) and the index has been picked up.

Postgres seems to think that "dt" has no duplicate values, the
opposite of having one distinct value.
That is based on the estimates given in the explain plan, that teh seq
scan will return only one row after the filter on Filter: "(dt =
'2012-09-08'::date)".   This does seem to conflict with what you
report from pg_stats, but I'm not familiar with that view, and you
haven't told us what version of pgsql you are using.



> Below is the scenario where-in the same query's plan picking up the
> composite Index.

It is only using the first column of that composite index.  So if you
built a single column index just on dt, it would be picked up as well.

Cheers,

Jeff


Re: : PostgreSQL Index behavior

From
Venkat Balaji
Date:
Thank you Jeff !

My comments are inline.

> explain  SELECT tv.short_code, tv.chn as pkg_subscription_chn,
>                tv.vert as pkg_vert, ubs.campaign_id as campaign,
> 'none'::varchar as referer,
>                CAST('CAMPAIGNWISE_SUBSCRIBER_BASE' AS VARCHAR) as vn,
> count(tv.msisdn) as n_count, '0'::numeric AS tot_revenue
>                FROM campaign_base ubs
>                JOIN tab_current_day_v2 tv
>                ON ubs.ubs_seq_id = tv.ubs_seq_id
>              AND tv.dt = CAST('2012-09-08' AS DATE)
>              GROUP BY tv.short_code, tv.vert, tv.chn, ubs.campaign_id, vn;
...
>
> The above plan shows "seq scan" on tab_current_day_v2 table, though there is
> an index on "ubs_seq_id" column which is an unique column.
>
> Can anyone please help us understand, why PostgreSQL optimizer is not
> prioritizing the unique column and hitting ubs_seq_id_idx Index here ?

The query where clause does not specify a constant value for
ubs_seq_id.  So it is likely that the only way to use that index would
be to reverse the order of the nested loop and seq scan the other
table.  Is there any reason to think that doing that would be faster?

I believe, I missed an important point here. Yes, since the constant value is not provided for ubs_seq_id, Index scan is not a prime preference. Makes sense. Further analysis is explained below.
 
> Later -
>
> We have created composite Index on "dt" (one distinct value) and
> "ubs_seq_id" (no duplicate values) and the index has been picked up.

Postgres seems to think that "dt" has no duplicate values, the
opposite of having one distinct value.
That is based on the estimates given in the explain plan, that teh seq
scan will return only one row after the filter on Filter: "(dt =
'2012-09-08'::date)".   This does seem to conflict with what you
report from pg_stats, but I'm not familiar with that view, and you
haven't told us what version of pgsql you are using.

We are using PostgreSQL-9.0.1. 

Yes, "dt" has one distinct value all the time is generated on daily basis.
"2012-09-08" is an non-existent value, so, Postgres seems to think there are no duplicates.

If i pass on the value which is existing in the table "2012-09-11", PostgreSQL optimizer is picking up "Seq Scan" (what ever Index is existing).

In our scenario, we cannot expect an Index scan to happen, because I believe, following are the reasons -

ubs_seq_id column in campaign_base table has 1.2 m rows -- all distinct
ubs_seq_id column in tab_current_day_v2 table has 1.9 m rows -- all distinct
dt has only 1 distinct value.

All being used with AND operator, extracted rows will be minimum 1.2 m. So, I believe, "seq scan" is the best choice PG is opting for.

I got the point. Thanks !

Regards,
Venkat

DISCLAIMER:

Please note that this message and any attachments may contain confidential and proprietary material and information and are intended only for the use of the intended recipient(s). If you are not the intended recipient, you are hereby notified that any review, use, disclosure, dissemination, distribution or copying of this message and any attachments is strictly prohibited. If you have received this email in error, please immediately notify the sender and delete this e-mail , whether electronic or printed. Please also note that any views, opinions, conclusions or commitments expressed in this message are those of the individual sender and do not necessarily reflect the views of Ver sé Innovation Pvt Ltd.

Re: : PostgreSQL Index behavior

From
Scott Marlowe
Date:
On Wed, Sep 12, 2012 at 12:57 AM, Venkat Balaji <venkat.balaji@verse.in> wrote:

> We are using PostgreSQL-9.0.1.

You are missing almost 2 years of updates, bug fixes, and security fixes.


Re: : PostgreSQL Index behavior

From
Venkat Balaji
Date:


On Wed, Sep 12, 2012 at 7:42 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On Wed, Sep 12, 2012 at 12:57 AM, Venkat Balaji <venkat.balaji@verse.in> wrote:

> We are using PostgreSQL-9.0.1.

You are missing almost 2 years of updates, bug fixes, and security fixes.

Thank you Scott, We are planning to upgrade to the latest version (9.2) in the near future.

Regards,
VB


DISCLAIMER:

Please note that this message and any attachments may contain confidential and proprietary material and information and are intended only for the use of the intended recipient(s). If you are not the intended recipient, you are hereby notified that any review, use, disclosure, dissemination, distribution or copying of this message and any attachments is strictly prohibited. If you have received this email in error, please immediately notify the sender and delete this e-mail , whether electronic or printed. Please also note that any views, opinions, conclusions or commitments expressed in this message are those of the individual sender and do not necessarily reflect the views of Ver sé Innovation Pvt Ltd.

Re: : PostgreSQL Index behavior

From
Tomas Vondra
Date:
Dne 13.09.2012 11:22, Venkat Balaji napsal:
> On Wed, Sep 12, 2012 at 7:42 PM, Scott Marlowe
> <scott.marlowe@gmail.com [2]> wrote:
>
>> On Wed, Sep 12, 2012 at 12:57 AM, Venkat Balaji
>> <venkat.balaji@verse.in [1]> wrote:
>>
>> > We are using PostgreSQL-9.0.1.
>>
>> You are missing almost 2 years of updates, bug fixes, and security
>> fixes.
>
> Thank you Scott, We are planning to upgrade to the latest version
> (9.2) in the near future.

That was not the point. The last minor update in this branch (9.0) is
9.0.9. You're missing fixes and improvements that happened between 9.0.1
and 9.0.9, that's what Scott probably meant. And some of those fixes may
be quite important, so do the upgrade ASAP.

These minor updates are binary compatible, so all you need to do is
shutting down the DB, updating the binaries (e.g. installing a new
package) and starting the database again. Upgrading to 9.2 means you'll
have to do a dump/restore and possibly more.

Tomas