Thread: Partial index on date column

Partial index on date column

From
"Dave Page"
Date:
I have a table that is likely to grow over the next few years at a rate
of 1K-2K rows/day. As the vast majority of the activity on the table
(other than the inserts) will be selects of data for the current day, I
have a cron job that drops and recreates a partial index just after
midnight. It also vacuum analyzes the table.

-- Index: public.pbx_log_today_idx
CREATE INDEX pbx_log_today_idx ON pbx_log USING btree (pbx_time,
pbx_call_type, pbx_digits_source, pbx_digits_destination) WHERE
(pbx_date = '2003-03-06'::date);

I'm surprised by the following behaviour:

EXPLAIN SELECT * FROM pbx_log WHERE pbx_date = CURRENT_DATE;

Seq Scan on pbx_log  (cost=0.00..286.20 rows=1274 width=384) Filter: (pbx_date = ('now'::text)::date)


EXPLAIN SELECT * FROM pbx_log WHERE pbx_date = '2003-03-06';

Index Scan using pbx_log_today_idx on pbx_log  (cost=0.00..5.00 rows=0
width=384) Filter: (pbx_date = '2003-03-06'::date)

Is this just an oddity because I don't have masses of data yet (4500
rows right now), or is this something the optimizer cannot handle?

Regards, Dave.


Re: Partial index on date column

From
"scott.marlowe"
Date:
On Thu, 6 Mar 2003, Dave Page wrote:

> I have a table that is likely to grow over the next few years at a rate
> of 1K-2K rows/day. As the vast majority of the activity on the table
> (other than the inserts) will be selects of data for the current day, I
> have a cron job that drops and recreates a partial index just after
> midnight. It also vacuum analyzes the table.
> 
> -- Index: public.pbx_log_today_idx
> CREATE INDEX pbx_log_today_idx ON pbx_log USING btree (pbx_time,
> pbx_call_type, pbx_digits_source, pbx_digits_destination) WHERE
> (pbx_date = '2003-03-06'::date);
> 
> I'm surprised by the following behaviour:
> 
> EXPLAIN SELECT * FROM pbx_log WHERE pbx_date = CURRENT_DATE;
> 
> Seq Scan on pbx_log  (cost=0.00..286.20 rows=1274 width=384)
>   Filter: (pbx_date = ('now'::text)::date)
> 
> 
> EXPLAIN SELECT * FROM pbx_log WHERE pbx_date = '2003-03-06';
> 
> Index Scan using pbx_log_today_idx on pbx_log  (cost=0.00..5.00 rows=0
> width=384)
>   Filter: (pbx_date = '2003-03-06'::date)
> 
> Is this just an oddity because I don't have masses of data yet (4500
> rows right now), or is this something the optimizer cannot handle?

It's not an oddity, it's the planner trying to decide which is the better 
choice, a seq scan or an index scan.  If you had 150,000,000 rows and 
asked for 149,999,000 of them, it would be counterproductive to use an 
index, since you're gonna visit nearly every page of the table anyway.

If you were gonna get 1,000 rows out of 150,000,000 then an index scan 
makes more sense,

Somewhere in between is the switch point where going from one to the other 
makes sense.

Since you've got 4500 rows and asked for 1274 of them it's likely that the 
database will have to read the whole table anyway, so it goes ahead and 
does it.

Look for random_page_cost and a few other settings near it in the 
$PGDATA/postgresql.conf file.



Re: Partial index on date column

From
Tom Lane
Date:
"Dave Page" <dpage@vale-housing.co.uk> writes:
> CREATE INDEX pbx_log_today_idx ON pbx_log USING btree (pbx_time,
> pbx_call_type, pbx_digits_source, pbx_digits_destination) WHERE
> (pbx_date = '2003-03-06'::date);

> I'm surprised by the following behaviour:

> EXPLAIN SELECT * FROM pbx_log WHERE pbx_date = CURRENT_DATE;
> [ no indexscan ]

> Is this just an oddity because I don't have masses of data yet (4500
> rows right now), or is this something the optimizer cannot handle?

The optimizer does not think that "pbx_date = CURRENT_DATE" satisfies the
partial index's WHERE condition.  I don't see any really good way around
this; to improve matters there'd need to be some concept of a plan that
is only good for a limited time.
        regards, tom lane


Re: Partial index on date column

From
"Dave Page"
Date:
It's rumoured that Tom Lane once said:
> "Dave Page" <dpage@vale-housing.co.uk> writes:
>> CREATE INDEX pbx_log_today_idx ON pbx_log USING btree (pbx_time,
>> pbx_call_type, pbx_digits_source, pbx_digits_destination) WHERE
>> (pbx_date = '2003-03-06'::date);
>
>> I'm surprised by the following behaviour:
>
>> EXPLAIN SELECT * FROM pbx_log WHERE pbx_date = CURRENT_DATE;
>> [ no indexscan ]
>
>> Is this just an oddity because I don't have masses of data yet (4500
>> rows right now), or is this something the optimizer cannot handle?
>
> The optimizer does not think that "pbx_date = CURRENT_DATE" satisfies
> the partial index's WHERE condition.  I don't see any really good way
> around this; to improve matters there'd need to be some concept of a
> plan that is only good for a limited time.

Oh, OK. Thanks Tom. I can obviously work around this in my PHP code, it
just struck me as odd. I assume then that the optimizer doesn't execute
the function, and that that's done later on? Would the same be true of
simple expressions such as 1 + 2?
Regards, Dave.




Re: Partial index on date column

From
Tom Lane
Date:
"Dave Page" <dpage@vale-housing.co.uk> writes:
> It's rumoured that Tom Lane once said:
>> The optimizer does not think that "pbx_date = CURRENT_DATE" satisfies
>> the partial index's WHERE condition.  I don't see any really good way
>> around this; to improve matters there'd need to be some concept of a
>> plan that is only good for a limited time.

> Oh, OK. Thanks Tom. I can obviously work around this in my PHP code, it
> just struck me as odd. I assume then that the optimizer doesn't execute
> the function, and that that's done later on? Would the same be true of
> simple expressions such as 1 + 2?

No, the optimizer will simplify constant expressions as much as it can.
But CURRENT_DATE is, by definition, not a constant expression.

You could cheat: make a wrapper function for CURRENT_DATE that is marked
IMMUTABLE (or isCachable, pre-7.3).  Then given something like "WHERE
pbx_date = my_date()", the optimizer would fold my_date() to a constant,
see that the constant satisfies the index's WHERE clause, and away you
go.

You'd have to be careful where you used this trick --- in a prepared
query or a plpgsql function, the pre-evaluation of my_date() would come
back to haunt you (unless maybe you are careful to end all your client
sessions at midnight).  But for interactive queries it'd work well
enough.
        regards, tom lane


Re: Partial index on date column

From
"Christopher Kings-Lynne"
Date:
> The optimizer does not think that "pbx_date = CURRENT_DATE" satisfies the
> partial index's WHERE condition.  I don't see any really good way around
> this; to improve matters there'd need to be some concept of a plan that
> is only good for a limited time.

It's the same as the slight issue I had:

CREATE INDEX users_users_referrer_idx  ON users_users(referrer) WHERE
(referrer IS NOT NULL);

usa=# explain analyze select * from users_users where referrer=1;                                             QUERY
PLAN
----------------------------------------------------------------------------
---------------------------Seq Scan on users_users  (cost=0.00..3.89 rows=8 width=235) (actual
time=10.51..13.47 rows=8 loops=1)  Filter: (referrer = 1)

usa=# explain analyze select * from users_users where referrer=1 and
referrer is not null;                                                              QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------------Index Scan using users_users_referrer_idx on users_users
(cost=0.00..3.01
rows=1 width=235) (actual time=17.12..17.36 rows=8 loops=1)

Obviously to you and I, referrer=1 implies that referrer is not null, but
the planner doesn't know that.  You often have to add a redundant clause to
the query to ensure that the partial index is used.

Chris



Re: Partial index on date column

From
Tom Lane
Date:
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
> Obviously to you and I, referrer=1 implies that referrer is not null, but
> the planner doesn't know that.

Actually the planner does make exactly that deduction in some other
contexts --- but I'm hesitant to expend the cycles for partial indexes.
Partial-index condition matching is a horribly difficult problem in
general, and we only attempt a few limited cases right now.  I don't
think we want to put a general-purpose theorem prover in there ---
so it comes down to the likelihood of spotting a match in some cases,
versus the wasted cycles of checking for a match in every query that
doesn't fit the pattern.
        regards, tom lane


Re: Partial index on date column

From
"Christopher Kings-Lynne"
Date:
> "Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
> > Obviously to you and I, referrer=1 implies that referrer is not null,
but
> > the planner doesn't know that.
>
> Actually the planner does make exactly that deduction in some other
> contexts --- but I'm hesitant to expend the cycles for partial indexes.
> Partial-index condition matching is a horribly difficult problem in
> general, and we only attempt a few limited cases right now.  I don't
> think we want to put a general-purpose theorem prover in there ---
> so it comes down to the likelihood of spotting a match in some cases,
> versus the wasted cycles of checking for a match in every query that
> doesn't fit the pattern.

Yeah, it's not really a problem for me, I just put the extra clause in.

Is indexing excluding NULLs a common application of partial indexes?  It's
basically all I use it for, when a column has like 90-95% NULLS and I want
to exclude them from the index.  Is it worth hard-coding in the IS NOT NULL
case?

Chris



Re: Partial index on date column

From
Tom Lane
Date:
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
>> Partial-index condition matching is a horribly difficult problem in
>> general, and we only attempt a few limited cases right now.

> Is it worth hard-coding in the IS NOT NULL case?

Damifino.  I have no fundamental objection to doing so --- but I'd want
to see some sort of cost-benefit argument showing that it wouldn't be
a net loss on average.  It's real easy to blow a few cycles on every
query looking for cases that don't show up often enough to justify the
distributed cost :-(.

It helps a lot if you can put in short-circuits that prevent the
matching work from being done on simple queries.  For example, the
parser/rewriter/planner take care to keep track of whether a query
contains any sub-SELECTs, and if you look in the planner you will notice
quite a lot of work that gets short-circuited when there aren't any.
I'm not sure how to make a short-circuit test for this case, however.
        regards, tom lane


Re: Partial index on date column

From
Hannu Krosing
Date:
Christopher Kings-Lynne kirjutas R, 07.03.2003 kell 07:28:
> Yeah, it's not really a problem for me, I just put the extra clause in.
> 
> Is indexing excluding NULLs a common application of partial indexes? 

For me it is ;)

> It's
> basically all I use it for, when a column has like 90-95% NULLS and I want
> to exclude them from the index.  

> Is it worth hard-coding in the IS NOT NULL case?

I'd vote for it.

------------
Hannu