Thread: Partial index on date column
I have a table that is likely to grow over the next few years at a rate of 1K-2K rows/day. As the vast majority of the activity on the table (other than the inserts) will be selects of data for the current day, I have a cron job that drops and recreates a partial index just after midnight. It also vacuum analyzes the table.

-- Index: public.pbx_log_today_idx
CREATE INDEX pbx_log_today_idx ON pbx_log USING btree
    (pbx_time, pbx_call_type, pbx_digits_source, pbx_digits_destination)
    WHERE (pbx_date = '2003-03-06'::date);

I'm surprised by the following behaviour:

EXPLAIN SELECT * FROM pbx_log WHERE pbx_date = CURRENT_DATE;

Seq Scan on pbx_log  (cost=0.00..286.20 rows=1274 width=384)
  Filter: (pbx_date = ('now'::text)::date)

EXPLAIN SELECT * FROM pbx_log WHERE pbx_date = '2003-03-06';

Index Scan using pbx_log_today_idx on pbx_log  (cost=0.00..5.00 rows=0 width=384)
  Filter: (pbx_date = '2003-03-06'::date)

Is this just an oddity because I don't have masses of data yet (4500 rows right now), or is this something the optimizer cannot handle?

Regards, Dave.

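(For reference, a sketch of what such a nightly cron job might run. Only the CREATE INDEX above is from the original message; the DROP and VACUUM statements and the ordering are assumptions, and the script would substitute each new day's date literal:)

DROP INDEX pbx_log_today_idx;
CREATE INDEX pbx_log_today_idx ON pbx_log USING btree
    (pbx_time, pbx_call_type, pbx_digits_source, pbx_digits_destination)
    WHERE (pbx_date = '2003-03-07'::date);  -- new day's date filled in by the script
VACUUM ANALYZE pbx_log;
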
On Thu, 6 Mar 2003, Dave Page wrote:

> I have a table that is likely to grow over the next few years at a rate
> of 1K-2K rows/day. As the vast majority of the activity on the table
> (other than the inserts) will be selects of data for the current day, I
> have a cron job that drops and recreates a partial index just after
> midnight. It also vacuum analyzes the table.
>
> -- Index: public.pbx_log_today_idx
> CREATE INDEX pbx_log_today_idx ON pbx_log USING btree (pbx_time,
> pbx_call_type, pbx_digits_source, pbx_digits_destination) WHERE
> (pbx_date = '2003-03-06'::date);
>
> I'm surprised by the following behaviour:
>
> EXPLAIN SELECT * FROM pbx_log WHERE pbx_date = CURRENT_DATE;
>
> Seq Scan on pbx_log  (cost=0.00..286.20 rows=1274 width=384)
>   Filter: (pbx_date = ('now'::text)::date)
>
> EXPLAIN SELECT * FROM pbx_log WHERE pbx_date = '2003-03-06';
>
> Index Scan using pbx_log_today_idx on pbx_log  (cost=0.00..5.00 rows=0
> width=384)
>   Filter: (pbx_date = '2003-03-06'::date)
>
> Is this just an oddity because I don't have masses of data yet (4500
> rows right now), or is this something the optimizer cannot handle?

It's not an oddity, it's the planner trying to decide which is the better choice, a seq scan or an index scan. If you had 150,000,000 rows and asked for 149,999,000 of them, it would be counterproductive to use an index, since you're gonna visit nearly every page of the table anyway. If you were gonna get 1,000 rows out of 150,000,000, then an index scan makes more sense. Somewhere in between is the switch point where it makes sense to go from one to the other.

Since you've got 4500 rows and the planner expects 1274 of them to match, it's likely that the database will have to read the whole table anyway, so it just does the seq scan.

Look for random_page_cost and a few other settings near it in the $PGDATA/postgresql.conf file.

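(A quick way to experiment with that setting for a single session, rather than editing postgresql.conf, is a plain SET; the value 2 below is only an illustration, not a recommendation:)

SET random_page_cost = 2;   -- tell the planner random page reads are cheaper
EXPLAIN SELECT * FROM pbx_log WHERE pbx_date = '2003-03-06';
RESET random_page_cost;     -- back to the configured default
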
"Dave Page" <dpage@vale-housing.co.uk> writes: > CREATE INDEX pbx_log_today_idx ON pbx_log USING btree (pbx_time, > pbx_call_type, pbx_digits_source, pbx_digits_destination) WHERE > (pbx_date = '2003-03-06'::date); > I'm surprised by the following behaviour: > EXPLAIN SELECT * FROM pbx_log WHERE pbx_date = CURRENT_DATE; > [ no indexscan ] > Is this just an oddity because I don't have masses of data yet (4500 > rows right now), or is this something the optimizer cannot handle? The optimizer does not think that "pbx_date = CURRENT_DATE" satisfies the partial index's WHERE condition. I don't see any really good way around this; to improve matters there'd need to be some concept of a plan that is only good for a limited time. regards, tom lane
It's rumoured that Tom Lane once said:
> "Dave Page" <dpage@vale-housing.co.uk> writes:
>> CREATE INDEX pbx_log_today_idx ON pbx_log USING btree (pbx_time,
>> pbx_call_type, pbx_digits_source, pbx_digits_destination) WHERE
>> (pbx_date = '2003-03-06'::date);
>
>> I'm surprised by the following behaviour:
>
>> EXPLAIN SELECT * FROM pbx_log WHERE pbx_date = CURRENT_DATE;
>> [ no indexscan ]
>
>> Is this just an oddity because I don't have masses of data yet (4500
>> rows right now), or is this something the optimizer cannot handle?
>
> The optimizer does not think that "pbx_date = CURRENT_DATE" satisfies
> the partial index's WHERE condition. I don't see any really good way
> around this; to improve matters there'd need to be some concept of a
> plan that is only good for a limited time.

Oh, OK. Thanks Tom. I can obviously work around this in my PHP code, it just struck me as odd. I assume then that the optimizer doesn't execute the function, and that that's done later on? Would the same be true of simple expressions such as 1 + 2?

Regards, Dave.

"Dave Page" <dpage@vale-housing.co.uk> writes: > It's rumoured that Tom Lane once said: >> The optimizer does not think that "pbx_date = CURRENT_DATE" satisfies >> the partial index's WHERE condition. I don't see any really good way >> around this; to improve matters there'd need to be some concept of a >> plan that is only good for a limited time. > Oh, OK. Thanks Tom. I can obviously work around this in my PHP code, it > just struck me as odd. I assume then that the optimizer doesn't execute > the function, and that that's done later on? Would the same be true of > simple expressions such as 1 + 2? No, the optimizer will simplify constant expressions as much as it can. But CURRENT_DATE is, by definition, not a constant expression. You could cheat: make a wrapper function for CURRENT_DATE that is marked IMMUTABLE (or isCachable, pre-7.3). Then given something like "WHERE pbx_date = my_date()", the optimizer would fold my_date() to a constant, see that the constant satisfies the index's WHERE clause, and away you go. You'd have to be careful where you used this trick --- in a prepared query or a plpgsql function, the pre-evaluation of my_date() would come back to haunt you (unless maybe you are careful to end all your client sessions at midnight). But for interactive queries it'd work well enough. regards, tom lane
> The optimizer does not think that "pbx_date = CURRENT_DATE" satisfies the
> partial index's WHERE condition. I don't see any really good way around
> this; to improve matters there'd need to be some concept of a plan that
> is only good for a limited time.

It's the same as the slight issue I had:

CREATE INDEX users_users_referrer_idx ON users_users(referrer)
    WHERE (referrer IS NOT NULL);

usa=# explain analyze select * from users_users where referrer=1;
                              QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on users_users  (cost=0.00..3.89 rows=8 width=235) (actual time=10.51..13.47 rows=8 loops=1)
   Filter: (referrer = 1)

usa=# explain analyze select * from users_users where referrer=1 and referrer is not null;
                              QUERY PLAN
------------------------------------------------------------------------
 Index Scan using users_users_referrer_idx on users_users  (cost=0.00..3.01 rows=1 width=235) (actual time=17.12..17.36 rows=8 loops=1)

Obviously to you and I, referrer=1 implies that referrer is not null, but the planner doesn't know that. You often have to add a redundant clause to the query to ensure that the partial index is used.

Chris

"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes: > Obviously to you and I, referrer=1 implies that referrer is not null, but > the planner doesn't know that. Actually the planner does make exactly that deduction in some other contexts --- but I'm hesitant to expend the cycles for partial indexes. Partial-index condition matching is a horribly difficult problem in general, and we only attempt a few limited cases right now. I don't think we want to put a general-purpose theorem prover in there --- so it comes down to the likelihood of spotting a match in some cases, versus the wasted cycles of checking for a match in every query that doesn't fit the pattern. regards, tom lane
> "Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes: > > Obviously to you and I, referrer=1 implies that referrer is not null, but > > the planner doesn't know that. > > Actually the planner does make exactly that deduction in some other > contexts --- but I'm hesitant to expend the cycles for partial indexes. > Partial-index condition matching is a horribly difficult problem in > general, and we only attempt a few limited cases right now. I don't > think we want to put a general-purpose theorem prover in there --- > so it comes down to the likelihood of spotting a match in some cases, > versus the wasted cycles of checking for a match in every query that > doesn't fit the pattern. Yeah, it's not really a problem for me, I just put the extra clause in. Is indexing excluding NULLs a common application of partial indexes? It's basically all I use it for, when a column has like 90-95% NULLS and I want to exclude them from the index. Is it worth hard-coding in the IS NOT NULL case? Chris
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes: >> Partial-index condition matching is a horribly difficult problem in >> general, and we only attempt a few limited cases right now. > Is it worth hard-coding in the IS NOT NULL case? Damifino. I have no fundamental objection to doing so --- but I'd want to see some sort of cost-benefit argument showing that it wouldn't be a net loss on average. It's real easy to blow a few cycles on every query looking for cases that don't show up often enough to justify the distributed cost :-(. It helps a lot if you can put in short-circuits that prevent the matching work from being done on simple queries. For example, the parser/rewriter/planner take care to keep track of whether a query contains any sub-SELECTs, and if you look in the planner you will notice quite a lot of work that gets short-circuited when there aren't any. I'm not sure how to make a short-circuit test for this case, however. regards, tom lane
Christopher Kings-Lynne wrote on Fri, 07.03.2003 at 07:28:
> Yeah, it's not really a problem for me, I just put the extra clause in.
>
> Is indexing excluding NULLs a common application of partial indexes?

For me it is ;)

> It's basically all I use it for, when a column has like 90-95% NULLS
> and I want to exclude them from the index.
> Is it worth hard-coding in the IS NOT NULL case?

I'd vote for it.

------------
Hannu