Thread: On using "date 'XXX' + interval 'XXX'" vs "date 'XXX'"
Hello-

In rewriting some queries I noticed a huge performance penalty when using a
"date + interval" summation in the SELECT statement, versus a single simple
"date".  It is almost as though the "date + interval" is being calculated
for each row...  Much easier to demonstrate than it is to explain (note the
runtimes of the queries, 72 ms versus 482 ms):

===============================================
main_v0_8=# explain analyze select count(*) from sample_tracker where
initdate >= date '2002-02-01';
NOTICE:  QUERY PLAN:

Aggregate  (cost=607.24..607.24 rows=1 width=0) (actual time=72.08..72.08 rows=1 loops=1)
  ->  Seq Scan on sample_tracker  (cost=0.00..595.74 rows=4600 width=0) (actual time=0.04..63.62 rows=4266 loops=1)
Total runtime: 72.20 msec

EXPLAIN
===============================================
main_v0_8=# explain analyze select count(*) from sample_tracker where
initdate >= date '2002-01-01' + interval '1 month';
NOTICE:  QUERY PLAN:

Aggregate  (cost=738.23..738.23 rows=1 width=0) (actual time=482.49..482.49 rows=1 loops=1)
  ->  Seq Scan on sample_tracker  (cost=0.00..723.98 rows=5700 width=0) (actual time=0.13..470.94 rows=4266 loops=1)
Total runtime: 482.62 msec

EXPLAIN
===============================================
main_v0_8=# explain analyze select date '2002-01-01' + interval '1 month';
NOTICE:  QUERY PLAN:

Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.04..0.05 rows=1 loops=1)
Total runtime: 0.09 msec
===============================================
main_v0_8=# select version();
                          version
-------------------------------------------------------------
 PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.96

Seems like this could be something ripe for optimization...

-- 
Jon Lapham
Extracta Moléculas Naturais, Rio de Janeiro, Brasil
email: lapham@extracta.com.br      web: http://www.extracta.com.br/
Jon Lapham <lapham@extracta.com.br> writes:
> In rewriting some queries I noticed a huge performance penalty when
> using a "date + interval" summation in the SELECT statement, versus a
> single simple "date".  It is almost as though the "date + interval" is
> being calculated for each row...

Try coercing the sum result back to a date.  It's a little easier to see
what's happening in current sources:

regression=# create table sample_tracker (initdate date primary key);
regression=# set enable_seqscan TO 0;
SET
regression=# explain select count(*) from sample_tracker where
regression-# initdate >= date '2002-02-01';
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Aggregate  (cost=46.33..46.33 rows=1 width=0)
   ->  Index Scan using sample_tracker_pkey on sample_tracker  (cost=0.00..45.50 rows=333 width=0)
         Index Cond: (initdate >= '2002-02-01'::date)
(3 rows)

regression=# explain select count(*) from sample_tracker where
regression-# initdate >= date '2002-01-01' + interval '1 month';
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Aggregate  (cost=100000025.83..100000025.83 rows=1 width=0)
   ->  Seq Scan on sample_tracker  (cost=100000000.00..100000025.00 rows=333 width=0)
         Filter: ("timestamp"(initdate) >= '2002-02-01 00:00:00'::timestamp without time zone)
(3 rows)

Note the Filter line in the second plan: since date + interval yields a
timestamp, initdate is coerced to timestamp for every row, and the index
on the date column can't be used.  Writing

	date(date '2002-01-01' + interval '1 month')

gets me back to the first plan.

			regards, tom lane
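The coercion Tom describes can be spelled either function-style or with a
cast; a minimal sketch (assuming PostgreSQL's timestamp-to-date cast, which
both spellings rely on):

```sql
-- Coerce the date + interval sum (a timestamp) back to a date, so the
-- comparison stays date-vs-date and the index on initdate is usable:
SELECT count(*)
FROM sample_tracker
WHERE initdate >= date(date '2002-01-01' + interval '1 month');

-- Equivalent cast spelling:
SELECT count(*)
FROM sample_tracker
WHERE initdate >= (date '2002-01-01' + interval '1 month')::date;
```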
Tom Lane wrote:
> Try coercing the sum result back to a date.

Yup, that does the trick.  As usual, thanks Tom.

-Jon

PS: Is this optimizable?  (ie: have the coercion be implicit in
situations like this)  It sure seems like it is a good candidate...
> > Try coercing the sum result back to a date.
...
> PS: Is this optimizable?  (ie: have the coercion be implicit in
> situations like this)  It sure seems like it is a good candidate...

Not at the parser or optimizer level.  You *could* have had hours,
minutes, or seconds in that interval value you specified, in which case
you would be truncating data to get back to a date.  We'd need more
infrastructure to somehow know when an optimization like that is safe.

Or, we could split the INTERVAL type into the (ugh) bunch-o-types
envisioned by the SQL standard.  YEAR, MONTH, YEAR TO MONTH, and DAY
intervals could be converted directly to dates rather than timestamps.
This would allow the optimizer to know what the output range would be,
whereas now the range info is just used for input and output (and is
usually a don't-care internally, since the other fields are zeros).

- Thomas
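To make the truncation point concrete, a minimal illustration (the values
shown follow from PostgreSQL's date + interval arithmetic; the cast in the
last line does exactly what an implicit coercion would have to do silently):

```sql
-- date + interval yields a timestamp, because an interval may carry a
-- time-of-day component that a date cannot represent:
SELECT date '2002-01-01' + interval '1 month';           -- 2002-02-01 00:00:00
SELECT date '2002-01-01' + interval '1 month 3 hours';   -- 2002-02-01 03:00:00

-- Casting back to date discards the hours; doing this implicitly would
-- silently change the meaning of the second expression:
SELECT (date '2002-01-01' + interval '1 month 3 hours')::date;   -- 2002-02-01
```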