Thread: "Interesting" query planning for timestamp ranges in where clause?
"Interesting" query planning for timestamp ranges in where clause?
From: rvigmbdc@umail.furryterror.org (Zygo Blaxell)
I have a table with a few million rows of temperature data keyed by timestamp. I want to group these rows by timestamp intervals (e.g. every 32 seconds), compute aggregate functions on the columns, and ultimately feed the result into a graph-drawing web thingy. I'm trying a few different ways to get what seems to be the same data, and seeing some odd behavior from the query planner.

The table looks like this:

ilt=# \d+ temp
                          Table "public.temp"
 Column |              Type              | Modifiers | Description
--------+--------------------------------+-----------+-------------
 t      | timestamp(0) without time zone | not null  |
 t1     | double precision               |           |
 t2     | double precision               |           |
 t3     | double precision               |           |
 t4     | double precision               |           |
 t5     | double precision               |           |
 t6     | double precision               |           |
 t7     | double precision               |           |
 t8     | double precision               |           |
Indexes:
    "temp_pkey" primary key, btree (t)

ilt=# select count(*) from temp;
  count
---------
 3316004
(1 row)

Time: 18144.953 ms

I have a function for computing rounded timestamps which uses abstime (any better suggestions gladly appreciated...):

ilt=# \df+ time_bucket
                                                             List of functions
 Result data type | Schema |    Name     |        Argument data types        |  Owner   | Language |                     Source code                     | Description
------------------+--------+-------------+-----------------------------------+----------+----------+-----------------------------------------------------+-------------
 abstime          | public | time_bucket | timestamp with time zone, integer | zblaxell | sql      | select abstime(int4(extract(epoch from $1)/$2)*$2); |
(1 row)

Now suppose I want a table of the last 24 hours' data at 32 second intervals. I might try a WHERE clause with now() and now() - interval '1 day':

ilt=# explain analyze select time_bucket(t, 32), avg(t1), avg(t2), avg(t3), avg(t4), avg(t5), avg(t6),
avg(t7), avg(t8) from temp where t between now() - interval '1 day' and now()
group by time_bucket(t, 32) order by time_bucket(t, 32) desc;

                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=210842.95..212603.60 rows=30620 width=72) (actual time=39835.357..41150.006 rows=2697 loops=1)
   ->  Sort  (cost=210842.95..210919.50 rows=30620 width=72) (actual time=39835.064..40002.349 rows=85360 loops=1)
         Sort Key: ((int4((date_part('epoch'::text, (t)::timestamp with time zone) / 32::double precision)) * 32))::abstime
         ->  Seq Scan on "temp"  (cost=0.00..207797.15 rows=30620 width=72) (actual time=35275.231..38323.524 rows=85360 loops=1)
               Filter: (((t)::timestamp with time zone >= (now() - '1 day'::interval)) AND ((t)::timestamp with time zone <= now()))
 Total runtime: 41165.330 ms
(6 rows)

41 seconds is a long time for an 86400-row query, especially if I want to draw a graph once per sampling interval. This is way too slow, so I gave up on this for a while, and I spent a day playing around with other stuff. During that day I accidentally discovered that there's a very different way to do this query.
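(An aside on the bucketing function, since better suggestions were invited: abstime was eventually removed from PostgreSQL in version 12, so on later versions a variant that stays in ordinary timestamp types is needed. The following is a sketch of one such variant, an assumption rather than anything from this thread:

    -- hypothetical abstime-free time_bucket; to_timestamp() turns the
    -- rounded epoch value back into a timestamp with time zone
    create or replace function time_bucket(timestamp with time zone, integer)
    returns timestamp with time zone
    language sql
    as $$ select to_timestamp(floor(extract(epoch from $1) / $2) * $2) $$;

PostgreSQL 14 and later also ship a built-in date_bin(stride, source, origin) that does this rounding directly.)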
Actually, I think it's equivalent, but the query planner disagrees:

ilt=# explain analyze select time_bucket(t, 32), avg(t1), avg(t2), avg(t3), avg(t4), avg(t5), avg(t6),
avg(t7), avg(t8) from temp where t between time_bucket(now() - interval '1 day', 1) and time_bucket(now(), 1)
group by time_bucket(t, 32) order by time_bucket(t, 32) desc;

                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=5197.30..6958.64 rows=30632 width=72) (actual time=4170.763..5611.031 rows=2697 loops=1)
   ->  Sort  (cost=5197.30..5273.88 rows=30632 width=72) (actual time=4169.819..4336.096 rows=85360 loops=1)
         Sort Key: ((int4((date_part('epoch'::text, (t)::timestamp with time zone) / 32::double precision)) * 32))::abstime
         ->  Index Scan using temp_pkey on "temp"  (cost=0.00..2150.53 rows=30632 width=72) (actual time=0.278..2090.791 rows=85360 loops=1)
               Index Cond: ((t >= (((int4((date_part('epoch'::text, (now() - '1 day'::interval)) / 1::double precision)) * 1))::abstime)::timestamp without time zone) AND (t <= (((int4((date_part('epoch'::text, now()) / 1::double precision)) * 1))::abstime)::timestamp without time zone))
 Total runtime: 5639.385 ms
(6 rows)

Another query that is slightly faster uses timestamps that are constants supplied by the application. I did two variations, one using the 'today' and 'yesterday' keywords, and one with literal dates. There wasn't a significant difference between those, and they look like this:

ilt=# select now();
              now
-------------------------------
 2004-06-15 22:41:46.507174-04
(1 row)

ilt=# explain analyze select time_bucket(t, 32), avg(t1), avg(t2), avg(t3), avg(t4), avg(t5), avg(t6),
avg(t7), avg(t8) from temp where t between 'yesterday 22:41:46' and 'today 22:41:46'
group by time_bucket(t, 32) order by time_bucket(t, 32) desc;

                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=34083.64..43162.37 rows=157891 width=72) (actual time=3533.097..4738.877 rows=2697 loops=1)
   ->  Sort  (cost=34083.64..34478.37 rows=157891 width=72) (actual time=3532.455..3679.959 rows=85362 loops=1)
         Sort Key: ((int4((date_part('epoch'::text, (t)::timestamp with time zone) / 32::double precision)) * 32))::abstime
         ->  Index Scan using temp_pkey on "temp"  (cost=0.00..5940.88 rows=157891 width=72) (actual time=0.227..1907.830 rows=85362 loops=1)
               Index Cond: ((t >= '2004-06-14 22:41:46'::timestamp without time zone) AND (t <= '2004-06-15 22:41:46'::timestamp without time zone))
 Total runtime: 4755.107 ms
(6 rows)

Generally the last form is slightly faster than the second one, but that's reasonable assuming extra computation overhead to calculate the functions in the WHERE clause.

My question is: what makes the first of those queries so much slower than the other two, and more than two times slower than a full table scan? The only thing I can think of that distinguishes the cases is the lack of a fractional component in the timestamps for the two fast variants, but I can't imagine _why_ this would blow up the planner like this.

-- 
Zygo Blaxell (Laptop) <zblaxell@feedme.hungrycats.org>
GPG = D13D 6651 F446 9787 600B AD1E CCF3 6F93 2823 44AD
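(A hedged sketch of the "constants supplied by the application" approach described above, with the bounds passed as parameters instead of spelled inline; the PREPARE form and the statement name are illustrative assumptions, not part of the original post:

    -- the parameters are declared with the same type as t, so the
    -- comparison needs no cast and the primary-key index qualifies
    prepare temp_buckets(timestamp, timestamp) as
    select time_bucket(t, 32) as bucket,
           avg(t1), avg(t2), avg(t3), avg(t4),
           avg(t5), avg(t6), avg(t7), avg(t8)
    from temp
    where t between $1 and $2
    group by 1
    order by 1 desc;

    execute temp_buckets('2004-06-14 22:41:46', '2004-06-15 22:41:46');

The application recomputes the two bound values once per graph refresh, which is what the literal-date variant above does by hand.)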
rvigmbdc@umail.furryterror.org (Zygo Blaxell) writes:
>  Column |              Type              | Modifiers | Description
> --------+--------------------------------+-----------+-------------
>  t      | timestamp(0) without time zone | not null  |
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> ilt=# explain analyze select time_bucket(t, 32), avg(t1), avg(t2), avg(t3), avg(t4), avg(t5), avg(t6),
> avg(t7), avg(t8) from temp where t between now() - interval '1 day' and now()
> group by time_bucket(t, 32) order by time_bucket(t, 32) desc;

>          ->  Seq Scan on "temp"  (cost=0.00..207797.15 rows=30620 width=72) (actual time=35275.231..38323.524 rows=85360 loops=1)
>                Filter: (((t)::timestamp with time zone >= (now() - '1 day'::interval)) AND ((t)::timestamp with time zone <= now()))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is ye olde bog-standard "can't index a cross-datatype comparison" problem. Personally I think you probably want your t column to be timestamp with tz --- it's highly annoying that the SQL spec says unadorned "timestamp" must mean "timestamp without time zone". But if you really want it without tz, cast the result of now() to timestamp without tz, or use LOCALTIMESTAMP instead of now().

			regards, tom lane
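(A minimal sketch of the fix Tom suggests, using LOCALTIMESTAMP so that both sides of the comparison are timestamp without time zone; the query is assembled from pieces of the thread rather than quoted from it:

    -- localtimestamp has the same type as t, so the planner can use
    -- temp_pkey directly instead of casting t on every row
    select time_bucket(t, 32) as bucket,
           avg(t1), avg(t2), avg(t3), avg(t4),
           avg(t5), avg(t6), avg(t7), avg(t8)
    from temp
    where t between localtimestamp - interval '1 day' and localtimestamp
    group by 1
    order by 1 desc;

The equivalent with now(), per Tom's other suggestion, casts the bounds instead: where t between (now() - interval '1 day')::timestamp and now()::timestamp.)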