Thread: Time intersect query
I'm trying to optimize a query that reports the maximum number of users over a period of time. What I'm trying to accomplish is a set of graphable data points: the maximum number of simultaneous users at a specified interval across a time range, preferably computed with only a single pass through the data.
----------
streamlog table (I've only included relevant fields and indexes):
id bigint
ts timestamp
viewtime integer
client integer
-- primary key on id field
-- streamlog_tsrange_index btree (client, ts, startts(ts, viewtime))
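Spelled out as DDL, the relevant pieces would look roughly like this (reconstructed from the description above, not copied from the live schema):

create table streamlog (
    id       bigint primary key,
    ts       timestamp,
    viewtime integer,
    client   integer
    -- other fields omitted
);
-- functional index; assumes the startts() function shown below already exists
create index streamlog_tsrange_index
    on streamlog (client, ts, startts(ts, viewtime));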
\df+ startts
 Schema           | public
 Name             | startts
 Result data type | timestamp without time zone
 Argument types   | timestamp without time zone, integer
 Volatility       | immutable
 Owner            | root
 Language         | sql
 Source code      | select $1 - ($2::varchar || ' seconds')::interval;
 Description      |
The ts timestamp is the time at which the row was logged.
viewtime is the amount of time the user was online, in seconds.
The startts function determines when the session started by subtracting viewtime from ts and returning a timestamp.
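For reference, the source code in the \df+ output corresponds to roughly this definition:

-- reconstructed from the \df+ output; immutable so it can be used in the index
create or replace function startts(timestamp without time zone, integer)
returns timestamp without time zone as $$
    select $1 - ($2::varchar || ' seconds')::interval;
$$ language sql immutable;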
-----------------
My current procedure...
1) Create a temporary table with each possible data point. This example uses recursive queries from PostgreSQL 8.4, but it was originally implemented with large numbers of queries from PHP. My knowledge of recursive queries is pretty weak, but I was able to put this together without too much trouble.
create temp table timerange as
with recursive f as (
    select '2009-03-21 18:20:00'::timestamp as a
    union all
    select a + '30 seconds'::interval as a
      from f where a < '2009-03-21 20:20:00'::timestamp
)
select a from f;
2) Update table with record counts
alter table timerange add column mycount integer;
explain analyze update timerange set mycount =
    (select count(*) from streamlog
     where client = 3041 and a between startts(ts, viewtime) and ts);
-----------------
This seems to work reasonably well, with the following exceptions...
1) The number reported is the count at each sampled instant, not the highest value between data points. With a 30 second interval this isn't a big problem, but larger intervals give results that don't match what I'm looking for (maximum simultaneous users).
2) This does not scale well to large numbers of points: internally, each data point is a complete scan through the data, even though most of the data is common to the entire range.
I'm thinking this would be a good use for the new window functions, but I'm not sure where to begin. Any ideas?
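The rough shape I've been toying with (completely untested, just to illustrate the idea) is to expand each row into a session-start event (+1) and a session-end event (-1), then keep a running sum over event time:

-- untested sketch: each session becomes a +1 event at startts(ts, viewtime)
-- and a -1 event at ts; a running sum then gives concurrency at any moment
with events as (
    select startts(ts, viewtime) as t, 1 as delta
      from streamlog
     where client = 3041
    union all
    select ts, -1
      from streamlog
     where client = 3041
)
select t,
       sum(delta) over (order by t, delta desc) as concurrent
  from events;
-- the per-interval maximum would then come from bucketing t (e.g. integer
-- division of its epoch by the interval length) and taking max(concurrent),
-- which would keep the whole thing to a single pass over streamlog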
-----------------
On Mar 23, 2009, at 5:44 AM, Brad Murray wrote:

> My current procedure...
> 1) Create a temporary table with each possible data point. This example uses recursive queries from PostgreSQL 8.4, but it was originally implemented with large numbers of queries from PHP. My knowledge of recursive queries is pretty weak, but I was able to put this together without too much trouble.
>
> create temp table timerange as with recursive f as (
>     select '2009-03-21 18:20:00'::timestamp as a
>     union all
>     select a + '30 seconds'::interval as a from f where a < '2009-03-21 20:20:00'::timestamp
> ) select a from f;

I think you can do that more easily with the generate_series function; there's no need to use recursion that way. It's probably also convenient to have the end of the interval available. It would be something like:

SELECT TIMESTAMP WITH TIME ZONE 'epoch' + f.a * INTERVAL '1 second' AS start,
       TIMESTAMP WITH TIME ZONE 'epoch' + f.a * INTERVAL '1 second'
           + INTERVAL '30 seconds' AS "end"  -- quoted: END is a reserved word
  FROM generate_series(
           EXTRACT(EPOCH FROM '2009-03-21 18:20:00'::timestamp)::bigint,
           EXTRACT(EPOCH FROM '2009-03-21 20:20:00'::timestamp)::bigint,
           30
       ) AS f(a);

I get the impression you don't use this just once, so it may be better to keep the results (perhaps with some added columns holding derived values that are easy to join on) instead of creating a temp table. You could also add your mycount column here with some initialisation value (likely 0).

I used something similar to generate a table that contained the start and end dates of weeks, based on week numbers and years. We had another table containing periodic information; left joining the two made it easy to split the period table into one record per week, with either the periodic information or NULL values (meaning no data for that week). I realise weeks per year aren't much data, but neither are your periods, I think (although more). A scheduled script that deletes everything older than, say, a month would keep this table quite manageable (~90k records).

> 2) Update table with record counts
> alter table timerange add column mycount integer;
> explain analyze update timerange set mycount = (select count(*) from streamlog where client = 3041 and a between startts(ts, viewtime) and ts);

With the above you could join against streamlog and group by timerange.start, like so:

SELECT timerange.start,
       COUNT(streamlog.id)  -- count a column, so empty intervals yield 0 rather than 1
  FROM timerange
  LEFT JOIN streamlog ON (streamlog.ts BETWEEN timerange.start AND timerange."end")
 GROUP BY timerange.start;

> This seems to work reasonably well, with the following exceptions...
>
> 1) The number reported is the count at each sampled instant, not the highest value between data points. With a 30 second interval this isn't a big problem, but larger intervals give results that don't match what I'm looking for (maximum simultaneous users).
> 2) This does not scale well to large numbers of points: internally, each data point is a complete scan through the data, even though most of the data is common to the entire range.
>
> I'm thinking this would be a good use for the new window functions, but I'm not sure where to begin. Any ideas?

Well, you'd need something to partition over, and since you don't know where your intervals start and end, I don't see how you could do that without at least first generating your intervals. After that there doesn't seem to be much use for the window functions, as a simple GROUP BY seems to do what you want.
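To make the "keep the table around" idea a bit more concrete, here's a minimal sketch (column names and the retention period are just for illustration):

-- kept (non-temp) interval table; mycount starts at 0 as suggested above
CREATE TABLE timerange (
    start   timestamp PRIMARY KEY,
    "end"   timestamp NOT NULL,           -- quoted: END is a reserved word
    mycount integer   NOT NULL DEFAULT 0
);

-- scheduled cleanup (e.g. from cron) keeping roughly a month of buckets
DELETE FROM timerange WHERE start < now() - interval '1 month';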
Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.