Thread: Timestamp-based indexing

Timestamp-based indexing

From: "Harmon S. Nine"

Hello --

To increase query (i.e. select) performance, we're trying to get
postgres to use an index based on a timestamp column in a given table.

Event-based data is put into this table several times a minute, with the
timestamp indicating when a particular row was placed in the table.

The table is purged daily, retaining only the rows that are less than 7
days old.  That is, any row within the table is less than 1 week old (+
1 day, since the purge is daily).

A typical number of rows in the table is around 400,000.

A "VACUUM FULL ANALYZE"  is performed every 3 hours.


The problem:
We often query the table to extract those rows that are, say, 10 minutes
old or less.

Given there are 10080 minutes per week, the planner could, properly
configured, estimate the number of rows returned by such a query to be:

10 min / 10080 min * 400,000 ≈ 0.001 * 400,000 = 400.

That would make an index scan, using the timestamp index, far faster
than a sequential scan.


However, we can't get the planner to do a timestamp-based index scan.

Anyone know what to do?


Here's the table specs:

monitor=# \d "eventtable"
                                        Table "public.eventtable"
  Column   |            Type             |                           Modifiers
-----------+-----------------------------+----------------------------------------------------------------
 timestamp | timestamp without time zone | not null default ('now'::text)::timestamp(6) with time zone
 key       | bigint                      | not null default nextval('public."eventtable_key_seq"'::text)
 propagate | boolean                     |
 facility  | character(10)               |
 priority  | character(10)               |
 host      | character varying(128)      | not null
 message   | text                        | not null
Indexes:
    "eventtable_pkey" primary key, btree ("timestamp", "key")
    "eventtable_host" btree (host)
    "eventtable_timestamp" btree ("timestamp")


Here's a query (with "explain analyze"):

monitor=# explain analyze select * from "eventtable" where timestamp >
CURRENT_TIMESTAMP - INTERVAL '10 minutes';
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on "eventtable"  (cost=0.00..19009.97 rows=136444 width=155) (actual time=11071.073..11432.522 rows=821 loops=1)
   Filter: (("timestamp")::timestamp with time zone > (('now'::text)::timestamp(6) with time zone - '@ 10 mins'::interval))
 Total runtime: 11433.384 ms
(3 rows)


Here's something strange.  We try to disable sequential scans, but to no
avail.  The estimated cost skyrockets, though:

monitor=# set enable_seqscan = false;
SET
monitor=# explain analyze select * from "eventtable" where timestamp >
CURRENT_TIMESTAMP - INTERVAL '10 minutes';
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on "eventtable"  (cost=100000000.00..100019009.97 rows=136444 width=155) (actual time=9909.847..9932.438 rows=1763 loops=1)
   Filter: (("timestamp")::timestamp with time zone > (('now'::text)::timestamp(6) with time zone - '@ 10 mins'::interval))
 Total runtime: 9934.353 ms
(3 rows)

monitor=# set enable_seqscan = true;
SET
monitor=#



Any help is greatly appreciated :)

-- Harmon




Re: Timestamp-based indexing

From: Kevin Barnard

Harmon S. Nine wrote:

> monitor=# explain analyze select * from "eventtable" where timestamp >
> CURRENT_TIMESTAMP - INTERVAL '10 minutes';
>                                                         QUERY PLAN

Try

SELECT * FROM eventtable WHERE timestamp BETWEEN (CURRENT_TIMESTAMP -
INTERVAL '10 minutes') AND CURRENT_TIMESTAMP;

This should use a range of valid times.  What your query is doing is
looking for 10 minutes ago to an infinite future.  Statistically
speaking, that should encompass most of the table, because you have an
infinite range.  No index will be used.  If you give the planner a
bounded range, it can figure out what you are looking for.

--
Kevin Barnard
Speed Fulfillment and Call Center
kbarnard@speedfc.com
214-258-0120


Re: Timestamp-based indexing

From: "Matthew T. O'Connor"

VACUUM FULL ANALYZE every 3 hours seems a little severe.  You will
probably be served just as well by VACUUM ANALYZE, and you probably
don't need the VACUUM part most of the time.  You might try doing an
ANALYZE on just the specific tables you are having issues with.  Since
ANALYZE should be much quicker and not have the performance impact of a
VACUUM, you could do it every hour, or even every 15 minutes.
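
For example, something as simple as this, run from cron or any
scheduler, should do (a sketch; the 15-minute interval is just a
starting point):

-- refresh the statistics for just this table; much cheaper than a
-- VACUUM, so it can run every 15 minutes or so
ANALYZE "eventtable";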

Good luck...



Re: Timestamp-based indexing

From: "Harmon S. Nine"

Thank you for your response :)

This improves the row estimation, but it is still using a sequential scan.

It really seems like the query would go faster if an index scan was
used, given the number of rows fetched (both estimated and actual) is
significantly less than the number of rows in the table.

Is there some way to get the planner to use the timestamp index on
these queries?


monitor=# explain analyze select * from "eventtable" where timestamp
between (CURRENT_TIMESTAMP - INTERVAL '10 min') AND CURRENT_TIMESTAMP;

                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on "eventtable"  (cost=0.00..23103.29 rows=2047 width=155) (actual time=10227.253..10276.944 rows=1662 loops=1)
   Filter: ((("timestamp")::timestamp with time zone >= (('now'::text)::timestamp(6) with time zone - '@ 10 mins'::interval)) AND (("timestamp")::timestamp with time zone <= ('now'::text)::timestamp(6) with time zone))
 Total runtime: 10278.628 ms
(3 rows)


monitor=# SELECT COUNT(*) FROM "eventtable";
 count
--------
 425602
(1 row)

monitor=#


-- Harmon




Re: Timestamp-based indexing

From: "Harmon S. Nine"

We were getting a little desperate, so we engaged in overkill to rule
out lack-of-analyze as a cause for the slow queries.

Thanks for your advice :)

-- Harmon



Re: Timestamp-based indexing

From: Tom Lane

"Matthew T. O'Connor" <matthew@zeut.net> writes:
> VACUUM FULL ANALYZE every 3 hours seems a little severe.

If rows are only deleted once a day, that's a complete waste of time,
indeed.

I'd suggest running a plain VACUUM just after the deletion pass is done.
ANALYZEs are a different matter and possibly need to be done every
few hours, seeing that your maximum timestamp value is constantly
changing.
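
Concretely, the nightly job could look something like this (a sketch;
the 7-day cutoff is from the original post):

-- hypothetical daily purge: drop week-old rows, then reclaim the dead
-- space with a plain (non-FULL) vacuum
DELETE FROM "eventtable"
    WHERE "timestamp" < CURRENT_TIMESTAMP - INTERVAL '7 days';
VACUUM "eventtable";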

>> monitor=# set enable_seqscan = false;
>> SET
>> monitor=# explain analyze select * from "eventtable" where timestamp >
>> CURRENT_TIMESTAMP - INTERVAL '10 minutes';
>> QUERY PLAN
>> -------------------------------------------------------------------------------------------------------------------------------------
>> Seq Scan on "eventtable"  (cost=100000000.00..100019009.97 rows=136444 width=155) (actual time=9909.847..9932.438 rows=1763 loops=1)
>> Filter: (("timestamp")::timestamp with time zone > (('now'::text)::timestamp(6) with time zone - '@ 10 mins'::interval))
>> Total runtime: 9934.353 ms

You've got some datatype confusion, too.  CURRENT_TIMESTAMP yields
timestamp with time zone, and since you made the timestamp column
timestamp without time zone, you've got a cross-type comparison which is
not indexable (at least not in 7.4).  My opinion is that you chose the
wrong type for the column.  Values that represent specific real-world
instants should always be timestamp with time zone, so that they mean
the same thing if you look at them in a different time zone.

Another issue here is that because CURRENT_TIMESTAMP - INTERVAL '10
minutes' isn't a constant, the planner isn't able to make use of the
statistics gathered by ANALYZE anyway.  That's why the rowcount estimate
has nothing to do with reality.  Unless you force the decision with
"set enable_seqscan", the planner will never pick an indexscan with this
rowcount estimate.  The standard advice for getting around this is to
hide the nonconstant calculation inside a function that's deliberately
mislabeled immutable.  For example,

create function ago(interval) returns timestamp with time zone as
'select now() - $1' language sql strict immutable;

select * from "eventtable" where timestamp > ago('10 minutes');

The planner folds the "ago('10 minutes')" to a constant, checks the
statistics, and should do the right thing.  Note however that this
technique may break if you put a call to ago() inside a function
or prepared statement --- it's only safe in interactive queries,
where you don't care that the value is reduced to a constant during
planning instead of during execution.

            regards, tom lane

Re: Timestamp-based indexing

From: Stephan Szabo

On Mon, 26 Jul 2004, Harmon S. Nine wrote:

> However, we can't get the planner to do a timestamp-based index scan.
>
> Anyone know what to do?

I'd wonder if the type conversion is causing you problems.
CURRENT_TIMESTAMP - INTERVAL '10 minutes' is a timestamp with time zone
while the column is timestamp without time zone.  Casting
CURRENT_TIMESTAMP to timestamp without time zone seemed to make it able to
choose an index scan on 7.4.
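
That is, something along these lines (a sketch of the cast):

-- casting CURRENT_TIMESTAMP to the column's type keeps the comparison
-- single-type, so 7.4 can consider the "eventtable_timestamp" index
SELECT * FROM "eventtable"
WHERE "timestamp" > CURRENT_TIMESTAMP::timestamp without time zone
                    - INTERVAL '10 minutes';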


Re: Timestamp-based indexing

From: "Harmon S. Nine"

THAT WAS IT!!

Thank you very much.
Is there a way to change the type of "CURRENT_TIMESTAMP" to "timestamp
without time zone" so that casting isn't needed?


BTW, isn't this a bug?

-- Harmon



Re: Timestamp-based indexing

From: Litao Wu

Hi,

How about changing:

CURRENT_TIMESTAMP - INTERVAL '10 minutes'
to
'now'::timestamptz - INTERVAL '10 minutes'

It seems to me that Postgres will treat it as
a constant.

Thanks,


Re: Timestamp-based indexing

From: Tom Lane

Litao Wu <litaowu@yahoo.com> writes:
> How about changing:

> CURRENT_TIMESTAMP - INTERVAL '10 minutes'
> to
> 'now'::timestamptz - INTERVAL '10 minutes'

> It seems to me that Postgres will treat it as
> a constant.

Yeah, that works too, though again it might burn you if used inside a
function or prepared statement.  What you're doing here is to push the
freezing of the "now" value even further upstream, namely to initial
parsing of the command.
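
For instance, here's a sketch of how that can burn you inside a
prepared statement (the statement name is made up):

-- 'now' is reduced to a constant when PREPARE parses the statement...
PREPARE recent_events AS
    SELECT * FROM "eventtable"
    WHERE "timestamp" > 'now'::timestamptz - INTERVAL '10 minutes';

-- ...so every later execution reuses that original cutoff; the
-- ten-minute window never moves forward
EXECUTE recent_events;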

            regards, tom lane

Re: Timestamp-based indexing

From: Christopher Kings-Lynne

>>It seems to me that Postgres will treat it as
>>a constant.
>
>
> Yeah, that works too, though again it might burn you if used inside a
> function or prepared statement.  What you're doing here is to push the
> freezing of the "now" value even further upstream, namely to initial
> parsing of the command.

What I do in my apps to get postgres to use the timestamp indexes in
some situations is to generate the current timestamp in ISO format in
the application, and insert it into the query as a constant for that
run of the query.
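
For instance, the query that actually gets sent ends up looking like
this (the literal is a made-up example value):

-- the application computed "now minus 10 minutes" itself; a plain
-- literal lets the planner check it against the ANALYZE statistics
SELECT * FROM "eventtable"
WHERE "timestamp" > '2004-07-26 10:50:00';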

Chris


Re: Timestamp-based indexing

From: Josh Berkus

Harmon,

> A "VACUUM FULL ANALYZE"  is performed every 3 hours.

The FULL part should not be necessary if you've set your max_fsm_pages high
enough.
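
That's a postgresql.conf setting; the value below is only a placeholder,
and it needs to be at least the number of pages freed between your
regular vacuums:

# in postgresql.conf -- placeholder value, tune to your churn rate
max_fsm_pages = 50000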

> Given there are 10080 minutes per week, the planner could, properly
> configured, estimate the number of rows returned by such a query to be:
>
> 10 min / 10080 min * 400,000 ≈ 0.001 * 400,000 = 400.

The planner doesn't work that way.

> monitor=# explain analyze select * from "eventtable" where timestamp >
> CURRENT_TIMESTAMP - INTERVAL '10 minutes';

Hmmm.  What version of PostgreSQL are you running?  I seem to remember an
issue in one version with selecting comparisons against now().  What
happens when you supply a constant instead of ( current_timestamp -
interval '10 minutes' )?

> Here's something strange.  We try to disable sequential scans, but to no
> avail.  The estimated cost skyrockets, though:

That's how "enable_*=false" works in most cases.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Timestamp-based indexing

From: Tom Lane

Josh Berkus <josh@agliodbs.com> writes:
>> monitor=# explain analyze select * from "eventtable" where timestamp >
>> CURRENT_TIMESTAMP - INTERVAL '10 minutes';

> Hmmm.  What version of PostgreSQL are you running?  I seem to remember an
> issue in one version with selecting comparisons against now().

I'm also wondering about the exact datatype of the "timestamp" column.
If it's timestamp without timezone, then the above is a cross-datatype
comparison (timestamp vs timestamptz) and hence not indexable before
8.0.  This could be fixed easily by using the right current-time
function, viz LOCALTIMESTAMP not CURRENT_TIMESTAMP.  (Consistency has
obviously never been a high priority with the SQL committee :-(.)
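
That is, the minimal fix to the original query is just this (the
earlier caveat about non-constant comparison values and statistics
still applies):

-- LOCALTIMESTAMP yields timestamp without time zone, matching the
-- column, so the comparison is single-type and indexable
SELECT * FROM "eventtable"
WHERE "timestamp" > LOCALTIMESTAMP - INTERVAL '10 minutes';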

Less easily but possibly better in the long run, change the column type
to timestamp with time zone.  IMHO, columns representing definable
real-world time instants should always be timestamptz, because the other
way leaves you open to serious confusion about what the time value
really means.
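
A sketch of that conversion (note: this one-step form of ALTER TABLE
only appears in 8.0; on 7.4 it means adding a new column and copying
the data over):

-- existing values are reinterpreted using the server's TimeZone setting
ALTER TABLE "eventtable"
    ALTER COLUMN "timestamp" TYPE timestamp with time zone;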

            regards, tom lane