Thread: Optimization with dates

Optimization with dates

From
Jean-Christophe Boggio
Date:
Hello,

I really have a problem dealing with dates:

I have a big table (~10M rows) like:

create table gains (
       dategain    timestamp,
       otherfields
);

There's an index on dategain and I want to use it to get the last
30 days' records:

explain select count(*) from gains where dategain>=now()-30;

Aggregate  (cost=256546.78..256546.78 rows=1 width=0)
  ->  Seq Scan on gains  (cost=0.00..250627.68 rows=2367640 width=0)

whereas:

explain select count(*) from gains where dategain>='now';

Aggregate  (cost=27338.47..27338.47 rows=1 width=0)
  ->  Index Scan using ix_gains_dategain on gains  (cost=0.00..27320.71 rows=7103 width=0)
 

I have tried:
  where dategain>=(select now()-30);
  where dategain+30>='now'
  where date(dategain)>=date('now')-30
  and many other, syntactically absurd :-)
 

Can anyone help me use this index?

TIA, 
--
Jean-Christophe Boggio                       
cat@thefreecat.org                           -o)
Independent Consultant and Developer         /\\
Delphi, Linux, Perl, PostgreSQL, Debian     _\_V



Re: Optimization with dates

From
"Josh Berkus"
Date:
Jean-Christophe,

> Aggregate  (cost=256546.78..256546.78 rows=1 width=0)
>   ->  Seq Scan on gains  (cost=0.00..250627.68 rows=2367640 width=0)
> 
> whereas :

Hmmm... if the number of rows is actually accurate (2M out of 10M in the
last 30 days) then a Seq Scan seems like a good plan to me.  If the
numbers aren't accurate, it's time to run a VACUUM ANALYZE.

Regardless, if you're actually querying for 2M recordsets, you'd better
throw some hardware at the problem, and learn the .conf parameters.
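
For example (illustrative values only; the right numbers depend on your
RAM and workload), the postgresql.conf knobs I'd look at first are:

    shared_buffers = 8192      # buffer cache, in 8kB pages (64MB here)
    sort_mem = 8192            # per-sort working memory, in kB
    random_page_cost = 2       # try lowering it if index order matches disk order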

-Josh

______AGLIO DATABASE SOLUTIONS___________________________
                                       Josh Berkus
  Complete information technology      josh@agliodbs.com
   and data management solutions       (415) 565-7293
  for law firms, small businesses        fax 621-2533
    and non-profit organizations.      San Francisco
 


Re: Optimization with dates

From
Stephan Szabo
Date:
On Wed, 14 Nov 2001, Jean-Christophe Boggio wrote:

> Hello,
>
> I really have a problem dealing with dates :
>
> I have a big table (~10M rows) like :
>
> create table gains (
>        dategain    timestamp,
>        otherfields
> );
>
> There's an index on dategain and I want to use it to get the last
> 30 days records :
>
> explain select count(*) from gains where dategain>=now()-30;
>
> Aggregate  (cost=256546.78..256546.78 rows=1 width=0)
>   ->  Seq Scan on gains  (cost=0.00..250627.68 rows=2367640 width=0)
>
> whereas :
>
> explain select count(*) from gains where dategain>='now';
>
> Aggregate  (cost=27338.47..27338.47 rows=1 width=0)
>   ->  Index Scan using ix_gains_dategain on gains  (cost=0.00..27320.71 rows=7103 width=0)
>
> I have tried :
>   where dategain>=(select now()-30);
>   where dategain+30>='now'
>   where date(dategain)>=date('now')-30
>   and many other, syntactically absurd :-)
>
> Can anyone help me use this index?

Is 2367640 a reasonable estimate for the number of
rows that match the condition?  Have you run vacuum
analyze?
If the estimate is right, you'll probably find that
the sequential scan is actually faster than an index
scan since about 1/4 of the table is being selected.
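
(That is, something along the lines of:

    VACUUM ANALYZE gains;

so that the planner's row estimates reflect the current data.)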




Re: Optimization with dates

From
Jason Earl
Date:
Yikes!  Good catch.  My example database returns ~0.6M rows for the
last 30 days, and an index scan still turns out to be a win (of course,
it turned out to be a bigger win to have a separate table with
15-minute summarizations of the data :).
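
For what it's worth, that summary table is nothing fancy; a rough
sketch, with hypothetical names:

    create table caseweights1_summary (
        period_start  timestamp primary key,  -- start of 15-minute bucket
        row_count     integer,
        total_weight  numeric
    );

It gets filled in periodically by a job that aggregates each completed
bucket.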

Josh is right, chances are good that a sequential scan will actually
perform better for you than an index scan if you are returning 20% of
your entire table.

Jason

"Josh Berkus" <josh@agliodbs.com> writes:

> Jean-Christophe,
> 
> > Aggregate  (cost=256546.78..256546.78 rows=1 width=0)
> >   ->  Seq Scan on gains  (cost=0.00..250627.68 rows=2367640 width=0)
> > 
> > whereas :
> 
> Hmmm... if the number of rows is actually accurate (2M out of 10M in the
> last 30 days) then a Seq Scan seems like a good plan to me.  If the
> numbers aren't accurate, it's time to run a VACUUM ANALYZE.
> 
> Regardless, if you're actually querying for 2M recordsets, you'd better
> throw some hardware at the problem, and learn the .conf parameters.
> 
> -Josh
> 
> ______AGLIO DATABASE SOLUTIONS___________________________
>                                        Josh Berkus
>   Complete information technology      josh@agliodbs.com
>    and data management solutions       (415) 565-7293
>   for law firms, small businesses        fax 621-2533
>     and non-profit organizations.      San Francisco
> 


Re: Optimization with dates

From
Jason Earl
Date:
I have a similar table (~16M rows) with an indexed timestamp, and have
had similar problems.  I have found that even when I am using a
constant timestamp, as in this query:

SELECT * FROM caseweights1 WHERE dt > '2001-11-01';

I start getting sequential scans with 7.1.3 long before they are
faster than index based queries.  I believe that there is a constant
that can be fiddled to modify this behavior, and it seems like I have
also read that this constant has been modified in the new 7.2 release.
However, for queries that you *know* will be faster using the index
you can always issue:

set enable_seqscan to off;

before running your query.  This will force PostgreSQL to use the
index even in queries like this one:

SELECT * FROM caseweights1 WHERE dt > (SELECT now() - '30 days'::interval);
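
You can turn enable_seqscan back on afterwards so that the planner
behaves normally for everything else, e.g.:

    set enable_seqscan to off;
    SELECT * FROM caseweights1 WHERE dt > (SELECT now() - '30 days'::interval);
    set enable_seqscan to on;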

Jason

Jean-Christophe Boggio <cat@thefreecat.org> writes:

> Hello,
> 
> I really have a problem dealing with dates :
> 
> I have a big table (~10M rows) like :
> 
> create table gains (
>        dategain    timestamp,
>        otherfields
> );
> 
> There's an index on dategain and I want to use it to get the last
> 30 days records :
> 
> explain select count(*) from gains where dategain>=now()-30;
> 
> Aggregate  (cost=256546.78..256546.78 rows=1 width=0)
>   ->  Seq Scan on gains  (cost=0.00..250627.68 rows=2367640 width=0)
> 
> whereas :
>   
> explain select count(*) from gains where dategain>='now';
> 
> Aggregate  (cost=27338.47..27338.47 rows=1 width=0)
>   ->  Index Scan using ix_gains_dategain on gains  (cost=0.00..27320.71 rows=7103 width=0)
> 
> I have tried :
>   where dategain>=(select now()-30);
>   where dategain+30>='now'
>   where date(dategain)>=date('now')-30
>   and many other, syntactically absurd :-)
> 
> Can anyone help me use this index?
> 
> TIA,
>   
> --
> Jean-Christophe Boggio                       
> cat@thefreecat.org                           -o)
> Independent Consultant and Developer         /\\
> Delphi, Linux, Perl, PostgreSQL, Debian     _\_V
> 
> 


Re: Optimization with dates

From
Tom Lane
Date:
Jason Earl <jason.earl@simplot.com> writes:
> I have a similar table (~16M rows) with an indexed timestamp, and have
> had similar problems.  I have found that even when I am using a
> constant timestamp like in this query.

> SELECT * FROM caseweights1 WHERE dt > '2001-11-01';

> I start getting sequential scans with 7.1.3 long before they are
> faster than index based queries.

Just out of curiosity, do the indexed timestamps correlate closely to
the physical order of the table?  I'd expect that to happen if you
are timestamping records by insertion time and there are few or no
updates.

7.1 and before assume that the index order is random with respect to
the physical order, which is a good conservative assumption ... but it
results in drastically overestimating the cost of an indexscan when
strong correlation exists.  7.2 keeps some statistics about ordering
correlation, and may perhaps do better with this sort of situation.
(I have no doubt that its estimates will need further tweaking, but
at least the effect is somewhat accounted for now.)
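
(In 7.2 you can also see the planner's notion of that correlation in
the new pg_stats view; assuming the view keeps its current layout,
something like

    SELECT attname, correlation FROM pg_stats WHERE tablename = 'caseweights1';

should show a value near 1.0 for a well-clustered timestamp column.)
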
        regards, tom lane


Re: Optimization with dates

From
Tom Lane
Date:
Jean-Christophe Boggio <cat@thefreecat.org> writes:
> I have tried :
>   where dategain>=(select now()-30);
>   where dategain+30>='now'
>   where date(dategain)>=date('now')-30
>   and many other, syntactically absurd :-)

Try

        dategain >= ago(30)

where "ago" is a function that computes "date(now()) - n" and is
marked "iscachable".  This is a cheat but it keeps the planner from
being distracted by the noncachable nature of "now()".  You can find
past discussions of this if you search the archives for "iscachable",
eg
http://fts.postgresql.org/db/mw/msg.html?mid=1037521
http://fts.postgresql.org/db/mw/msg.html?mid=60584
http://fts.postgresql.org/db/mw/msg.html?mid=97823
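
A minimal sketch of such a function (untested; adjust the argument and
return types to taste):

    create function ago(integer) returns date as '
        select date(now()) - $1
    ' language 'sql' with (iscachable);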

As several other people pointed out, there's also a question of whether
the system *should* use the index --- you haven't told us how many
rows you expect the query to visit.  But until you express the WHERE
condition in the form "column >= something-that-can-be-reduced-to-
a-constant", you're unlikely to get the system to even try.
        regards, tom lane


Re: Optimization with dates

From
Jean-Christophe Boggio
Date:
Hello,

Thanks for all the answers. A little feedback:

>> I have tried :
>>   where dategain>=(select now()-30);
>>   and many other, syntactically absurd :-)

TL>         dategain >= ago(30)
TL> where "ago" is a function that computes "date(now()) - n" and is
TL> marked "iscachable".

create function ago(interval) returns timestamp as '
    select now() - $1
' language 'sql' with (iscachable);

explain select count(*) from gains where dategain>=ago('30 0:00');

Aggregate  (cost=180640.90..180640.90 rows=1 width=0)
  ->  Seq Scan on gains  (cost=0.00..179761.71 rows=351676 width=0)

===== 
explain select count(*) from gains where dategain>=ago('5 days');

Aggregate  (cost=172340.65..172340.65 rows=1 width=0)
  ->  Index Scan using ix_gains_dategain on gains  (cost=0.00..172202.94 rows=55084 width=0)
 

===== 
explain select count(*) from gains where dategain>=ago('6 days');

Aggregate  (cost=179929.06..179929.06 rows=1 width=0)
  ->  Seq Scan on gains  (cost=0.00..179761.71 rows=66940 width=0)


TL> Just out of curiosity, do the indexed timestamps correlate closely to
TL> the physical order of the table?  I'd expect that to happen if you
TL> are timestamping records by insertion time and there are few or no
TL> updates.

That's right, there are very few updates. 
===========================================
Now, for Jason's idea:

set enable_seqscan to off;
SET VARIABLE

explain select count(*) from gains where dategain>=now()-30;

Aggregate  (cost=100256770.86..100256770.86 rows=1 width=0)
  ->  Seq Scan on gains  (cost=100000000.00..100250847.08 rows=2369512 width=0)
 

Strange, isn't it?

Is it possible to do the equivalent of "set enable_seqscan to off"
outside of psql (in PHP or Perl code)?

===========================================

To answer Stephan and Josh:

SS> Is 2367640 a reasonable estimate for the number of
SS> rows that match the condition?

JB> Hmmm... if the number of rows is actually accurate (2M out of 10M in the
JB> last 30 days) then a Seq Scan seems like a good plan to me.  If the
JB> numbers aren't accurate, it's time to run a VACUUM ANALYZE.

select avg(cnt) from (select count(*) as cnt from gains group by date(dategain)) as foo;

      avg
------------------
 12009.6131756757

If I did it right, this should be the average number of rows per day.
The db has existed since April 1st, 2000.

select date('now')-date('2000-04-01');
 592

select 592*12009;
 7109328

select count(*) from gains;
  count
---------
 7109753

As you see, dategain is *quite* linear!

So to answer your question, a reasonable estimate for the number of
rows that match the condition is:

select 30*12009;
 360270

The real answer is:

select count(*) from gains where dategain>=now()-30;
 231781

SS> Have you run vacuum analyze?

Every night (and it's a VEERRYYYY long process, even when dropping the
indexes beforehand and recreating them afterwards; maybe that's the
real problem?). Keeping the indexes makes the VACUUM take several
hours. We'll try 7.2, which should solve part of this problem, but
since these are production systems, we'll wait for some feedback from
7.2 users first.
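
The nightly job is essentially (give or take the exact index
definition):

    drop index ix_gains_dategain;
    vacuum analyze gains;
    create index ix_gains_dategain on gains (dategain);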

SS> If the estimate is right, you'll probably find that
SS> the sequential scan is actually faster than an index
SS> scan since about 1/4 of the table is being selected.

It should select about 30/592 of the table (roughly 1/20), not 1/4!



Any further advice VERY appreciated. Thanks again everyone for your
help.

--
Jean-Christophe Boggio                       
cat@thefreecat.org                           -o)
Independent Consultant and Developer         /\\
Delphi, Linux, Perl, PostgreSQL, Debian     _\_V



ago()

From
Haller Christoph
Date:
Hi Tom, 
I've found this in a recent mail of yours 
>     dategain >= ago(30)
It looks very useful for some of my applications. 
Since when has this function been available?
Regards, Christoph 


Re: Optimization with dates

From
Jason Earl
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Jason Earl <jason.earl@simplot.com> writes:
> > I have a similar table (~16M rows) with an indexed timestamp, and have
> > had similar problems.  I have found that even when I am using a
> > constant timestamp like in this query.
> 
> > SELECT * FROM caseweights1 WHERE dt > '2001-11-01';
> 
> > I start getting sequential scans with 7.1.3 long before they are
> > faster than index based queries.
> 
> Just out of curiosity, do the indexed timestamps correlate closely
> to the physical order of the table?  I'd expect that to happen if
> you are timestamping records by insertion time and there are few or
> no updates.

That's it precisely.  Records in this table are inserted only, and
never updated.  The records are timestamped when inserted.

> 7.1 and before assume that the index order is random with respect to
> the physical order, which is a good conservative assumption ... but
> it results in drastically overestimating the cost of an indexscan
> when strong correlation exists.  7.2 keeps some statistics about
> ordering correlation, and may perhaps do better with this sort of
> situation.  (I have no doubt that its estimates will need further
> tweaking, but at least the effect is somewhat accounted for now.)
> 
>             regards, tom lane

I feel bad that I don't have 7.2b in testing yet.  I have been waiting
for a Debian package :).  I am quite excited about several new
features.  I read HACKERS pretty religiously and so I was aware that
you had added some new statistics and that you have fiddled a bit with
the constant that decides when a sequential scan is a good idea.  But
the fact of the matter is that I have already learned how to make
PostgreSQL do "the right thing" when it comes to using these indexes.
Besides, I generally only select from this table in 15 minute
intervals (~200 records) and PostgreSQL has no problems doing the
right thing by default.

The feature that I am most excited about is the new vacuum.  Currently
I can only vacuum this table (and several similar tables) once every
two weeks while the factory is down for maintenance.  This isn't a
very big deal as there are no updates or deletions from these tables
and the statistics in a ~16 million row table don't change much in two
weeks, but the new vacuum should do the correct thing for my tables
without any special handling on my part.

Thanks again,

Jason


Re: Optimization with dates

From
Jean-Christophe Boggio
Date:
So, no one can help?

Should I stop defining indexes on date fields?

I.e., should DATE types be considered broken, with no short-term
solution, and be replaced with strings?

--
Jean-Christophe Boggio                       
cat@thefreecat.org                           -o)
Independent Consultant and Developer         /\\
Delphi, Linux, Perl, PostgreSQL, Debian     _\_V





Re: Optimization with dates

From
caldodge@fpcc.net (Calvin Dodge)
Date:
cat@thefreecat.org (Jean-Christophe Boggio) wrote in message news:<72319017913.20011114005656@thefreecat.org>...
> I have tried :
>   where dategain>=(select now()-30);
>   where dategain+30>='now'
>   where date(dategain)>=date('now')-30
>   and many other, syntactically absurd :-)
>
> Can anyone help me use this index?

From my brief experiments, it _looks_ like the optimizer uses an index
scan on dates only when those dates are constants (in my case, anyway;
as always, YMMV).

I tried functions, variables, and using both upper and lower limits.

The only time index scanning was used (according to "explain") was
when I used constant values for the dates.

So ... do you _have_ to do your querying in psql?  Or will your
application permit you to create queries with embedded constant date
values in the Planguage (PHP, Python, Perl) of your choice?
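
For example (a sketch; the constant would be computed by the
application and spliced into the query string before it is sent):

    select count(*) from gains where dategain >= '2001-10-15';

That constant-date form was the one that used the index scan in my
tests.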


Calvin