Thread: Optimization with dates
Hello,

I really have a problem dealing with dates. I have a big table (~10M rows)
like:

create table gains (
    dategain timestamp,
    otherfields
);

There's an index on dategain, and I want to use it to get the records from
the last 30 days:

explain select count(*) from gains where dategain >= now() - 30;

Aggregate  (cost=256546.78..256546.78 rows=1 width=0)
  ->  Seq Scan on gains  (cost=0.00..250627.68 rows=2367640 width=0)

whereas:

explain select count(*) from gains where dategain >= 'now';

Aggregate  (cost=27338.47..27338.47 rows=1 width=0)
  ->  Index Scan using ix_gains_dategain on gains  (cost=0.00..27320.71 rows=7103 width=0)

I have tried:

where dategain >= (select now() - 30);
where dategain + 30 >= 'now';
where date(dategain) >= date('now') - 30;

and many others, syntactically absurd :-)

Can anyone help me use this index?

TIA,

--
Jean-Christophe Boggio
cat@thefreecat.org                      -o)
Independent Consultant and Developer    /\\
Delphi, Linux, Perl, PostgreSQL, Debian _\_V
Jean-Christophe,

> Aggregate  (cost=256546.78..256546.78 rows=1 width=0)
>   ->  Seq Scan on gains  (cost=0.00..250627.68 rows=2367640 width=0)
>
> whereas :

Hmmm... if the number of rows is actually accurate (2M out of 10M in the
last 30 days), then a Seq Scan seems like a good plan to me. If the
numbers aren't accurate, it's time to run a VACUUM ANALYZE.

Regardless, if you're actually querying for 2M-row result sets, you'd
better throw some hardware at the problem and learn the .conf parameters.

-Josh

______AGLIO DATABASE SOLUTIONS___________________________
Josh Berkus
Complete information technology      josh@agliodbs.com
and data management solutions        (415) 565-7293
for law firms, small businesses      fax 621-2533
and non-profit organizations.        San Francisco
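For reference, the ".conf parameters" Josh mentions are the planner and
memory settings in postgresql.conf. A minimal sketch with purely
illustrative values (none of these numbers come from this thread):

# postgresql.conf (7.1-era parameter names; tune values to your RAM)
shared_buffers = 8192      # shared buffer cache, in 8KB pages (here 64MB)
sort_mem = 16384           # per-sort working memory, in KB
random_page_cost = 2       # lower values make the planner favor index scans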
On Wed, 14 Nov 2001, Jean-Christophe Boggio wrote:

> Hello,
>
> I really have a problem dealing with dates :
>
> I have a big table (~10M rows) like :
>
> create table gains (
>     dategain timestamp,
>     otherfields
> );
>
> There's an index on dategain and I want to use it to get the last
> 30 days records :
>
> explain select count(*) from gains where dategain>=now()-30;
>
> Aggregate  (cost=256546.78..256546.78 rows=1 width=0)
>   ->  Seq Scan on gains  (cost=0.00..250627.68 rows=2367640 width=0)
>
> whereas :
>
> explain select count(*) from gains where dategain>='now';
>
> Aggregate  (cost=27338.47..27338.47 rows=1 width=0)
>   ->  Index Scan using ix_gains_dategain on gains  (cost=0.00..27320.71 rows=7103 width=0)
>
> I have tried :
> where dategain>=(select now()-30);
> where dategain+30>='now'
> where date(dategain)>=date('now')-30
> and many other, syntactically absurd :-)
>
> Anyone can help me use this index ?

Is 2367640 a reasonable estimate for the number of rows that match the
condition? Have you run vacuum analyze? If the estimate is right, you'll
probably find that the sequential scan is actually faster than an index
scan, since about 1/4 of the table is being selected.
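A quick way to check, sketched here with an illustrative literal date:
refresh the statistics, then compare the planner's estimate against the
real count:

vacuum analyze gains;

explain select count(*) from gains where dategain >= '2001-10-15';
-- compare the "rows=" figure in the plan above with the actual count:
select count(*) from gains where dategain >= '2001-10-15';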
Yikes! Good catch. My example database returns only ~.6M rows for the
last 30 days, and an index scan still turns out to be a win (of course,
it turned out to be an even bigger win to have a separate table with
15-minute summarizations of the data :).

Josh is right: chances are good that a sequential scan will actually
perform better for you than an index scan if you are returning 20% of
your entire table.

Jason

"Josh Berkus" <josh@agliodbs.com> writes:

> Jean-Christophe,
>
> > Aggregate  (cost=256546.78..256546.78 rows=1 width=0)
> >   ->  Seq Scan on gains  (cost=0.00..250627.68 rows=2367640 width=0)
> >
> > whereas :
>
> Hmmm... if the number of rows is actually accurate (2M out of 10M in the
> last 30 days) then a Seq Scan seems like a good plan to me. If the
> numbers aren't accurate, it's time to run a VACUUM ANALYZE.
>
> Regardless, if you're actually querying for 2M recordsets, you'd better
> throw some hardware at the problem, and learn the .conf parameters.
>
> -Josh
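A sketch of the summarization approach Jason describes, at daily rather
than 15-minute granularity for brevity; the table and column names here
are made up for illustration:

-- Hypothetical pre-aggregated table, refreshed nightly (e.g. from cron):
create table gains_daily as
    select date(dategain) as day, count(*) as cnt
    from gains
    group by date(dategain);

create index ix_gains_daily_day on gains_daily (day);

-- The 30-day count now reads ~30 small rows instead of millions:
select sum(cnt) from gains_daily where day >= date(now()) - 30;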
I have a similar table (~16M rows) with an indexed timestamp, and have
had similar problems. I have found that even when I am using a constant
timestamp, as in this query:

SELECT * FROM caseweights1 WHERE dt > '2001-11-01';

I start getting sequential scans with 7.1.3 long before they are faster
than index-based queries. I believe there is a constant that can be
fiddled with to modify this behavior, and it seems like I have also read
that this constant has been modified in the new 7.2 release.

However, for queries that you *know* will be faster using the index, you
can always issue:

set enable_seqscan to off;

before running your query. This will force PostgreSQL to use the index
even in queries like this one:

SELECT * FROM caseweights1 WHERE dt > (SELECT now() - '30 days'::interval);

Jason

Jean-Christophe Boggio <cat@thefreecat.org> writes:

> [original question snipped]
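Note that enable_seqscan stays off for the rest of the session, so it is
worth restoring it once the query is done; a sketch:

set enable_seqscan to off;
SELECT * FROM caseweights1 WHERE dt > (SELECT now() - '30 days'::interval);
set enable_seqscan to on;   -- restore the default so later queries plan normally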
Jason Earl <jason.earl@simplot.com> writes:

> I have a similar table (~16M rows) with an indexed timestamp, and have
> had similar problems. I have found that even when I am using a
> constant timestamp like in this query.
>
> SELECT * FROM caseweights1 WHERE dt > '2001-11-01';
>
> I start getting sequential scans with 7.1.3 long before they are
> faster than index based queries.

Just out of curiosity, do the indexed timestamps correlate closely to
the physical order of the table? I'd expect that to happen if you are
timestamping records by insertion time and there are few or no updates.

7.1 and before assume that the index order is random with respect to the
physical order, which is a good conservative assumption ... but it
results in drastically overestimating the cost of an indexscan when
strong correlation exists. 7.2 keeps some statistics about ordering
correlation, and may perhaps do better with this sort of situation. (I
have no doubt that its estimates will need further tweaking, but at
least the effect is somewhat accounted for now.)

			regards, tom lane
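On a 7.2 beta, the measured correlation can be inspected directly; a
sketch assuming the new pg_stats view, where 1.0 means the index order
matches the physical order exactly:

select tablename, attname, correlation
from pg_stats
where tablename = 'gains' and attname = 'dategain';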
Jean-Christophe Boggio <cat@thefreecat.org> writes:

> I have tried :
> where dategain>=(select now()-30);
> where dategain+30>='now'
> where date(dategain)>=date('now')-30
> and many other, syntactically absurd :-)

Try

    dategain >= ago(30)

where "ago" is a function that computes "date(now()) - n" and is marked
"iscachable". This is a cheat, but it keeps the planner from being
distracted by the noncachable nature of "now()". You can find past
discussions of this if you search the archives for "iscachable", e.g.:

http://fts.postgresql.org/db/mw/msg.html?mid=1037521
http://fts.postgresql.org/db/mw/msg.html?mid=60584
http://fts.postgresql.org/db/mw/msg.html?mid=97823

As several other people pointed out, there's also a question of whether
the system *should* use the index --- you haven't told us how many rows
you expect the query to visit. But until you express the WHERE condition
in the form "column >= something-that-can-be-reduced-to-a-constant",
you're unlikely to get the system to even try.

			regards, tom lane
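A minimal sketch of the function Tom describes (Jean-Christophe's actual
variant, built on intervals, appears in his follow-up below):

create function ago(integer) returns date as
    'select date(now()) - $1'
language 'sql' with (iscachable);

explain select count(*) from gains where dategain >= ago(30);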
Hello,

Thanks for all the answers. A little feedback:

>> I have tried :
>> where dategain>=(select now()-30);
>> and many other, syntactically absurd :-)

TL> dategain >= ago(30)
TL> where "ago" is a function that computes "date(now()) - n" and is
TL> marked "iscachable".

create function ago(interval) returns timestamp as '
    select now() - $1
' language 'sql' with (iscachable);

explain select count(*) from gains where dategain >= ago('30 0:00');

Aggregate  (cost=180640.90..180640.90 rows=1 width=0)
  ->  Seq Scan on gains  (cost=0.00..179761.71 rows=351676 width=0)

=====

explain select count(*) from gains where dategain >= ago('5 days');

Aggregate  (cost=172340.65..172340.65 rows=1 width=0)
  ->  Index Scan using ix_gains_dategain on gains  (cost=0.00..172202.94 rows=55084 width=0)

=====

explain select count(*) from gains where dategain >= ago('6 days');

Aggregate  (cost=179929.06..179929.06 rows=1 width=0)
  ->  Seq Scan on gains  (cost=0.00..179761.71 rows=66940 width=0)

TL> Just out of curiosity, do the indexed timestamps correlate closely to
TL> the physical order of the table? I'd expect that to happen if you
TL> are timestamping records by insertion time and there are few or no
TL> updates.

That's right, there are very few updates.

===========================================

Now, for Jason's idea:

set enable_seqscan to off;
SET VARIABLE

explain select count(*) from gains where dategain >= now() - 30;

Aggregate  (cost=100256770.86..100256770.86 rows=1 width=0)
  ->  Seq Scan on gains  (cost=100000000.00..100250847.08 rows=2369512 width=0)

Strange, isn't it? Is it possible to do the equivalent of "set
enable_seqscan to off" outside of psql (in PHP or Perl code)?

===========================================

To answer Stephan and Josh:

SS> Is 2367640 a reasonable estimate for the number of
SS> rows that match the condition?

JB> Hmmm... if the number of rows is actually accurate (2M out of 10M in
JB> the last 30 days) then a Seq Scan seems like a good plan to me. If
JB> the numbers aren't accurate, it's time to run a VACUUM ANALYZE.

select avg(cnt) from (select count(*) as cnt from gains group by date(dategain)) as foo;

       avg
------------------
 12009.6131756757

If I did it right, this is the average number of rows per day. The db
has existed since April 1st, 2000:

select date('now') - date('2000-04-01');
 592

select 592 * 12009;
 7109328

select count(*) from gains;
 7109753

As you see, dategain is *quite* linear! So to answer your question, a
reasonable estimate for the number of rows that match the condition is:

select 30 * 12009;
 360270

The real answer is:

select count(*) from gains where dategain >= now() - 30;
 231781

SS> Have you run vacuum analyze?

Every night (and it's a VEERRYYYY long process, even dropping the
indexes before and recreating them afterwards; maybe that's the real
problem?). Keeping the indexes makes the VACUUM take several hours.
We'll try 7.2, which should solve part of this problem, but since these
are production systems we're waiting for some feedback from 7.2 users
first.

SS> If the estimate is right, you'll probably find that
SS> the sequence scan is actually faster than an index
SS> scan since about 1/4 of the table is being selected.

It should select only about 1/20 of the table (30 days out of 592), not
1/4!

Any further advice VERY appreciated. Thanks again, everyone, for your
help.

--
Jean-Christophe Boggio
cat@thefreecat.org                      -o)
Independent Consultant and Developer    /\\
Delphi, Linux, Perl, PostgreSQL, Debian _\_V
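The enable_seqscan result is consistent with Tom's point above: with a
non-constant right-hand side the planner never generates an index path
at all, so disabling seq scans merely inflates the cost of the only plan
available. A sketch (the literal date is illustrative):

-- Expression bound: only a seq scan path exists, hence the 1e8 cost penalty
explain select count(*) from gains where dategain >= now() - 30;

-- Constant bound: an index path is generated and can win
explain select count(*) from gains where dategain >= '2001-10-15';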
Hi Tom,

I found this in a recent mail of yours:

> dategain >= ago(30)

It looks very useful for some of my applications. Since when has this
function been available?

Regards,
Christoph
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Jason Earl <jason.earl@simplot.com> writes:
> > I have a similar table (~16M rows) with an indexed timestamp, and have
> > had similar problems. I have found that even when I am using a
> > constant timestamp like in this query.
> >
> > SELECT * FROM caseweights1 WHERE dt > '2001-11-01';
> >
> > I start getting sequential scans with 7.1.3 long before they are
> > faster than index based queries.
>
> Just out of curiosity, do the indexed timestamps correlate closely
> to the physical order of the table? I'd expect that to happen if
> you are timestamping records by insertion time and there are few or
> no updates.

That's it precisely. Records in this table are inserted only, and never
updated. The records are timestamped when inserted.

> 7.1 and before assume that the index order is random with respect to
> the physical order, which is a good conservative assumption ... but
> it results in drastically overestimating the cost of an indexscan
> when strong correlation exists. 7.2 keeps some statistics about
> ordering correlation, and may perhaps do better with this sort of
> situation. (I have no doubt that its estimates will need further
> tweaking, but at least the effect is somewhat accounted for now.)
>
> 			regards, tom lane

I feel bad that I don't have 7.2b in testing yet. I have been waiting
for a Debian package :). I am quite excited about several new features.
I read HACKERS pretty religiously, and so I was aware that you had added
some new statistics and that you have fiddled a bit with the constant
that decides when a sequential scan is a good idea. But the fact of the
matter is that I have already learned how to make PostgreSQL do "the
right thing" when it comes to using these indexes. Besides, I generally
only select from this table in 15-minute intervals (~200 records), and
PostgreSQL has no problems doing the right thing by default.

The feature that I am most excited about is the new vacuum. Currently I
can only vacuum this table (and several similar tables) once every two
weeks while the factory is down for maintenance. This isn't a very big
deal, as there are no updates or deletions from these tables and the
statistics in a ~16 million row table don't change much in two weeks,
but the new vacuum should do the correct thing for my tables without any
special handling on my part.

Thanks again,
Jason
So, no one can help? Should I stop defining indexes on date fields?
That is, should DATE types be considered broken, with no short-term
solution, and be replaced with strings?

--
Jean-Christophe Boggio
cat@thefreecat.org                      -o)
Independent Consultant and Developer    /\\
Delphi, Linux, Perl, PostgreSQL, Debian _\_V
cat@thefreecat.org (Jean-Christophe Boggio) wrote in message
news:<72319017913.20011114005656@thefreecat.org>...

> I have tried :
> where dategain>=(select now()-30);
> where dategain+30>='now'
> where date(dategain)>=date('now')-30
> and many other, syntactically absurd :-)
>
> Anyone can help me use this index ?

From my brief experiments, it _looks_ like the optimizer uses an index
scan on dates only when those dates are constants (in my case, anyway;
as always, YMMV). I tried functions, variables, and using both upper and
lower limits. The only time index scanning was used (according to
"explain") was when I used constant values for the dates.

So... do you _have_ to do your querying in psql? Or will your
application permit you to build queries with embedded constant date
values in the P-language (PHP, Python, Perl) of your choice? See the
sketch below.

Calvin
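In other words, the application computes the cutoff once and embeds it
as a literal; a sketch (the date is whatever "30 days ago" works out to
when the query string is built, e.g. via strftime in Perl or PHP):

-- Query string generated by the application, not typed into psql:
select count(*) from gains where dategain >= '2001-10-15';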