Thread: Query across a date range
Summary: Doing a two or three table join for a date range performs worse
than doing the same query individually for each date in the range.

What works: Doing a query just on a single date or a date range (against
just one table) runs quickly; 'explain' says it uses an index scan.
Doing a query on a single date for one store or for one market uses
index scans throughout, and runs quickly as well.

The problem: Doing a query for a date range on a particular store or
market, though, for a date range of more than a few days does a
sequential scan of sales_tickets, and performs worse than doing one
single-date query for each date. My 'explain' for one such query is
below.

Background: I have two or three tables involved in a query. One table
holds stores (7 rows at present), one holds sales tickets (about 5
million) and one holds line items (about 10 million). It's test data
that I've generated and loaded using '\copy from'. Each table has a
primary key, and line items have two dates, written and delivered, that
are indexed individually. Each store has a market id; a market contains
multiple stores (in my case, 2 or 3). Each sales ticket has 1-3 line
items.

Is there a way to tell postgres to use an index scan on sales_tickets?

Curiously, in response to recent postings in the "Low Performance for
big hospital server" thread, when I flatten the tables by putting
storeid into line_items, it runs somewhat faster in all cases, and much
faster in some (I have times, if anyone is interested).

Thanks,
Dave

mydb=> explain select * from line_items t, sales_tickets s where
writtenDate >= '12/01/2002' and writtenDate <= '12/31/2002' and
t.ticketId = s.ticketId and s.storeId = 1;

                                             QUERY PLAN
---------------------------------------------------------------------------------------------------
 Hash Join  (cost=93865.46..114054.74 rows=19898 width=28)
   Hash Cond: ("outer".ticketId = "inner".ticketId)
   ->  Index Scan using line_items_written on line_items t  (cost=0.00..3823.11 rows=158757 width=16)
         Index Cond: ((writtendate >= '2002-12-01'::date) AND (writtendate <= '2002-12-31'::date))
   ->  Hash  (cost=89543.50..89543.50 rows=626783 width=12)
         ->  Seq Scan on sales_tickets s  (cost=0.00..89543.50 rows=626783 width=12)
               Filter: (storeid = 1)
(7 rows)

mydb=> explain select * from line_items t, sales_tickets s where
writtenDate = '12/01/2002' and t.ticketId = s.ticketId and s.storeid = 1;

                                          QUERY PLAN
---------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..16942.25 rows=697 width=28)
   ->  Index Scan using line_items_written on line_items t  (cost=0.00..121.97 rows=5554 width=16)
         Index Cond: (writtendate = '2002-12-01'::date)
   ->  Index Scan using sales_tickets_pkey on sales_tickets s  (cost=0.00..3.02 rows=1 width=12)
         Index Cond: ("outer".ticketId = s.ticketId)
         Filter: (storeid = 1)
(6 rows)

The tables:

create table stores  -- 7 rows
(
    storeId  integer not null,
    marketId integer not null
);

create table sales_tickets  -- about 5 million rows
(
    ticketId integer primary key,
    storeId  integer not null,
    custId   integer not null
);

create table line_items  -- about 10 million rows
(
    lineItemId    integer primary key,
    ticketId      integer references sales_tickets,
    writtenDate   date not null,
    deliveredDate date not null
);

create index line_items_written on line_items (writtenDate);
create index line_items_delivered on line_items (deliveredDate);
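For the curious, the flattened variant mentioned above looks roughly
like this; the table and index names here are just illustrative, and the
composite index is one option, not the only way to do it:

create table line_items_flat  -- storeId copied down from sales_tickets
(
    lineItemId    integer primary key,
    ticketId      integer references sales_tickets,
    storeId       integer not null,  -- denormalized from the ticket
    writtenDate   date not null,
    deliveredDate date not null
);

-- With storeId and writtenDate in one table, a composite index can
-- answer the store + date-range predicate in a single index scan:
create index line_items_flat_store_written
    on line_items_flat (storeId, writtenDate);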
David,

> The problem: Doing a query for a date range on a particular store or
> market, though, for a date range of more than a few days does a
> sequential scan of sales_tickets, and performs worse than doing one
> single-date query for each date. My 'explain' for one such query is
> below.

Can you run EXPLAIN ANALYZE instead of just EXPLAIN? That will show you
the discrepancy between estimated and actual costs, and probably show
you what needs fixing.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
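P.S. That is, just prefix the query that misbehaves, e.g.:

EXPLAIN ANALYZE
select * from line_items t, sales_tickets s
where writtenDate >= '12/01/2002' and writtenDate <= '12/31/2002'
  and t.ticketId = s.ticketId and s.storeId = 1;

Each node of the plan will then show (actual time=... rows=...)
measurements next to the planner's estimates, so you can see where the
two diverge.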
David Jaquay <djaquay@gmail.com> writes:
> Summary: Doing a two or three table join for a date range performs
> worse than doing the same query individually for each date in the
> range.

Could we see EXPLAIN ANALYZE, not just EXPLAIN, results?

Also, have you ANALYZEd lately? If the estimated row counts are at all
accurate, I doubt that forcing a nestloop indexscan would improve the
situation.

Also, what PG version is this?

			regards, tom lane
'explain analyze' output is below. I have ANALYZEd recently, and am
using pg 7.4.2 on SuSE 9.1. I'd be curious to know how to force a
nestloop indexscan, to try it out.

Thanks,
Dave

mydb=> explain analyze select * from line_items t, sales_tickets s
where writtenDate >= '12/01/2002' and writtenDate <= '12/31/2002' and
t.ticketid = s.ticketId and s.storeId = 1;

                                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=93865.46..114054.74 rows=19898 width=28) (actual time=25419.088..32140.217 rows=23914 loops=1)
   Hash Cond: ("outer".ticketid = "inner".ticketid)
   ->  Index Scan using line_items_written on line_items t  (cost=0.00..3823.11 rows=158757 width=16) (actual time=100.621..3354.818 rows=169770 loops=1)
         Index Cond: ((writtendate >= '2002-12-01'::date) AND (writtendate <= '2002-12-31'::date))
   ->  Hash  (cost=89543.50..89543.50 rows=626783 width=12) (actual time=22844.146..22844.146 rows=0 loops=1)
         ->  Seq Scan on sales_tickets s  (cost=0.00..89543.50 rows=626783 width=12) (actual time=38.017..19387.447 rows=713846 loops=1)
               Filter: (storeid = 1)
 Total runtime: 32164.948 ms
(8 rows)

On Fri, 7 Jan 2005 11:35:11 -0800, Josh Berkus <josh@agliodbs.com> wrote:
> Can you run EXPLAIN ANALYZE instead of just EXPLAIN? That will show you
> the discrepancy between estimated and actual costs, and probably show
> you what needs fixing.

Also, Tom Lane wrote:
> Could we see EXPLAIN ANALYZE, not just EXPLAIN, results?
>
> Also, have you ANALYZEd lately? If the estimated row counts are at all
> accurate, I doubt that forcing a nestloop indexscan would improve the
> situation.
>
> Also, what PG version is this?
Dave,

Ah ....

> ->  Seq Scan on sales_tickets s  (cost=0.00..89543.50 rows=626783
>     width=12) (actual time=38.017..19387.447 rows=713846 loops=1)

This is just more than 1/2 the time of your query. The issue is that
you're pulling 713,000 rows (PG estimates 626,000, which is in the right
ballpark) and PG thinks that this is enough rows that a seq scan is
faster. It could be right.

You can test that; force an indexscan by doing:

SET enable_seqscan = FALSE;

Also, please remember to run each query 3 times and report the time of
the *last* run to us. We don't want differences in caching to throw off
your evaluation.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
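P.S. A full test sequence, re-using your range query from earlier in the
thread, might look something like this (enable_seqscan only affects the
current session, and RESET puts it back afterwards):

SET enable_seqscan = FALSE;

EXPLAIN ANALYZE
select * from line_items t, sales_tickets s
where writtenDate >= '12/01/2002' and writtenDate <= '12/31/2002'
  and t.ticketId = s.ticketId and s.storeId = 1;

RESET enable_seqscan;

If the planner still picks a hash join rather than the nestloop you saw
for single dates, SET enable_hashjoin = FALSE is the analogous knob for
that.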