Thread: Query across a date range

From: David Jaquay
Summary: Doing a two- or three-table join for a date range performs
worse than running the same query individually for each date in the
range.

What works: A query on a single date or a date range (against just
one table) runs quickly; 'explain' says it uses an index scan.  A
query on a single date for one store or for one market uses all
index scans, and runs quickly as well.

The problem: A date-range query for a particular store or market,
though, does a sequential scan of sales_tickets once the range
exceeds a few days, and performs worse than running one single-date
query per date.  My 'explain' for one such query is below.

Background: I have two or three tables involved in a query.  One
table holds stores (7 rows at present), one holds sales tickets
(about 5 million) and one holds line items (about 10 million).  It's
test data that I've generated and loaded using '\copy from'.  Each
has a primary key, and line items have two dates, written and
delivered, that are indexed individually.  Each store has a market
id; a market contains multiple stores (in my case, 2 or 3).  Each
sales ticket has 1-3 line items.

Is there a way to tell postgres to use an index scan on sales_tickets?
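
One thing worth noting from the schema below: sales_tickets has no
index on storeId at all, so the planner has nothing it could
index-scan for the storeId filter.  A minimal sketch of that
experiment (my own guess, nothing more; at roughly 1-in-7
selectivity the planner may still prefer the seq scan):

-- Hypothetical index, not part of the schema below; it gives the
-- planner the option of an index scan for the storeId = 1 filter.
create index sales_tickets_store on sales_tickets (storeId);
analyze sales_tickets;  -- refresh planner statistics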

Curiously, in response to recent postings in the "Low Performance for
big hospital server" thread, when I flatten the tables by putting
storeId into line_items, the query runs somewhat faster in all cases,
and much faster in some.  (I have times, if anyone is interested; a
sketch of the flattening is below.)
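
For concreteness, a minimal sketch of what I mean by flattening
(column and index names here are illustrative, not the real schema):

-- Denormalize: carry storeId on each line item, so the date-range
-- query can filter by store without joining to sales_tickets.
alter table line_items add column storeId integer;

update line_items
   set storeId = sales_tickets.storeId
  from sales_tickets
 where line_items.ticketId = sales_tickets.ticketId;

-- One composite index can then satisfy both predicates at once.
create index line_items_store_written on line_items (storeId, writtenDate);
analyze line_items;

-- The range query becomes a single-table index scan candidate:
select * from line_items
 where storeId = 1
   and writtenDate >= '2002-12-01' and writtenDate <= '2002-12-31';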

Thanks,
Dave


mydb=> explain select * from line_items t, sales_tickets s where
writtenDate >= '12/01/2002' and writtenDate <= '12/31/2002' and
t.ticketId = s.ticketId and s.storeId = 1;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Hash Join  (cost=93865.46..114054.74 rows=19898 width=28)
   Hash Cond: ("outer".ticketId = "inner".ticketId)
   ->  Index Scan using line_items_written on line_items t  (cost=0.00..3823.11 rows=158757 width=16)
         Index Cond: ((writtendate >= '2002-12-01'::date) AND (writtendate <= '2002-12-31'::date))
   ->  Hash  (cost=89543.50..89543.50 rows=626783 width=12)
         ->  Seq Scan on sales_tickets s  (cost=0.00..89543.50 rows=626783 width=12)
               Filter: (storeid = 1)
(7 rows)

mydb=> explain select * from line_items t, sales_tickets s where
writtenDate = '12/01/2002' and t.ticketId = s.ticketId and s.storeid =
1;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..16942.25 rows=697 width=28)
   ->  Index Scan using line_items_written on line_items t  (cost=0.00..121.97 rows=5554 width=16)
         Index Cond: (writtendate = '2002-12-01'::date)
   ->  Index Scan using sales_tickets_pkey on sales_tickets s  (cost=0.00..3.02 rows=1 width=12)
         Index Cond: ("outer".ticketId = s.ticketId)
         Filter: (storeid = 1)
(6 rows)


The tables:

create table stores   -- 7 rows
(
    storeId  integer not null,
    marketId integer not null
);

create table sales_tickets   -- 5,000,000 rows
(
    ticketId integer primary key,
    storeId  integer not null,
    custId   integer not null
);

create table line_items   -- 10,000,000 rows
(
    lineItemId    integer primary key,
    ticketId      integer references sales_tickets,
    writtenDate   date    not null,
    deliveredDate date    not null
);

create index line_items_written   on line_items (writtenDate);
create index line_items_delivered on line_items (deliveredDate);

Re: Query across a date range

From: Josh Berkus
Date: Fri, 7 Jan 2005 11:35:11 -0800
David,

> The problem: Doing a query for a date range on a particular store or
> market, though, for a date range of more than a few days does a
> sequential scan of sales_tickets, and performs worse than doing one
> single date query for each date.  My 'explain' for one such query is
> below.

Can you run EXPLAIN ANALYZE instead of just EXPLAIN?  That will show you the
discrepancy between estimated and actual costs, and probably show you what
needs fixing.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Query across a date range

From: Tom Lane
David Jaquay <djaquay@gmail.com> writes:
> Summary: Doing a two or three table join for a date range performs
> worse than doing the same query individually for each date in the
> range.

Could we see EXPLAIN ANALYZE, not just EXPLAIN, results?

Also, have you ANALYZEd lately?  If the estimated row counts are at all
accurate, I doubt that forcing a nestloop indexscan would improve the
situation.
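
For reference, a minimal sketch of that check (the pg_class query is
just one way to see the row counts the planner currently believes):

analyze sales_tickets;
analyze line_items;

-- reltuples/relpages are the planner's stored size estimates:
select relname, reltuples, relpages
  from pg_class
 where relname in ('sales_tickets', 'line_items');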

Also, what PG version is this?

            regards, tom lane

Re: Query across a date range

From: David Jaquay
'explain analyze' output is below.  I have done analyze recently, and
am using pg 7.4.2 on SuSE 9.1.  I'd be curious to know how to force a
nestloop indexscan, to try it out.
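
For anyone following along, a minimal sketch of what "forcing a
nestloop indexscan" usually means in practice (session-local planner
toggles, suitable for testing only):

-- Penalize seq scans and hash joins heavily, so the planner falls
-- back to a nested-loop indexscan plan if one exists:
set enable_seqscan  = false;
set enable_hashjoin = false;

explain analyze select * from line_items t, sales_tickets s
 where writtenDate >= '12/01/2002' and writtenDate <= '12/31/2002'
   and t.ticketId = s.ticketId and s.storeId = 1;

-- Restore the defaults afterwards:
reset enable_seqscan;
reset enable_hashjoin;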

Thanks,
Dave

mydb=> explain analyze select * from line_items t, sales_tickets s
where writtenDate >= '12/01/2002' and writtenDate <= '12/31/2002' and
t.ticketid = s.ticketId and s.storeId = 1;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=93865.46..114054.74 rows=19898 width=28) (actual time=25419.088..32140.217 rows=23914 loops=1)
   Hash Cond: ("outer".ticketid = "inner".ticketid)
   ->  Index Scan using line_items_written on line_items t  (cost=0.00..3823.11 rows=158757 width=16) (actual time=100.621..3354.818 rows=169770 loops=1)
         Index Cond: ((writtendate >= '2002-12-01'::date) AND (writtendate <= '2002-12-31'::date))
   ->  Hash  (cost=89543.50..89543.50 rows=626783 width=12) (actual time=22844.146..22844.146 rows=0 loops=1)
         ->  Seq Scan on sales_tickets s  (cost=0.00..89543.50 rows=626783 width=12) (actual time=38.017..19387.447 rows=713846 loops=1)
               Filter: (storeid = 1)
 Total runtime: 32164.948 ms
(8 rows)




On Fri, 7 Jan 2005 11:35:11 -0800, Josh Berkus <josh@agliodbs.com> wrote:
> Can you run EXPLAIN ANALYZE instead of just EXPLAIN?  That will show you the
> discrepancy between estimated and actual costs, and probably show you what
> needs fixing.

Also, Tom Lane wrote:
> Could we see EXPLAIN ANALYZE, not just EXPLAIN, results?
>
> Also, have you ANALYZEd lately?  If the estimated row counts are at all
> accurate, I doubt that forcing a nestloop indexscan would improve the
> situation.
>
> Also, what PG version is this?

Re: Query across a date range

From: Josh Berkus
Dave,

Ah ....

>          ->  Seq Scan on sales_tickets s  (cost=0.00..89543.50
> rows=626783 width=12) (actual time=38.017..19387.447 rows=713846
> loops=1)

This is just more than half the total time of your query.  The issue
is that you're pulling 713,000 rows (PG estimates 626,000, which is
in the right ballpark), and PG thinks that's enough rows to make a
seq scan faster than an index scan.  It could be right.

You can test that: force an indexscan by doing:
SET enable_seqscan = FALSE;

Also, please remember to run each query 3 times and report the time
of the *last* run to us.  We don't want differences in caching to
throw off your evaluation.
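
Concretely, a sketch of that test protocol (EXPLAIN ANALYZE's "Total
runtime" line is the number to compare between the two plans):

-- Session-local: penalize seq scans so the planner picks an
-- index-based plan if it can.
set enable_seqscan = false;

-- Run the identical query three times; keep only the last Total
-- runtime, so both plans are compared on warm caches.
explain analyze select * from line_items t, sales_tickets s
 where writtenDate >= '12/01/2002' and writtenDate <= '12/31/2002'
   and t.ticketId = s.ticketId and s.storeId = 1;
-- (run twice more, note the final Total runtime)

reset enable_seqscan;  -- back to the default planner behaviour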

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco