Query across a date range - Mailing list pgsql-performance

From: David Jaquay
Subject: Query across a date range
Msg-id: ad4aa5a805010711171a3c1696@mail.gmail.com
List: pgsql-performance
Summary: Doing a two- or three-table join for a date range performs
worse than doing the same query individually for each date in the
range.

What works: A query on a single date or a date range (against just one
table) runs quickly; 'explain' says it uses an index scan.  A query on
a single date for one store or for one market also uses all index
scans, and runs quickly as well.

The problem: A query against a particular store or market over a date
range of more than a few days, though, does a sequential scan of
sales_tickets, and performs worse than running one single-date query
per date in the range.  My 'explain' output for one such query is
below.

Background: I have two or three tables involved in the query.  One
table holds stores (7 rows at present), one holds sales tickets (about
5 million), and one holds line items (about 10 million).  It's test
data that I've generated and loaded using '\copy from'.  Each table has
a primary key, and line items have two dates, written and delivered,
that are indexed individually.  Each store has a market id; a market
contains multiple stores (in my case, 2 or 3).  Each sales ticket has
1-3 line items.

Is there a way to tell postgres to use an index scan on sales_tickets?
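
One thing I haven't tried yet, purely as a diagnostic: PostgreSQL has
no index hints, but the planner's enable_seqscan setting can be turned
off for a session to see whether the index-scan plan would actually be
faster (this is the same range query whose plan appears below):

-- Diagnostic only: discourage seq scans, then compare plan and timing.
set enable_seqscan = off;
explain analyze
select * from line_items t, sales_tickets s
 where t.writtenDate >= '12/01/2002' and t.writtenDate <= '12/31/2002'
   and t.ticketId = s.ticketId and s.storeId = 1;
reset enable_seqscan;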

Curiously, following the recent postings in the "Low Performance for
big hospital server" thread, when I flatten the tables by putting
storeid into line_items, everything runs somewhat faster, and some
queries run much faster (I have timings, if anyone is interested).  The
flattening looked roughly like the sketch below.
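
Roughly (the column and index names here are just illustrative, not my
exact flattened schema):

alter table line_items add column storeId integer;

update line_items li
   set storeId = s.storeId
  from sales_tickets s
 where s.ticketId = li.ticketId;

create index line_items_store_written on line_items (storeId, writtenDate);

-- The date-range query then needs no join against sales_tickets:
select * from line_items
 where storeId = 1
   and writtenDate >= '12/01/2002' and writtenDate <= '12/31/2002';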

Thanks,
Dave


mydb=> explain select * from line_items t, sales_tickets s where
writtenDate >= '12/01/2002' and writtenDate <= '12/31/2002' and
t.ticketId = s.ticketId and s.storeId = 1;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Hash Join  (cost=93865.46..114054.74 rows=19898 width=28)
   Hash Cond: ("outer".ticketId = "inner".ticketId)
   ->  Index Scan using line_items_written on line_items t
(cost=0.00..3823.11 rows=158757 width=16)
         Index Cond: ((writtendate >= '2002-12-01'::date) AND
(writtendate <= '2002-12-31'::date))
   ->  Hash  (cost=89543.50..89543.50 rows=626783 width=12)
         ->  Seq Scan on sales_tickets s  (cost=0.00..89543.50
rows=626783 width=12)
               Filter: (storeid = 1)
(7 rows)

mydb=> explain select * from line_items t, sales_tickets s where
writtenDate = '12/01/2002' and t.ticketId = s.ticketId and s.storeid =
1;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..16942.25 rows=697 width=28)
   ->  Index Scan using line_items_written on soldtrx t
(cost=0.00..121.97 rows=5554 width=16)
         Index Cond: (writtendate = '2002-12-01'::date)
   ->  Index Scan using sales_tickets_pkey on sales_tickets s
(cost=0.00..3.02 rows=1 width=12)
         Index Cond: ("outer".ticketId = s.ticketId)
         Filter: (storeid = 1)
(6 rows)


The tables:

create table stores   -- 7 rows
(
    storeId  integer not null,
    marketId integer not null
);

create table sales_tickets   -- 500,000 rows
(
    ticketId integer primary key,
    storeId  integer not null,
    custId   integer not null
);

create table line_items   -- 1,000,000 rows
(
    lineItemId    integer primary key,
    ticketId      integer references sales_tickets,
    writtenDate   date    not null,
    deliveredDate date    not null
);

create index line_items_written   on line_items (writtenDate);
create index line_items_delivered on line_items (deliveredDate);
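
(For what it's worth, there is no index on sales_tickets.storeId in the
schema above.  Adding one is another experiment, though since storeId = 1
matches roughly an eighth of the table, the planner may still prefer the
seq scan.)

-- Hypothetical extra index, not part of the schema above:
create index sales_tickets_store on sales_tickets (storeId);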
