Re: Random Page Cost and Planner - Mailing list pgsql-performance

From Alexey Klyukin
Subject Re: Random Page Cost and Planner
Date
Msg-id 1B94E0EC-D665-4059-83DC-5D1B1E2B1009@commandprompt.com
In response to Re: Random Page Cost and Planner  (David Jarvis <thangalin@gmail.com>)
Responses Re: Random Page Cost and Planner  (David Jarvis <thangalin@gmail.com>)
List pgsql-performance
On May 26, 2010, at 6:50 AM, David Jarvis wrote:
>
> That said, when using the following condition, the query is fast (1 second):
>
>     extract(YEAR FROM sc.taken_start) >= 1963 AND
>     extract(YEAR FROM sc.taken_end) <= 2009 AND
>
> "                    ->  Index Scan using measurement_013_stc_idx on measurement_013 m  (cost=0.00..511.00 rows=511 width=15) (actual time=0.018..3.601 rows=3356 loops=104)"
> "                          Index Cond: ((m.station_id = sc.station_id) AND (m.taken >= sc.taken_start) AND (m.taken <= sc.taken_end) AND (m.category_id = 7))"
>
> This condition makes it slow (13 seconds on first run, 8 seconds thereafter):
>
>     extract(YEAR FROM sc.taken_start) >= 1900 AND
>     extract(YEAR FROM sc.taken_end) <= 2009 AND
>
> "                          Filter: (category_id = 7)"
> "                    ->  Seq Scan on measurement_013 m  (cost=0.00..359704.80 rows=18118464 width=15) (actual time=0.008..4025.692 rows=18118395 loops=1)"
>
> At this point, I'm tempted to write a stored procedure that iterates over each station category for all the years of each station. My guess is that the planner's estimate for the number of rows that will be returned by extract(YEAR FROM sc.taken_start) >= 1900 is incorrect and so it chooses a full table scan for all rows.

Nope, it appears that the planner's estimate is correct (it estimates 18118464 rows vs 18118395 actual rows). I think what's happening there is that 18M rows is a large enough fraction of the total table that it makes sense to scan it sequentially (eliminating random-access costs). Try SET enable_seqscan = false and repeat the query - there is a chance that the index scan would be even slower.
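A minimal sketch of that experiment (the setting is a standard PostgreSQL session parameter; the query itself is not reproduced here, so a placeholder stands in for it):

```sql
-- Disable sequential scans for this session only, then compare plans.
-- The planner can still use a seq scan if no index is applicable.
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT ...;  -- re-run the slow query (the 1900 date range) here

-- Restore the default so later queries are planned normally.
RESET enable_seqscan;
```

Comparing the "actual time" totals of the two plans shows whether the planner's preference for the sequential scan was justified.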

> The part I am having trouble with is convincing PG to use the index for the station ID and the date range for when the station was active. Each station has a unique ID; the data in the measurement table is ordered by measurement date then by station.
>
> Should I add a clustered index by station then by date?
>
> Any other suggestions are very much appreciated.

Is it necessary to fetch the data as far back as 1900 every time? If the results for past years are constant, there may be a possibility to pre-aggregate them.
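For instance, a summary table rebuilt only when historical data changes could serve the wide date ranges; a rough sketch (the table and column names here, including "amount", are assumptions based on the thread, not the poster's actual schema):

```sql
-- Sketch: precompute per-station, per-category yearly aggregates
-- for closed historical years, so wide queries avoid the 18M-row scan.
CREATE TABLE measurement_yearly AS
SELECT station_id,
       category_id,
       extract(YEAR FROM taken) AS year,
       avg(amount) AS avg_amount,   -- "amount" is a hypothetical column
       count(*)    AS n
FROM   measurement
GROUP  BY station_id, category_id, extract(YEAR FROM taken);

-- Index the lookup columns the original query filters on.
CREATE INDEX measurement_yearly_idx
    ON measurement_yearly (station_id, category_id, year);
```

Queries spanning many decades would then read a few thousand summary rows instead of millions of raw measurements.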

Regards,
--
Alexey Klyukin <alexk@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.


