Thread: database performance and query performance question
Our database has slowed right down. We are not getting any performance from our
biggest table "forecastelement". The table has 93,218,671 records in it and
climbing. The index is on 4 columns; originally it was on 3. I added another to
see if it would improve performance. It did not.

Should there be less columns in the index?
How can we improve database performance?
How should I improve my query?

PWFPM_DEV=# \d forecastelement
            Table "public.forecastelement"
     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 version        | character varying(99)       | not null
 origin         | character varying(10)       | not null
 timezone       | character varying(99)       | not null
 region_id      | character varying(20)       | not null
 wx_element     | character varying(99)       | not null
 value          | character varying(99)       | not null
 flag           | character(3)                | not null
 units          | character varying(99)       | not null
 valid_time     | timestamp without time zone | not null
 issue_time     | timestamp without time zone | not null
 next_forecast  | timestamp without time zone | not null
 reception_time | timestamp without time zone | not null
Indexes:
    "forecastelement_vrwi_idx" btree (valid_time, region_id, wx_element, issue_time)

explain analyze
select DISTINCT ON (valid_time)
       to_char(valid_time,'YYYYMMDDHH24MISS') as valid_time, value
from (select valid_time, value, "time"(valid_time) as hour, reception_time, issue_time
      from forecastelement
      where valid_time between '2002-09-02 04:00:00' and '2002-09-07 03:59:59'
        and region_id = 'PU-REG-WTO-00200'
        and wx_element = 'TEMP_VALEUR1'
        and issue_time between '2002-09-02 05:00:00' and '2002-09-06 05:00:00'
        and origin = 'REGIONAL'
        and "time"(issue_time) = '05:00:00'
      order by issue_time, reception_time DESC, valid_time) as foo
where (date(valid_time) = date(issue_time)+1 -1
       or date(valid_time) = date(issue_time)+1
       or (valid_time between '2002-09-07 00:00:00' and '2002-09-07 03:59:59'
           and issue_time = '2002-09-06 05:00:00'))
order by valid_time, issue_time DESC;

USING INDEX "forecastelement_vrwi_idx" btree (valid_time, region_id, wx_element, issue_time)

 Unique  (cost=116.75..116.76 rows=1 width=83) (actual time=9469.088..9470.002 rows=115 loops=1)
   ->  Sort  (cost=116.75..116.75 rows=1 width=83) (actual time=9469.085..9469.308 rows=194 loops=1)
         Sort Key: to_char(valid_time, 'YYYYMMDDHH24MISS'::text), issue_time
         ->  Subquery Scan foo  (cost=116.72..116.74 rows=1 width=83) (actual time=9465.979..9467.735 rows=194 loops=1)
               ->  Sort  (cost=116.72..116.73 rows=1 width=30) (actual time=9440.756..9440.981 rows=194 loops=1)
                     Sort Key: issue_time, reception_time, valid_time
                     ->  Index Scan using forecastelement_vrwi_idx on forecastelement  (cost=0.00..116.71 rows=1 width=30) (actual time=176.510..9439.470 rows=194 loops=1)
                           Index Cond: ((valid_time >= '2002-09-02 04:00:00'::timestamp without time zone) AND (valid_time <= '2002-09-07 03:59:59'::timestamp without time zone) AND ((region_id)::text = 'PU-REG-WTO-00200'::text) AND ((wx_element)::text = 'TEMP_VALEUR1'::text) AND (issue_time >= '2002-09-02 05:00:00'::timestamp without time zone) AND (issue_time <= '2002-09-06 05:00:00'::timestamp without time zone))
                           Filter: (((origin)::text = 'REGIONAL'::text) AND ("time"(issue_time) = '05:00:00'::time without time zone) AND ((date(valid_time) = ((date(issue_time) + 1) - 1)) OR (date(valid_time) = (date(issue_time) + 1)) OR ((valid_time >= '2002-09-07 00:00:00'::timestamp without time zone) AND (valid_time <= '2002-09-07 03:59:59'::timestamp without time zone) AND (issue_time = '2002-09-06 05:00:00'::timestamp without time zone))))
 Total runtime: 9470.404 ms

We are running postgresql-7.4-0.5PGDG.i386.rpm on a Dell PowerEdge 6650.
OS:     RHAS 3.0
CPUs:   4
Memory: 3.6 GB
Disk:   270 GB, RAID 5

postgresql.conf:
max_connections = 64
shared_buffers = 4000
vacuum_mem = 32768
effective_cache_size = 312500
random_page_cost = 2
Dan,

> Should there be less columns in the index?
> How can we improve database performance?
> How should I improve my query?

Your query plan isn't the problem. It's a good plan, and a reasonably
efficient query. Under other circumstances, the SELECT DISTINCT with the
to_char could be a performance-killer, but it's not in that result set.

Overall, you're taking 9 seconds to scan 93 million records. Is this the time
the first time you run the query, or the 2nd and successive times?

When did you last run VACUUM ANALYZE on the table? Have you tried increasing
the ANALYZE statistics on the index columns to, say, 500?

Your disks are RAID 5. How many drives? In RAID 5, more drives improve the
speed of large scans.

And what's your sort_mem setting? You didn't mention it.

Why is your effective cache size only 300 MB when you have 3 GB of RAM? It's
not affecting this query, but it could affect others.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
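[For readers following along: raising the ANALYZE statistics the way Josh
suggests is done per column. A sketch, using the four indexed columns from the
table above and the target of 500 he mentions; the value to pick depends on
your data distribution.]

```sql
-- Raise the planner's statistics target on the indexed columns,
-- then re-gather statistics so the new target takes effect.
ALTER TABLE forecastelement ALTER COLUMN valid_time SET STATISTICS 500;
ALTER TABLE forecastelement ALTER COLUMN region_id  SET STATISTICS 500;
ALTER TABLE forecastelement ALTER COLUMN wx_element SET STATISTICS 500;
ALTER TABLE forecastelement ALTER COLUMN issue_time SET STATISTICS 500;
ANALYZE forecastelement;
```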
Dan,

> Why is your effective cache size only 300mb when you have 3 GB of RAM? It's
> not affecting this query, but it could affect others.

Ignore this last question, I dropped a zero from my math. Sorry!

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
-----Original Message-----
From: Josh Berkus [mailto:josh@agliodbs.com]
Sent: Thursday, January 22, 2004 3:01 PM
To: Shea,Dan [CIS]; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] database performance and query performance question

Dan,

> Should there be less columns in the index?
> How can we improve database performance?
> How should I improve my query?

>> Your query plan isn't the problem. It's a good plan, and a reasonably
>> efficient query. Under other circumstances, the SELECT DISTINCT with the
>> to_char could be a performance-killer, but it's not in that result set.
>> Overall, you're taking 9 seconds to scan 93 million records. Is this the
>> time the first time you run the query, or the 2nd and successive times?

This is actually the second time. The first query took more time. Concerning
the number of columns for an index, I switched the index to have only one
column and tried the same query. It is below.

>> When did you last run VACUUM ANALYZE on the table? Have you tried
>> increasing the ANALYZE statistics on the index columns to, say, 500?

It is run nightly, but last night's did not complete. It was taking quite
some time (over 4 hours), so I cancelled it. I will try increasing the
ANALYZE statistics to 500.

>> Your disks are RAID 5. How many drives? In RAID5, more drives improves the
>> speed of large scans.

There are 4 drives in this RAID 5. We are using LVM with an ext3 filesystem.
We will be moving the database to a SAN within the next month.

>> And what's your sort_mem setting? You didn't mention it.

The sort_mem is the default:

PWFPM_DEV=# show sort_mem;
 sort_mem
----------
 1024

>> Why is your effective cache size only 300mb when you have 3 GB of RAM?
>> It's not affecting this query, but it could affect others.

Oh, I thought I had it set for 2.5 GB of RAM. 312500 * 8k = 2.5 GB

QUERY WITH 1 column in index.
 Unique  (cost=717633.28..717633.29 rows=1 width=83) (actual time=62922.399..62923.334 rows=115 loops=1)
   ->  Sort  (cost=717633.28..717633.29 rows=1 width=83) (actual time=62922.395..62922.615 rows=194 loops=1)
         Sort Key: to_char(valid_time, 'YYYYMMDDHH24MISS'::text), issue_time
         ->  Subquery Scan foo  (cost=717633.26..717633.27 rows=1 width=83) (actual time=62918.232..62919.989 rows=194 loops=1)
               ->  Sort  (cost=717633.26..717633.26 rows=1 width=30) (actual time=62902.378..62902.601 rows=194 loops=1)
                     Sort Key: issue_time, reception_time, valid_time
                     ->  Index Scan using forecastelement_v_idx on forecastelement  (cost=0.00..717633.25 rows=1 width=30) (actual time=1454.974..62900.752 rows=194 loops=1)
                           Index Cond: ((valid_time >= '2002-09-02 04:00:00'::timestamp without time zone) AND (valid_time <= '2002-09-07 03:59:59'::timestamp without time zone))
                           Filter: (((region_id)::text = 'PU-REG-WTO-00200'::text) AND ((wx_element)::text = 'TEMP_VALEUR1'::text) AND (issue_time >= '2002-09-02 05:00:00'::timestamp without time zone) AND (issue_time <= '2002-09-06 05:00:00'::timestamp without time zone) AND ((origin)::text = 'REGIONAL'::text) AND ("time"(issue_time) = '05:00:00'::time without time zone) AND ((date(valid_time) = ((date(issue_time) + 1) - 1)) OR (date(valid_time) = (date(issue_time) + 1)) OR ((valid_time >= '2002-09-07 00:00:00'::timestamp without time zone) AND (valid_time <= '2002-09-07 03:59:59'::timestamp without time zone) AND (issue_time = '2002-09-06 05:00:00'::timestamp without time zone))))
 Total runtime: 62923.723 ms
(10 rows)

A second run of the same query:

PWFPM_DEV=# explain analyze select DISTINCT ON (valid_time)
PWFPM_DEV-# to_char(valid_time,'YYYYMMDDHH24MISS') as valid_time,value from
PWFPM_DEV-# (select valid_time,value,"time"(valid_time) as hour,reception_time,
PWFPM_DEV(# issue_time from forecastelement where
PWFPM_DEV(# valid_time between '2002-09-02 04:00:00' and
PWFPM_DEV(# '2002-09-07 03:59:59' and region_id = 'PU-REG-WTO-00200'
PWFPM_DEV(# and wx_element = 'TEMP_VALEUR1' and issue_time between
PWFPM_DEV(# '2002-09-02 05:00:00' and '2002-09-06 05:00:00'
PWFPM_DEV(# and origin = 'REGIONAL' and "time"(issue_time) = '05:00:00'
PWFPM_DEV(# order by issue_time,reception_time DESC,valid_time) as foo where
PWFPM_DEV-# (date(valid_time) = date(issue_time)+1 -1 or date(valid_time) = date(issue_time)+1 or
PWFPM_DEV(# (valid_time between '2002-09-07 00:00:00' and '2002-09-07 03:59:59'
PWFPM_DEV(# and issue_time = '2002-09-06 05:00:00')) order by valid_time ,issue_time DESC;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Unique  (cost=717633.28..717633.29 rows=1 width=83) (actual time=21468.227..21469.164 rows=115 loops=1)
   ->  Sort  (cost=717633.28..717633.29 rows=1 width=83) (actual time=21468.223..21468.452 rows=194 loops=1)
         Sort Key: to_char(valid_time, 'YYYYMMDDHH24MISS'::text), issue_time
         ->  Subquery Scan foo  (cost=717633.26..717633.27 rows=1 width=83) (actual time=21465.274..21467.006 rows=194 loops=1)
               ->  Sort  (cost=717633.26..717633.26 rows=1 width=30) (actual time=21465.228..21465.452 rows=194 loops=1)
                     Sort Key: issue_time, reception_time, valid_time
                     ->  Index Scan using forecastelement_v_idx on forecastelement  (cost=0.00..717633.25 rows=1 width=30) (actual time=1479.649..21463.779 rows=194 loops=1)
                           Index Cond: ((valid_time >= '2002-09-02 04:00:00'::timestamp without time zone) AND (valid_time <= '2002-09-07 03:59:59'::timestamp without time zone))
                           Filter: (((region_id)::text = 'PU-REG-WTO-00200'::text) AND ((wx_element)::text = 'TEMP_VALEUR1'::text) AND (issue_time >= '2002-09-02 05:00:00'::timestamp without time zone) AND (issue_time <= '2002-09-06 05:00:00'::timestamp without time zone) AND ((origin)::text = 'REGIONAL'::text) AND ("time"(issue_time) = '05:00:00'::time without time zone) AND ((date(valid_time) = ((date(issue_time) + 1) - 1)) OR (date(valid_time) = (date(issue_time) + 1)) OR ((valid_time >= '2002-09-07 00:00:00'::timestamp without time zone) AND (valid_time <= '2002-09-07 03:59:59'::timestamp without time zone) AND (issue_time = '2002-09-06 05:00:00'::timestamp without time zone))))
 Total runtime: 21469.485 ms
(10 rows)

PWFPM_DEV=#

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
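[For readers following along: the `show sort_mem` output above is the 7.4
default of 1024, i.e. 1 MB per sort, which is small for a table this size.
A sketch of raising it, per session or server-wide; the 16384 (16 MB) figure
is illustrative, not a recommendation from the thread.]

```sql
-- Raise sort memory for the current session only (value is in KB):
SET sort_mem = 16384;

-- Or make it the default for all sessions by editing postgresql.conf:
--   sort_mem = 16384
-- and reloading the postmaster.
```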
Something that I do not understand is why, if you use valid_time =
'2004-01-22 00:00:00', the query will use the index, but if you do
valid_time > '2004-01-22 00:00:00' it does not use the index?

PWFPM_DEV=# explain analyze select * from forecastelement where valid_time >
date '2004-01-23'::date limit 10;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.82 rows=10 width=129) (actual time=199550.388..199550.783 rows=10 loops=1)
   ->  Seq Scan on forecastelement  (cost=0.00..2722898.40 rows=7131102 width=129) (actual time=199550.382..199550.757 rows=10 loops=1)
         Filter: (valid_time > '2004-01-23 00:00:00'::timestamp without time zone)
 Total runtime: 199550.871 ms
(4 rows)

PWFPM_DEV=# explain analyze select * from forecastelement where valid_time =
date '2004-01-23'::date limit 10;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Limit  (cost=0.00..18.76 rows=10 width=129) (actual time=176.141..276.577 rows=10 loops=1)
   ->  Index Scan using forecastelement_vrwi_idx on forecastelement  (cost=0.00..160770.98 rows=85707 width=129) (actual time=176.133..276.494 rows=10 loops=1)
         Index Cond: (valid_time = '2004-01-23 00:00:00'::timestamp without time zone)
 Total runtime: 276.721 ms
(4 rows)

-----Original Message-----
From: Josh Berkus [mailto:josh@agliodbs.com]
Sent: Thursday, January 22, 2004 3:01 PM
To: Shea,Dan [CIS]; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] database performance and query performance question

Dan,

> Should there be less columns in the index?
> How can we improve database performance?
> How should I improve my query?

Your query plan isn't the problem. It's a good plan, and a reasonably
efficient query.
Dan,

> Something that I do not understand is why if you use a valid_time =
> '2004-01-22 00:00:00' the query will use the index but if you do a
> valid_time > '2004-01-22 00:00:00' it does not use the index?

Because of the expected number of rows to be returned. Take a look at the row
estimates on the forecastelement scans. You can improve these estimates by
increasing the ANALYZE stats and/or running ANALYZE more often. Of course,
increasing the stats makes ANALYZE run slower ...

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
Shea,Dan [CIS] wrote on Thu, 22.01.2004 at 22:35:
> Something that I do not understand is why if you use a valid_time =
> '2004-01-22 00:00:00' the query will use the index but if you do a
> valid_time > '2004-01-22 00:00:00' it does not use the index?

It probably can't tell if > is selective enough to justify using the index.

Together with "limit 10" it may be.

You could try

explain analyze select * from forecastelement where valid_time between
'2004-01-22'::date and '2004-01-22'::date limit 10;

to see if this is considered good enough.

--------------
Hannu
Hannu Krosing wrote on Thu, 22.01.2004 at 22:46:
> Shea,Dan [CIS] wrote on Thu, 22.01.2004 at 22:35:
> > Something that I do not understand is why if you use a valid_time =
> > '2004-01-22 00:00:00' the query will use the index but if you do a
> > valid_time > '2004-01-22 00:00:00' it does not use the index?
>
> It probably can't tell if > is selective enough to justify using index.
>
> Together with "limit 10" it may be.
>
> You could try
>
> explain analyze select * from forecastelement where valid_time between
> '2004-01-22'::date and '2004-01-22'::date limit 10;

Sorry, that should have been:

between '2004-01-22'::date and '2004-01-23'::date

> to see if this is considered good enough.
>
> --------------
> Hannu

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
This sure speeds up the query; it is fast.

PWFPM_DEV=# explain analyze select * from forecastelement where valid_time
between '2004-01-12'::date and '2003-01-12'::date;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Index Scan using forecastelement_v_idx on forecastelement  (cost=0.00..159607.11 rows=466094 width=129) (actual time=49.504..49.504 rows=0 loops=1)
   Index Cond: ((valid_time >= '2004-01-12 00:00:00'::timestamp without time zone) AND (valid_time <= '2003-01-12 00:00:00'::timestamp without time zone))
 Total runtime: 49.589 ms
(3 rows)

-----Original Message-----
From: Hannu Krosing [mailto:hannu@tm.ee]
Sent: Thursday, January 22, 2004 3:54 PM
To: Shea,Dan [CIS]
Cc: 'josh@agliodbs.com'; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] database performance and query performance question

Hannu Krosing wrote on Thu, 22.01.2004 at 22:46:
> Shea,Dan [CIS] wrote on Thu, 22.01.2004 at 22:35:
> > Something that I do not understand is why if you use a valid_time =
> > '2004-01-22 00:00:00' the query will use the index but if you do a
> > valid_time > '2004-01-22 00:00:00' it does not use the index?
>
> It probably can't tell if > is selective enough to justify using index.
>
> Together with "limit 10" it may be.
>
> You could try
>
> explain analyze select * from forecastelement where valid_time between
> '2004-01-22'::date and '2004-01-22'::date limit 10;

Sorry, that should have been:

between '2004-01-22'::date and '2004-01-23'::date

> to see if this is considered good enough.
>
> --------------
> Hannu
The end date in the previous example was actually invalid:
between '2004-01-12'::date and '2003-01-12'::date;

There have been multiple inserts since I recreated the index, but it took
quite some time to complete the following:

PWFPM_DEV=# explain analyze select * from forecastelement where valid_time
between '2004-01-12'::date and '2004-01-13'::date;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Index Scan using forecastelement_v_idx on forecastelement  (cost=0.00..832139.81 rows=2523119 width=129) (actual time=0.519..467159.658 rows=2940600 loops=1)
   Index Cond: ((valid_time >= '2004-01-12 00:00:00'::timestamp without time zone) AND (valid_time <= '2004-01-13 00:00:00'::timestamp without time zone))
 Total runtime: 472627.148 ms
(3 rows)

-----Original Message-----
From: Shea,Dan [CIS]
Sent: Thursday, January 22, 2004 4:10 PM
To: 'Hannu Krosing'; Shea,Dan [CIS]
Cc: 'josh@agliodbs.com'; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] database performance and query performance question

This sure speeds up the query; it is fast.
Dan,

Of course it took forever. You're retrieving 2.9 million rows!

> Index Scan using forecastelement_v_idx on forecastelement
> (cost=0.00..832139.81 rows=2523119 width=129) (actual
> time=0.519..467159.658 rows=2940600 loops=1)

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
Shea,Dan [CIS] wrote on Thu, 22.01.2004 at 23:32:
> The end date in the previous example was actually invalid: between
> '2004-01-12'::date and '2003-01-12'::date;
> There have been multiple inserts since I recreated the index but it took
> quite some time to complete the following
> PWFPM_DEV=# explain analyze select * from forecastelement where valid_time
> between '2004-01-12'::date and '2004-01-13'::date;

You could try ORDER BY to bias the optimiser towards using an index:

explain analyze select * from forecastelement where valid_time >
'2004-01-12'::date order by valid_time limit 10;

This may also be closer to what you are expecting :)

------------------
Hannu
"Shea,Dan [CIS]" <Dan.Shea@ec.gc.ca> writes: > Indexes: > "forecastelement_vrwi_idx" btree (valid_time,region_id.wx_element.issue_time) > > explain analyze > SELECT DISTINCT ON (valid_time) > to_char(valid_time,'YYYYMMDDHH24MISS') AS valid_time, > value > from ( > SELECT valid_time,value, "time"(valid_time) AS hour, reception_time, issue_time > FROM forecastelement > WHERE valid_time BETWEEN '2002-09-02 04:00:00' AND '2002-09-07 03:59:59' > AND region_id = 'PU-REG-WTO-00200' > AND wx_element = 'TEMP_VALEUR1' > AND issue_time BETWEEN '2002-09-02 05:00:00' AND '2002-09-06 05:00:00' > AND origin = 'REGIONAL' > AND "time"(issue_time) = '05:00:00' > ORDER BY issue_time,reception_time DESC,valid_time > ) AS foo > WHERE > ( date(valid_time) = date(issue_time)+1 -1 > OR date(valid_time) = date(issue_time)+1 > OR ( valid_time BETWEEN '2002-09-07 00:00:00' AND '2002-09-07 03:59:59' > AND issue_time = '2002-09-06 05:00:00' > ) > ) > ORDER BY valid_time ,issue_time DESC; Incidentally, I find it easier to analyze queries when they've been formatted well. This makes what's going on much clearer. From this it's clear your index doesn't match the query. Adding more columns will be useless because only the leading column "valid_time" will be used at all. Since you're fetching a whole range of valid_times the remaining columns are all irrelevant. They only serve to bloat the index and require reading a lot more data. You could either try creating an index just on valid_time, or create an index on (region_id,wx_element,valid_time) or (region_id,wx_element,issue_time) whichever is more selective. You could put wx_element first if it's more selective than region_id. Moreover, what purpose does the inner ORDER BY clause serve? It's only going to be re-sorted again by the outer ORDER BY. -- greg