Re: cost and actual time - Mailing list pgsql-performance
From: Chantal Ackermann
Subject: Re: cost and actual time
Msg-id: 3E520AD8.2000304@biomax.de
In response to: Re: cost and actual time (Manfred Koizar <mkoi-pg@aon.at>)
List: pgsql-performance
hello Manfred, hello Josh, hello Tom,

I followed your advice. Using the new query with explicit joins, combined with the function that retrieves the gene_id, the estimated row count is now far more realistic. Still, the same query with different gene names sometimes takes less than a second and sometimes several minutes, obviously depending on whether the data is cached. In the resulting query plan there are still a Seq Scan, a Nested Loop and a Hash Join that account for most of the cost.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In detail: I created the following function:

CREATE OR REPLACE FUNCTION get_gene_id(TEXT) RETURNS INT AS
    'SELECT gene_id FROM gene WHERE gene_name = $1'
LANGUAGE SQL IMMUTABLE STRICT;

Then I ran some queries with EXPLAIN/EXPLAIN ANALYZE. For example:

1. The old query, leaving out the table gene and setting gene_occurrences.gene_id to a certain gene_id, via the function get_gene_id. (This is the query you suggested, Manfred.)

EXPLAIN ANALYZE
SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
FROM (SELECT DISTINCT disease.disease_name, disease_occurrences.sentence_id
      FROM disease, gene_occurrences, disease_occurrences
      WHERE gene_occurrences.sentence_id = disease_occurrences.sentence_id
        AND gene_occurrences.gene_id = get_gene_id('igm')
        AND disease.disease_id = disease_occurrences.disease_id) AS tmp
GROUP BY tmp.disease_name
ORDER BY cnt DESC;

'igm' occurs about 54,000 times in gene_occurrences.
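As an aside, here is the quick check I use to convince myself that the IMMUTABLE function really is folded into a constant at plan time (a minimal sketch using the table and function defined above):

```sql
-- Because get_gene_id() is declared IMMUTABLE, the planner may evaluate it
-- once during planning and substitute the result as a literal:
EXPLAIN
SELECT count(*)
FROM gene_occurrences
WHERE gene_id = get_gene_id('igm');
-- If the folding happens, the plan shows a literal comparison such as
--   Index Cond: (gene_id = 70764)
-- rather than a per-row function call.
```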
                                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=308065.26..308069.67 rows=882 width=57) (actual time=53107.46..53108.13 rows=2326 loops=1)
   Sort Key: count(disease_name)
   ->  Aggregate  (cost=307846.61..307978.94 rows=882 width=57) (actual time=53011.97..53100.58 rows=2326 loops=1)
         ->  Group  (cost=307846.61..307934.83 rows=8822 width=57) (actual time=53011.94..53079.74 rows=16711 loops=1)
               ->  Sort  (cost=307846.61..307890.72 rows=8822 width=57) (actual time=53011.93..53020.32 rows=16711 loops=1)
                     Sort Key: disease_name
                     ->  Subquery Scan tmp  (cost=305367.08..306690.35 rows=8822 width=57) (actual time=52877.87..52958.72 rows=16711 loops=1)
                           ->  Unique  (cost=305367.08..306690.35 rows=8822 width=57) (actual time=52877.85..52915.20 rows=16711 loops=1)
                                 ->  Sort  (cost=305367.08..305808.17 rows=88218 width=57) (actual time=52877.85..52886.47 rows=16711 loops=1)
                                       Sort Key: disease.disease_name, disease_occurrences.sentence_id
                                       ->  Hash Join  (cost=4610.17..290873.90 rows=88218 width=57) (actual time=388.53..52752.92 rows=16711 loops=1)
                                             Hash Cond: ("outer".disease_id = "inner".disease_id)
                                             ->  Nested Loop  (cost=0.00..282735.01 rows=88218 width=20) (actual time=0.25..52184.26 rows=16711 loops=1)
                                                   ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..57778.26 rows=54692 width=8) (actual time=0.07..17455.52 rows=54677 loops=1)
                                                         Index Cond: (gene_id = 70764)
                                                   ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..4.09 rows=2 width=12) (actual time=0.63..0.63 rows=0 loops=54677)
                                                         Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
                                             ->  Hash  (cost=2500.23..2500.23 rows=129923 width=37) (actual time=387.45..387.45 rows=0 loops=1)
                                                   ->  Seq Scan on disease  (cost=0.00..2500.23 rows=129923 width=37) (actual time=0.02..207.71 rows=129923 loops=1)
 Total runtime: 53118.81 msec
(20 rows)

What takes up most of the runtime? The Nested Loop, but is it the join of disease and disease_occurrences, or rather the join of the two occurrences tables? I am not sure which rows in the EXPLAIN output belong together.

The cost for 'igg' is higher: estimated cost reported by EXPLAIN: 584729.76; actual runtime: 693210.99 msec. The query plan is the same, and again the Nested Loop takes up most of the runtime:

->  Nested Loop  (cost=0.00..538119.44 rows=176211 width=20) (actual time=0.28..691474.74 rows=30513 loops=1)

2. The new query, with the same changes (gene left out, subselect replaced with get_gene_id):

EXPLAIN ANALYZE
SELECT disease.disease_name, count(disease.disease_name) AS cnt
FROM ((SELECT gene_occurrences.sentence_id
       FROM gene_occurrences
       WHERE gene_occurrences.gene_id = get_gene_id('csf')) AS tmp
      JOIN disease_occurrences USING (sentence_id)) AS tmp2
NATURAL JOIN disease
GROUP BY disease.disease_name
ORDER BY cnt DESC;

'csf' occurs about 55,000 times in gene_occurrences.

                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=323834.95..323881.54 rows=9318 width=57) (actual time=146975.89..146976.60 rows=2383 loops=1)
   Sort Key: count(disease.disease_name)
   ->  Aggregate  (cost=321208.63..322606.31 rows=9318 width=57) (actual time=146840.89..146968.58 rows=2383 loops=1)
         ->  Group  (cost=321208.63..322140.42 rows=93179 width=57) (actual time=146840.87..146941.60 rows=24059 loops=1)
               ->  Sort  (cost=321208.63..321674.53 rows=93179 width=57) (actual time=146840.85..146852.92 rows=24059 loops=1)
                     Sort Key: disease.disease_name
                     ->  Hash Join  (cost=4485.78..305826.96 rows=93179 width=57) (actual time=544.79..146651.05 rows=24059 loops=1)
                           Hash Cond: ("outer".disease_id = "inner".disease_id)
                           ->  Nested Loop  (cost=0.00..297614.04 rows=93179 width=20) (actual time=105.85..145936.47 rows=24059 loops=1)
                                 ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..60007.96 rows=57768 width=8) (actual time=41.86..47116.68 rows=55752 loops=1)
                                       Index Cond: (gene_id = 29877)
                                 ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..4.09 rows=2 width=12) (actual time=1.76..1.77 rows=0 loops=55752)
                                       Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
                           ->  Hash  (cost=2500.23..2500.23 rows=129923 width=37) (actual time=438.16..438.16 rows=0 loops=1)
                                 ->  Seq Scan on disease  (cost=0.00..2500.23 rows=129923 width=37) (actual time=0.02..236.78 rows=129923 loops=1)
 Total runtime: 146986.12 msec
(16 rows)

This query is obviously not as good as the old one, though I don't understand why the explicit joins fare worse than what the optimizer chooses on its own. There is still the Nested Loop that takes up the biggest part.

When I set enable_nestloop to false, EXPLAIN outputs this plan:

                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2146887.90..2146934.49 rows=9318 width=57)
   Sort Key: count(disease.disease_name)
   ->  Aggregate  (cost=2144261.59..2145659.27 rows=9318 width=57)
         ->  Group  (cost=2144261.59..2145193.37 rows=93179 width=57)
               ->  Sort  (cost=2144261.59..2144727.48 rows=93179 width=57)
                     Sort Key: disease.disease_name
                     ->  Merge Join  (cost=2122513.18..2128879.92 rows=93179 width=57)
                           Merge Cond: ("outer".disease_id = "inner".disease_id)
                           ->  Index Scan using disease_pkey on disease  (cost=0.00..3388.03 rows=129923 width=37)
                           ->  Sort  (cost=2122513.18..2122979.08 rows=93179 width=20)
                                 Sort Key: disease_occurrences.disease_id
                                 ->  Merge Join  (cost=69145.63..2107131.52 rows=93179 width=20)
                                       Merge Cond: ("outer".sentence_id = "inner".sentence_id)
                                       ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..1960817.45 rows=15079045 width=12)
                                       ->  Sort  (cost=69145.63..69434.47 rows=57768 width=8)
                                             Sort Key: gene_occurrences.sentence_id
                                             ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..60007.96 rows=57768 width=8)
                                                   Index Cond: (gene_id = 29877)
(18 rows)

Most of the runtime is used up by the index scan that joins the two occurrences tables, while the index scan for joining disease and disease_occurrences is fast.

At the moment my settings concerning the query planner are:

effective_cache_size = 80000     # typically 8KB each, default 1000
random_page_cost = 1.5           # units are one sequential page fetch cost
cpu_tuple_cost = 0.01            # (same), default 0.01
cpu_index_tuple_cost = 0.00001   # (same), default 0.001
cpu_operator_cost = 0.005        # (same), default 0.0025

I am still having trouble reading the output of EXPLAIN, especially knowing which rows belong together and which strategy applies to which join. Is there documentation that describes the format of EXPLAIN, other than the section in the main manual that comes with the PostgreSQL installation? Just a short explanation or example of how to interpret the indents and arrows.

Thank you for your help!
Chantal
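P.S. My tentative reading of the indentation, for anyone who can confirm or correct it: each '->' node delivers its rows to the nearest node above it with less indentation, and lines without '->' (Sort Key, Hash Cond, Index Cond) describe the node directly above them. For the first plan that would give, schematically:

```
Hash Join                                    <- combines the two inputs below
  -> Nested Loop                             <- outer input
       -> Index Scan on gene_occurrences        (driving scan)
       -> Index Scan on disease_occurrences     (probed once per outer row)
  -> Hash                                    <- inner input, built once
       -> Seq Scan on disease
```

So the Nested Loop would be the join of the two occurrences tables, and the Hash Join would then add disease.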