Thread: cost and actual time
hello all,

I am still fiddling around with my "big" database.

System: RAM: 2 GB; CPU: 1.6 GHz (cache: 256 KB); single disk: 120 GB :-(

I have a query that joins two relatively large tables (10-15 million rows), or parts of these tables (explain analyze: rows=46849), respectively.

Long story short: the overall cost estimated by explain is:
  cost=6926.59..6926.60
while the actual time from explain analyze is:
  actual time=275461.91..275462.44

Most of it is consumed by a nested loop (surprise!). This is the respective output:

Sort Key: disease.disease_name, disease_occurrences.sentence_id
-> Nested Loop (cost=0.00..6922.38 rows=98 width=64) (actual time=61.49..275047.46 rows=18910 loops=1)
   -> Nested Loop (cost=0.00..6333.23 rows=98 width=28) (actual time=61.42..274313.87 rows=18910 loops=1)
      -> Nested Loop (cost=0.00..5894.04 rows=64 width=16) (actual time=32.00..120617.26 rows=46849 loops=1)

I tried to tweak the conf settings, but I think I have already reached quite good values for shared buffers and sort mem. The database is vacuum full analyzed, and the indexes seem fine.

Could one of you smart guys point me in a direction I might not have considered? I know the hardware is the bare minimum. Nevertheless, if you have suggestions on what exactly to add to the hardware to speed the database up (more RAM or more disks, maybe even a RAID), that would be a good argument for my boss.

Thank you a lot
Chantal
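As a quick sanity check before tuning further, the settings mentioned above and the statistics refresh can be verified directly from psql. This is only a sketch; it assumes the PostgreSQL 7.x parameter names (sort_mem, effective_cache_size) that appear later in the thread:

-- Inspect the planner-relevant settings currently in effect
-- (parameter names as in PostgreSQL 7.x):
SHOW shared_buffers;
SHOW sort_mem;
SHOW effective_cache_size;

-- Refresh planner statistics; FULL also compacts the tables but takes
-- exclusive locks, so plain VACUUM ANALYZE is usually enough for a
-- read-only database.
VACUUM FULL ANALYZE;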
Chantal,

> Sort Key: disease.disease_name, disease_occurrences.sentence_id
> -> Nested Loop (cost=0.00..6922.38 rows=98 width=64) (actual
>    time=61.49..275047.46 rows=18910 loops=1)
>    -> Nested Loop (cost=0.00..6333.23 rows=98 width=28) (actual
>       time=61.42..274313.87 rows=18910 loops=1)
>       -> Nested Loop (cost=0.00..5894.04 rows=64 width=16) (actual
>          time=32.00..120617.26 rows=46849 loops=1)
>
> I tried to tweak the conf settings, but I think I already reached
> quite a good value concerning shared buffers and sort mem. the
> database is vacuum full analyzed. indexes seem fine.

Are you *sure* that you've vacuum analyzed recently? The planner above is choosing a bad plan because its row estimates are way off ... if the subquery were actually returning 98 rows, the plan above would make sense ... but with 18,000 rows being returned, a Nested Loop is suicidal.

Perhaps you could post the full text of the query? If some of your criteria are coming from volatile functions, that could explain why the planner is so far off ...

-Josh Berkus
hello Josh,

thank you for your fast answer. (I had some days off.) This posting is quite long, I apologize, but I wanted to provide enough information to outline the problem.

I did some VACUUM ANALYZE runs on all 4 tables concerned (gene, disease, gene_occurrences, disease_occurrences) to be sure the planner is up to date - but that did not reduce the difference between the estimated and the actual number of resulting rows.

I changed default_statistics_target to 1000 (default 10). The estimate goes up to 102 rows, which is a little more than before and still far away from the actual result. effective_cache_size is at 80000. To be sure I didn't make things worse, I checked again with default_statistics_target set to 10 and an effective_cache_size of 1000 (and ran VACUUM ANALYZE afterwards): the estimate is 92 rows, so there's not a really big difference.

I wonder whether I can set some geqo or planner settings in postgresql.conf to make the planner estimate better? The database is exclusively for reading, so it's ok if the time for analyzing the tables increases.

The query I am testing with is:

EXPLAIN ANALYZE
SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
FROM
  (SELECT DISTINCT disease.disease_name, disease_occurrences.sentence_id
   FROM gene, disease, gene_occurrences, disease_occurrences
   WHERE gene.gene_name='igg'
     AND gene_occurrences.sentence_id=disease_occurrences.sentence_id
     AND gene.gene_id=gene_occurrences.gene_id
     AND disease.disease_id=disease_occurrences.disease_id) AS tmp
GROUP BY tmp.disease_name
ORDER BY cnt DESC;

Row counts are:
  'gene': 164657
  'disease': 129923
  'gene_occurrences': 10484141
  'disease_occurrences': 15079045

The gene_id for 'igg' occurs 110637 times in gene_occurrences; it is the most frequent. The indexes on gene_occurrences(sentence_id) and disease_occurrences(sentence_id) are clustered.

I have an alternative query which I am testing to see whether it is better than the first one:

EXPLAIN ANALYZE
SELECT disease.disease_name, count(disease.disease_name) AS cnt
FROM
  ((SELECT gene_occurrences.sentence_id
    FROM gene_occurrences
    WHERE gene_occurrences.gene_id=(SELECT gene.gene_id FROM gene
                                    WHERE gene.gene_name='igg')) AS tmp
   JOIN disease_occurrences USING (sentence_id)) AS tmp2
  NATURAL JOIN disease
GROUP BY disease.disease_name
ORDER BY cnt DESC;

The cost estimated by the planner is much higher, so I thought this query was worse than the first one. However - maybe it's just more accurate?
This is its explain output (default settings for default_statistics_target during VACUUM ANALYZE):

--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=126690.02..126691.47 rows=581 width=57) (actual time=8066.05..8067.05 rows=3364 loops=1)
   Sort Key: count(disease.disease_name)
   InitPlan
     ->  Index Scan using gene_uni on gene  (cost=0.00..5.26 rows=1 width=4) (actual time=0.19..0.20 rows=1 loops=1)
           Index Cond: (gene_name = 'igg'::text)
   ->  Aggregate  (cost=126619.79..126663.35 rows=581 width=57) (actual time=7894.33..8055.43 rows=3364 loops=1)
         ->  Group  (cost=126619.79..126648.83 rows=5809 width=57) (actual time=7894.31..8020.00 rows=30513 loops=1)
               ->  Sort  (cost=126619.79..126634.31 rows=5809 width=57) (actual time=7894.30..7910.08 rows=30513 loops=1)
                     Sort Key: disease.disease_name
                     ->  Merge Join  (cost=119314.93..126256.64 rows=5809 width=57) (actual time=6723.92..7732.94 rows=30513 loops=1)
                           Merge Cond: ("outer".disease_id = "inner".disease_id)
                           ->  Index Scan using disease_pkey on disease  (cost=0.00..6519.14 rows=129923 width=37) (actual time=0.04..742.20 rows=129872 loops=1)
                           ->  Sort  (cost=119314.93..119329.45 rows=5809 width=20) (actual time=6723.74..6740.24 rows=30513 loops=1)
                                 Sort Key: disease_occurrences.disease_id
                                 ->  Nested Loop  (cost=0.00..118951.78 rows=5809 width=20) (actual time=1.19..6558.67 rows=30513 loops=1)
                                       ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..15700.31 rows=4039 width=8) (actual time=0.36..1404.64 rows=110637 loops=1)
                                             Index Cond: (gene_id = $0)
                                       ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..25.47 rows=8 width=12) (actual time=0.04..0.04 rows=0 loops=110637)
                                             Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
 Total runtime: 8086.87 msec
(20 rows)

Strangely, the estimation is far worse after running VACUUM ANALYZE again with a default_statistics_target of 1000:

--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=12521.37..12521.61 rows=96 width=57) (actual time=124967.47..124968.47 rows=3364 loops=1)
   Sort Key: count(disease.disease_name)
   InitPlan
     ->  Index Scan using gene_uni on gene  (cost=0.00..5.26 rows=1 width=4) (actual time=20.27..20.28 rows=1 loops=1)
           Index Cond: (gene_name = 'igg'::text)
   ->  Aggregate  (cost=12510.99..12518.20 rows=96 width=57) (actual time=124788.71..124956.77 rows=3364 loops=1)
         ->  Group  (cost=12510.99..12515.80 rows=961 width=57) (actual time=124788.68..124920.10 rows=30513 loops=1)
               ->  Sort  (cost=12510.99..12513.39 rows=961 width=57) (actual time=124788.66..124804.74 rows=30513 loops=1)
                     Sort Key: disease.disease_name
                     ->  Nested Loop  (cost=0.00..12463.35 rows=961 width=57) (actual time=164.11..124529.76 rows=30513 loops=1)
                           ->  Nested Loop  (cost=0.00..6671.06 rows=961 width=20) (actual time=148.34..120295.52 rows=30513 loops=1)
                                 ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..2407.09 rows=602 width=8) (actual time=20.63..1613.99 rows=110637 loops=1)
                                       Index Cond: (gene_id = $0)
                                 ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..7.06 rows=2 width=12) (actual time=1.07..1.07 rows=0 loops=110637)
                                       Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
                           ->  Index Scan using disease_pkey on disease  (cost=0.00..6.01 rows=1 width=37) (actual time=0.13..0.13 rows=1 loops=30513)
                                 Index Cond: ("outer".disease_id = disease.disease_id)
 Total runtime: 124981.15 msec
(18 rows)

There again is the estimation of 961 rows and the decision to choose a Nested Loop, while the actual result includes 30513 rows.

Thank you for taking the time to read my postings!
Chantal

Josh Berkus wrote:
> Chantal,
>
>> Sort Key: disease.disease_name, disease_occurrences.sentence_id
>> -> Nested Loop (cost=0.00..6922.38 rows=98 width=64) (actual
>> time=61.49..275047.46 rows=18910 loops=1)
>>    -> Nested Loop (cost=0.00..6333.23 rows=98 width=28) (actual
>>    time=61.42..274313.87 rows=18910 loops=1)
>>       -> Nested Loop (cost=0.00..5894.04 rows=64 width=16) (actual
>>       time=32.00..120617.26 rows=46849 loops=1)
>>
>> I tried to tweak the conf settings, but I think I already reached
>> quite a good value concerning shared buffers and sort mem. the
>> database is vacuum full analyzed. indexes seem fine.
>
> You *sure* that you've vacuum analyzed recently? The planner above is
> choosing a bad plan because its row estimates are way off ... if the
> subquery was actually returning 98 rows, the plan above would make
> sense ... but with 18,000 rows being returned, a Nested Loop is
> suicidal.
>
> Perhaps you could post the full text of the query? If some of your
> criteria are coming from volatile functions, then that could explain
> why the planner is so far off ...
>
> -Josh Berkus
Chantal Ackermann <chantal.ackermann@biomax.de> writes:
> the gene_id for 'igg' occurs 110637 times in gene_occurrences, it is
> the most frequent.

I think the problem here is that the planner doesn't know that (and probably can't without some kind of cross-table statistics apparatus). It's generating a plan based on the average frequency of gene_ids, which is a loser for this outlier.

Probably the most convenient way to do better is to structure things so that the reduction from gene name to gene_id is done before the planner starts to develop a plan. Instead of joining to gene, consider this:

create function get_gene_id (text) returns int as   -- adjust types as needed
'select gene_id from gene where gene_name = $1'
language sql immutable strict;
-- in 7.2, instead say "with (isCachable, isStrict)"

EXPLAIN ANALYZE
SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
FROM
  (SELECT DISTINCT disease.disease_name, disease_occurrences.sentence_id
   FROM disease, gene_occurrences, disease_occurrences
   WHERE gene_occurrences.sentence_id=disease_occurrences.sentence_id
     AND get_gene_id('igg')=gene_occurrences.gene_id
     AND disease.disease_id=disease_occurrences.disease_id) AS tmp
GROUP BY tmp.disease_name
ORDER BY cnt DESC;

Now get_gene_id() isn't really immutable (unless you never change the gene table), but you have to lie and pretend that it is, so that the function call will be constant-folded during planner startup. The planner will then see something like gene_occurrences.gene_id = 42, and it will have a much better shot at determining the number of rows this matches.

regards, tom lane
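A quick way to check that the constant-folding described above actually happens is to look at the index condition in a plain EXPLAIN: the plan should show a literal gene_id rather than the function call. Sketch only; the exact literal and the chosen scan depend on the data:

EXPLAIN
SELECT count(*)
FROM gene_occurrences
WHERE gene_id = get_gene_id('igg');
-- expected (roughly): Index Cond: (gene_id = <integer literal>)
-- if the plan still shows get_gene_id('igg'::text), the function was not
-- declared IMMUTABLE (isCachable on 7.2) and has not been folded.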
On Mon, 17 Feb 2003 11:51:56 +0100, Chantal Ackermann <chantal.ackermann@biomax.de> wrote:
> the gene_id for 'igg' occurs 110637 times in gene_occurrences, it is
> the most frequent.

Chantal, could you try

EXPLAIN ANALYZE
SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
FROM
  (SELECT DISTINCT d.disease_name, d_o.sentence_id
   FROM disease d, gene_occurrences g_o, disease_occurrences d_o
   WHERE g_o.sentence_id=d_o.sentence_id
     AND g_o.gene_id=4711
     AND d.disease_id=d_o.disease_id) AS tmp
GROUP BY tmp.disease_name
ORDER BY cnt DESC;

replacing 4711 with the result of

SELECT gene_id FROM gene WHERE gene_name='igg'

and show us the plan you get?

Servus
Manfred
hello Manfred, hello Josh, hello Tom,

I followed your advice. Using the new query with explicit joins, combined with the function that retrieves the gene_id, the estimated row count is now far more realistic. Still, the same query using different gene names takes sometimes less than a second, sometimes several minutes, obviously due to (not) caching. In the resulting query plans there are still a Seq Scan, a Nested Loop and a Hash Join that take up most of the cost.

In detail: I have created the following function:

CREATE OR REPLACE FUNCTION get_gene_id(TEXT) RETURNS INT AS
'SELECT gene_id FROM gene WHERE gene_name = $1'
LANGUAGE SQL IMMUTABLE STRICT;

Then I ran some queries with EXPLAIN / EXPLAIN ANALYZE. For example:

1. The old query, leaving out the table gene and setting gene_occurrences.gene_id to a certain gene_id, or to the function get_gene_id, respectively. (This is the query you suggested, Manfred.)

EXPLAIN ANALYZE
SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
FROM
  (SELECT DISTINCT disease.disease_name, disease_occurrences.sentence_id
   FROM disease, gene_occurrences, disease_occurrences
   WHERE gene_occurrences.sentence_id=disease_occurrences.sentence_id
     AND gene_occurrences.gene_id=get_gene_id('igm')
     AND disease.disease_id=disease_occurrences.disease_id) AS tmp
GROUP BY tmp.disease_name
ORDER BY cnt DESC;

'igm' occurs about 54,000 times in gene_occurrences.

                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=308065.26..308069.67 rows=882 width=57) (actual time=53107.46..53108.13 rows=2326 loops=1)
   Sort Key: count(disease_name)
   ->  Aggregate  (cost=307846.61..307978.94 rows=882 width=57) (actual time=53011.97..53100.58 rows=2326 loops=1)
         ->  Group  (cost=307846.61..307934.83 rows=8822 width=57) (actual time=53011.94..53079.74 rows=16711 loops=1)
               ->  Sort  (cost=307846.61..307890.72 rows=8822 width=57) (actual time=53011.93..53020.32 rows=16711 loops=1)
                     Sort Key: disease_name
                     ->  Subquery Scan tmp  (cost=305367.08..306690.35 rows=8822 width=57) (actual time=52877.87..52958.72 rows=16711 loops=1)
                           ->  Unique  (cost=305367.08..306690.35 rows=8822 width=57) (actual time=52877.85..52915.20 rows=16711 loops=1)
                                 ->  Sort  (cost=305367.08..305808.17 rows=88218 width=57) (actual time=52877.85..52886.47 rows=16711 loops=1)
                                       Sort Key: disease.disease_name, disease_occurrences.sentence_id
                                       ->  Hash Join  (cost=4610.17..290873.90 rows=88218 width=57) (actual time=388.53..52752.92 rows=16711 loops=1)
                                             Hash Cond: ("outer".disease_id = "inner".disease_id)
                                             ->  Nested Loop  (cost=0.00..282735.01 rows=88218 width=20) (actual time=0.25..52184.26 rows=16711 loops=1)
                                                   ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..57778.26 rows=54692 width=8) (actual time=0.07..17455.52 rows=54677 loops=1)
                                                         Index Cond: (gene_id = 70764)
                                                   ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..4.09 rows=2 width=12) (actual time=0.63..0.63 rows=0 loops=54677)
                                                         Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
                                             ->  Hash  (cost=2500.23..2500.23 rows=129923 width=37) (actual time=387.45..387.45 rows=0 loops=1)
                                                   ->  Seq Scan on disease  (cost=0.00..2500.23 rows=129923 width=37) (actual time=0.02..207.71 rows=129923 loops=1)
 Total runtime: 53118.81 msec
(20 rows)

What takes up most of the runtime: the Nested Loop for the join of disease and disease_occurrences, or rather for joining both occurrences tables? I'm not sure which rows belong together in the explain output.
The cost for 'igg' is higher: estimation of pages by explain: 584729.76; actual runtime: 693210.99 msec. The query plan is the same. The Nested Loop takes up most of the runtime:

  -> Nested Loop (cost=0.00..538119.44 rows=176211 width=20) (actual time=0.28..691474.74 rows=30513 loops=1)

2. The new query, with the same changes (gene left out, subselect replaced with get_gene_id):

EXPLAIN ANALYZE
SELECT disease.disease_name, count(disease.disease_name) AS cnt
FROM
  ((SELECT gene_occurrences.sentence_id
    FROM gene_occurrences
    WHERE gene_occurrences.gene_id=get_gene_id('csf')) AS tmp
   JOIN disease_occurrences USING (sentence_id)) AS tmp2
  NATURAL JOIN disease
GROUP BY disease.disease_name
ORDER BY cnt DESC;

'csf' occurs about 55,000 times in gene_occurrences.

                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=323834.95..323881.54 rows=9318 width=57) (actual time=146975.89..146976.60 rows=2383 loops=1)
   Sort Key: count(disease.disease_name)
   ->  Aggregate  (cost=321208.63..322606.31 rows=9318 width=57) (actual time=146840.89..146968.58 rows=2383 loops=1)
         ->  Group  (cost=321208.63..322140.42 rows=93179 width=57) (actual time=146840.87..146941.60 rows=24059 loops=1)
               ->  Sort  (cost=321208.63..321674.53 rows=93179 width=57) (actual time=146840.85..146852.92 rows=24059 loops=1)
                     Sort Key: disease.disease_name
                     ->  Hash Join  (cost=4485.78..305826.96 rows=93179 width=57) (actual time=544.79..146651.05 rows=24059 loops=1)
                           Hash Cond: ("outer".disease_id = "inner".disease_id)
                           ->  Nested Loop  (cost=0.00..297614.04 rows=93179 width=20) (actual time=105.85..145936.47 rows=24059 loops=1)
                                 ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..60007.96 rows=57768 width=8) (actual time=41.86..47116.68 rows=55752 loops=1)
                                       Index Cond: (gene_id = 29877)
                                 ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..4.09 rows=2 width=12) (actual time=1.76..1.77 rows=0 loops=55752)
                                       Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
                           ->  Hash  (cost=2500.23..2500.23 rows=129923 width=37) (actual time=438.16..438.16 rows=0 loops=1)
                                 ->  Seq Scan on disease  (cost=0.00..2500.23 rows=129923 width=37) (actual time=0.02..236.78 rows=129923 loops=1)
 Total runtime: 146986.12 msec
(16 rows)

This query is obviously not as good as the old one, though I don't understand why the explicit joins are worse than what the optimizer chooses. There is still the Nested Loop that takes up the biggest part.
When I set enable_nestloop to false, explain outputs this plan:

                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2146887.90..2146934.49 rows=9318 width=57)
   Sort Key: count(disease.disease_name)
   ->  Aggregate  (cost=2144261.59..2145659.27 rows=9318 width=57)
         ->  Group  (cost=2144261.59..2145193.37 rows=93179 width=57)
               ->  Sort  (cost=2144261.59..2144727.48 rows=93179 width=57)
                     Sort Key: disease.disease_name
                     ->  Merge Join  (cost=2122513.18..2128879.92 rows=93179 width=57)
                           Merge Cond: ("outer".disease_id = "inner".disease_id)
                           ->  Index Scan using disease_pkey on disease  (cost=0.00..3388.03 rows=129923 width=37)
                           ->  Sort  (cost=2122513.18..2122979.08 rows=93179 width=20)
                                 Sort Key: disease_occurrences.disease_id
                                 ->  Merge Join  (cost=69145.63..2107131.52 rows=93179 width=20)
                                       Merge Cond: ("outer".sentence_id = "inner".sentence_id)
                                       ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..1960817.45 rows=15079045 width=12)
                                       ->  Sort  (cost=69145.63..69434.47 rows=57768 width=8)
                                             Sort Key: gene_occurrences.sentence_id
                                             ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..60007.96 rows=57768 width=8)
                                                   Index Cond: (gene_id = 29877)
(18 rows)

Most of the runtime is used up by the index scan that joins the occurrences tables, while the index scan for joining disease and disease_occurrences is fast.

At the moment my settings concerning the query planner are:

effective_cache_size = 80000    # typically 8KB each, default 1000
random_page_cost = 1.5          # units are one sequential page fetch cost
cpu_tuple_cost = 0.01           # (same), default 0.01
cpu_index_tuple_cost = 0.00001  # (same), default 0.001
cpu_operator_cost = 0.005       # (same), default 0.0025

I am still having problems reading the output of explain, especially knowing which rows belong together and which strategy applies to which join. Is there some documentation that describes the format of explain, other than the one in the main manual that comes with the Postgres installation? Just some short explanation or example of how to interpret the indents and arrows.

Thank you for your help!
Chantal
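On reading the indents: each "->" line in EXPLAIN output is a plan node, its inputs are the nodes indented one level beneath it, and loops=N on an inner node means that node was executed N times, once per row produced by its sibling. A sketch based on the shape of the plans above (the annotations are added here and are not part of PostgreSQL's output):

Hash Join                                     -- top node; combines the two inputs below it
  -> Nested Loop                              -- outer input of the Hash Join
       -> Index Scan on gene_occurrences      -- outer input of the Nested Loop
       -> Index Scan on disease_occurrences   -- inner input; loops = rows of the outer scan
  -> Hash                                     -- inner input of the Hash Join
       -> Seq Scan on disease                 -- rows loaded into the in-memory hash table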
On Tue, 18 Feb 2003 11:28:40 +0100, Chantal Ackermann <chantal.ackermann@biomax.de> wrote:
>1. the old query, leaving out the table gene and setting
>gene_occurrences.gene_id to a certain gene_id, or the function
>get_gene_id, respectively. (This is the query you suggested, Manfred.)

This was Tom's suggestion. I might have ended up there in a day or two :-)

>What takes up most of the runtime: the Nested Loop for the join of
>disease and disease_occurrences, or rather for joining both occurrences
>tables? I'm not sure which rows belong together in the explain output.

... for joining both occurrences: the "-> Nested Loop" takes two tables (the "-> Index Scans") as input and produces one table as output, which is again used as input for the "-> Hash Join" above it.

>2. The new query, with the same changes (gene left out, subselect
>replaced with get_gene_id):
>
>EXPLAIN ANALYZE
>SELECT disease.disease_name, count(disease.disease_name) AS cnt
>FROM
>  ((SELECT gene_occurrences.sentence_id
>    FROM gene_occurrences
>    WHERE gene_occurrences.gene_id=get_gene_id('csf')) AS tmp
>   JOIN disease_occurrences USING (sentence_id)) AS tmp2
>  NATURAL JOIN disease
>GROUP BY disease.disease_name
>ORDER BY cnt DESC;

There is no DISTINCT here. This is equivalent to your first query iff the following unique constraints are true:

  (gene_id, sentence_id) in gene_occurrences
  (disease_id, sentence_id) in disease_occurrences
  (disease_id) in disease

If they are, you don't need a sub-select (unless I'm missing something, please double-check):

EXPLAIN ANALYZE
SELECT disease.disease_name, count(*) AS cnt
FROM disease, gene_occurrences, disease_occurrences
WHERE gene_occurrences.sentence_id=disease_occurrences.sentence_id
  AND gene_occurrences.gene_id=get_gene_id('igm')
  AND disease.disease_id=disease_occurrences.disease_id
GROUP BY disease.disease_name
ORDER BY cnt DESC;

Anyway, your problem boils down to

EXPLAIN ANALYZE
SELECT d.disease_id, d.sentence_id
FROM gene_occurrences g, disease_occurrences d
WHERE g.sentence_id = d.sentence_id
  AND g.gene_id = 'some constant value';

Play with the enable_xxxx settings to find out which join method provides the best performance for various gene_ids. Then we can start to fiddle with run-time parameters to help the optimizer choose the right plan.

>Most of the runtime is used up by the index scan that joins the occurrences
>tables [...]
>
>At the moment my settings concerning the query planner are:
>
>effective_cache_size = 80000    # typically 8KB each, default 1000
>random_page_cost = 1.5          # units are one sequential page fetch cost

Usually you set a low random_page_cost value (the default is 4) if you want to favour index scans where the optimizer tends to use sequential scans. Was this your intention?

>cpu_tuple_cost = 0.01           # (same), default 0.01
>cpu_index_tuple_cost = 0.00001  # (same), default 0.001
>cpu_operator_cost = 0.005       # (same), default 0.0025

Just out of curiosity: are these settings based on prior experience?

Servus
Manfred
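An illustrative pattern for the enable_xxxx experiment suggested above, not a recommendation from the thread. The gene_id literal is just an example; each variant should be run twice so that the second run hits a warm cache:

-- Force the planner away from a nested loop for this one comparison run:
SET enable_nestloop = off;
EXPLAIN ANALYZE
SELECT d.disease_id, d.sentence_id
FROM gene_occurrences g, disease_occurrences d
WHERE g.sentence_id = d.sentence_id
  AND g.gene_id = 29877;
-- Restore the default before trying the next join method
-- (e.g. SET enable_mergejoin = off; for the following run):
RESET enable_nestloop;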
hello Manfred,

> ... for joining both occurrences: The "-> Nested Loop" takes two
> tables (the "-> Index Scans") as input and produces one table as
> output which is again used as input for the "-> Hash Join" above it.

As I am testing with the most frequent gene names (= the gene_ids that are the most frequent in the occurrences tables), this is a very expensive join. Whenever I try a less frequent gene_id the runtime is shorter (though I haven't tested specifically with less frequent gene_ids yet; my focus is on making the searches for the most frequent genes faster, as these are probably the ones that are searched for a lot).

> There is no DISTINCT here. This is equivalent to your first query, iff
> the following unique constraints are true:
>   (gene_id, sentence_id) in gene_occurrences
>   (disease_id, sentence_id) in disease_occurrences
>   (disease_id) in disease
>
> If they are, you don't need a sub-select (unless I'm missing
> something, please double-check):

Yeah, I noticed the difference between the two queries. Actually, I am afraid of dropping the DISTINCT because I had results with duplicate rows (though I shall recheck when that is really the case). These are the table declarations and constraints:

relate=# \d gene
        Table "public.gene"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 gene_id     | integer | not null
 gene_name   | text    | not null
 gene_syn_id | integer | not null
Indexes: gene_pkey primary key btree (gene_id),
         gene_name_uni unique btree (gene_name),
         gene_uni unique btree (gene_name, gene_syn_id),
         gene_syn_idx btree (gene_syn_id)

(disease looks the same)

relate_01=# \d gene_occurrences
 Table "public.gene_occurrences"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 sentence_id | bigint  | not null
 gene_id     | integer | not null
 puid        | integer | not null
Indexes: gene_occ_uni unique btree (sentence_id, gene_id),
         gene_occ_id_i btree (gene_id)

relate_01=# \d disease_occurrences
 Table "public.disease_occurrences"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 sentence_id | bigint  | not null
 disease_id  | integer | not null
 puid        | integer | not null
Indexes: disease_occ_uni unique btree (sentence_id, disease_id),
         disease_occ_id_i btree (disease_id)

sentence_id and gene/disease_id are connected in an n:m relation. As sentence_id is the primary key of a table with more than 50 million rows, we decided not to use a serial as primary key but a unique combination of two existing values. As this combination is too long for an ordinary int, we have to use bigint as the type. Is the join therefore so expensive?

We had a primary key occurrence_id on the occurrences tables, but we noticed that we don't use it, so we didn't recreate it in the new database. Is it possible that Postgres could work with it internally?

> Play with enable_xxxx to find out which join method provides the best
> performance for various gene_ids. Then we can start to fiddle with
> run-time parameters to help the optimizer choose the right plan.

this would be VERY helpful! :-)
I played around and this is the result:

EXPLAIN ANALYZE
SELECT d.disease_id, d.sentence_id
FROM gene_occurrences g, disease_occurrences d
WHERE g.sentence_id = d.sentence_id
  AND g.gene_id = get_gene_id([different very frequent gene names]);

choice of the planner:          Nested Loop   Total runtime:  53508.86 msec
set enable_nestloop to false;   Merge Join    Total runtime: 113066.81 msec
set enable_mergejoin to false;  Hash Join     Total runtime: 439344.44 msec

Disabling the hash join results again in a Nested Loop with very high cost but low runtime - I'm not sure if the latter is a consequence of caching. I changed the gene name at every run to avoid caching.

So the Nested Loop is obviously the best way to go?

For comparison, a less frequent gene (it occurs 6717 times in gene_occurrences) produces the following query plan:

                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..41658.69 rows=12119 width=20) (actual time=87.01..19076.62 rows=1371 loops=1)
   ->  Index Scan using gene_occ_id_i on gene_occurrences g  (cost=0.00..10754.08 rows=7514 width=8) (actual time=35.89..10149.14 rows=6717 loops=1)
         Index Cond: (gene_id = 16338)
   ->  Index Scan using disease_occ_uni on disease_occurrences d  (cost=0.00..4.09 rows=2 width=12) (actual time=1.32..1.32 rows=0 loops=6717)
         Index Cond: ("outer".sentence_id = d.sentence_id)
 Total runtime: 19078.48 msec

> Usually you set a low random_page_cost value (the default is 4) if you
> want to favour index scans where the optimizer tends to use sequential
> scans. Was this your intention?

No, not really. I found a posting in the archives where someone suggested reducing this parameter, so I tried it. I don't think it had any perceptible effect.

>> cpu_tuple_cost = 0.01           # (same), default 0.01
>> cpu_index_tuple_cost = 0.00001  # (same), default 0.001
>> cpu_operator_cost = 0.005       # (same), default 0.0025
>
> Just out of curiosity: are these settings based on prior experience?

Nope. Same as above. I changed these variables only two days ago, as far as I recall. Until then I had them at their defaults.

Regards,
Chantal
Chantal,

I'm short of time right now, so just a few quick notes and a request for more information. Next round tomorrow ...

On Wed, 19 Feb 2003 10:38:54 +0100, Chantal Ackermann <chantal.ackermann@biomax.de> wrote:
>yeah, I noticed the difference between the two queries. actually, I am
>afraid of dropping the distinct because I had results with duplicate rows
>(though I shall recheck when this is really the case). These are the
>table declarations and constraints:
[...]

AFAICS there's no way to get duplicates, so no need for DISTINCT.

>we have to use bigint as the type. is the join therefore
>so expensive?

No.

>we had a primary key occurrence_id on the occurrences tables but we
>noticed that we don't use it, so we didn't recreate it in the new
>database. is it possible that postgres could work with it internally?

No.

>Nested Loop   Total runtime:  53508.86 msec
>Merge Join    Total runtime: 113066.81 msec
>Hash Join     Total runtime: 439344.44 msec
>I changed the gene name at every run to avoid caching.

You can't compare the runtimes unless you query for the same data. Either run each query twice to make sure everything is cached, or do something like

  tar cf /dev/null /some/big/directory

before each query to empty your disk cache. BTW, what is your shared_buffers setting?

>disabling the hash join results again in a Nested Loop with very high
>cost but low runtime - I'm not sure if the latter is a consequence of
>caching.

Please send EXPLAIN ANALYZE results for all queries. Send them to me off-list if you think they are too long.

Servus
Manfred
On Wed, 19 Feb 2003 10:38:54 +0100, Chantal Ackermann <chantal.ackermann@biomax.de> wrote:
>Nested Loop:  53508.86 msec
>Merge Join:  113066.81 msec
>Hash Join:   439344.44 msec

Chantal,

you might have reached the limit of what Postgres (or any other database?) can do for you with these data structures. Time for something completely different: try calculating the counts in advance.

CREATE TABLE occ_stat (
    did INT NOT NULL,
    gid INT NOT NULL,
    cnt INT NOT NULL
) WITHOUT OIDS;

CREATE INDEX occ_stat_dg ON occ_stat(did, gid);
CREATE INDEX occ_stat_gd ON occ_stat(gid, did);

There is *no* UNIQUE constraint on (did, gid). You get the numbers you're after by

SELECT did, sum(cnt) AS cnt
FROM occ_stat
WHERE gid = 'whatever'
GROUP BY did
ORDER BY cnt DESC;

occ_stat is initially loaded by

INSERT INTO occ_stat
SELECT did, gid, count(*)
FROM g_o INNER JOIN d_o ON (g_o.sid = d_o.sid)
GROUP BY did, gid;

Doing it in chunks

  WHERE sid BETWEEN a::bigint AND b::bigint

might be faster.

You have to block any INSERT/UPDATE/DELETE activity on d_o and g_o while you do the initial load. If it takes too long, see below for how to do it in the background; hopefully the load task will catch up some day :-)

Keeping occ_stat current:

CREATE RULE d_o_i AS ON INSERT
TO d_o DO (
    INSERT INTO occ_stat
    SELECT NEW.did, g_o.gid, 1
    FROM g_o
    WHERE g_o.sid = NEW.sid);

CREATE RULE d_o_d AS ON DELETE
TO d_o DO (
    INSERT INTO occ_stat
    SELECT OLD.did, g_o.gid, -1
    FROM g_o
    WHERE g_o.sid = OLD.sid);

On UPDATE do both. Create a set of similar rules for g_o.

These rules will create a lot of duplicates on (did, gid) in occ_stat. Updating existing rows and inserting only new combinations might seem obvious, but this method has concurrency problems (cf. the thread "Hard problem with concurrency" on -hackers). So occ_stat calls for reorganisation from time to time:

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
CREATE TEMP TABLE t (did INT, gid INT, cnt INT) WITHOUT OIDS;

INSERT INTO t
SELECT did, gid, sum(cnt)
FROM occ_stat
GROUP BY did, gid
HAVING count(*) > 1;

DELETE FROM occ_stat
WHERE t.did = occ_stat.did
  AND t.gid = occ_stat.gid;

INSERT INTO occ_stat SELECT * FROM t;

DROP TABLE t;
COMMIT;
VACUUM ANALYZE occ_stat;  -- very important!!

Now this should work, but the rules could kill INSERT/UPDATE/DELETE performance. Depending on your rate of modifications you might be forced to push the statistics calculation to the background.

CREATE TABLE d_o_change (
    sid BIGINT NOT NULL,
    did INT NOT NULL,
    cnt INT NOT NULL
) WITHOUT OIDS;

... ON INSERT TO d_o DO (
    INSERT INTO d_o_change VALUES (NEW.sid, NEW.did, 1));

... ON DELETE TO d_o DO (
    INSERT INTO d_o_change VALUES (OLD.sid, OLD.did, -1));

... ON UPDATE TO d_o
    WHERE OLD.sid != NEW.sid OR OLD.did != NEW.did
    DO both

And the same for g_o.

You need a task that periodically scans [dg]_o_change and does ...

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT <any row (or some rows) from x_o_change>;
INSERT INTO occ_stat <see above>;
DELETE <the selected row(s) from x_o_change>;
COMMIT;

Don't forget to VACUUM!

If you invest a little more work, I guess you can combine the reorganisation into the loader task ...

I have no idea whether this approach is better than what you have now. With a high INSERT/UPDATE/DELETE rate it may lead to a complete performance disaster. You have to try ...

Servus
Manfred
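If the occ_stat approach were adopted, the lookup in question would presumably combine it with the get_gene_id() helper from earlier in the thread. A hedged sketch only; the join to disease for the name is added here for illustration, and the did/gid column names follow Manfred's abbreviations:

SELECT d.disease_name, s.cnt
FROM (SELECT did, sum(cnt) AS cnt
      FROM occ_stat
      WHERE gid = get_gene_id('igg')
      GROUP BY did) AS s
JOIN disease d ON d.disease_id = s.did
ORDER BY s.cnt DESC;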
I nominate Manfred for support response award of the week!

Chris

----- Original Message -----
From: "Manfred Koizar" <mkoi-pg@aon.at>
To: "Chantal Ackermann" <chantal.ackermann@biomax.de>
Cc: "Josh Berkus" <josh@agliodbs.com>; <pgsql-performance@postgresql.org>; <tgl@sss.pgh.pa.us>
Sent: Thursday, February 20, 2003 6:00 PM
Subject: Re: [PERFORM] cost and actual time