Re: cost and actual time - Mailing list pgsql-performance
From:           Chantal Ackermann
Subject:        Re: cost and actual time
Date:
Msg-id:         3E5350AE.7080408@biomax.de
In response to: Re: cost and actual time  (Manfred Koizar <mkoi-pg@aon.at>)
Responses:      Re: cost and actual time
                Re: cost and actual time
List:           pgsql-performance
hello Manfred,

> ... for joining both occurrences: The "-> Nested Loop" takes two
> tables (the "-> Index Scans") as input and produces one table as
> output which is again used as input for the "-> Hash Join" above it.

As I am testing with the most frequent gene names (i.e. the gene_ids
that occur most often in the occurrences tables), this is a very
expensive join. Whenever I try a less frequent gene_id the runtime is
shorter (though I haven't tested specifically with less frequent
gene_ids yet; my focus is on making the searches for the most frequent
genes faster, as these are probably the ones that are searched for a
lot).

> There is no DISTINCT here. This is equvalent to your first query, iff
> the following unique constraints are true:
>   (gene_id, sentence_id) in gene_occurrences
>   (disease_id, sentence_id) in disease_occurrences
>   (disease_id) in disease
>
> If they are, you don't need a sub-select (unless I'm missing
> something, please double-check):

Yes, I noticed the difference between the two queries. Actually, I am
afraid of dropping the DISTINCT because I have had results with
duplicate rows (though I shall recheck when that really happens; see
the sketch further down).

These are the table declarations and constraints:

relate=# \d gene
           Table "public.gene"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 gene_id     | integer | not null
 gene_name   | text    | not null
 gene_syn_id | integer | not null
Indexes: gene_pkey primary key btree (gene_id),
         gene_name_uni unique btree (gene_name),
         gene_uni unique btree (gene_name, gene_syn_id),
         gene_syn_idx btree (gene_syn_id)

(disease looks the same)

relate_01=# \d gene_occurrences
     Table "public.gene_occurrences"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 sentence_id | bigint  | not null
 gene_id     | integer | not null
 puid        | integer | not null
Indexes: gene_occ_uni unique btree (sentence_id, gene_id),
         gene_occ_id_i btree (gene_id)

relate_01=# \d disease_occurrences
    Table "public.disease_occurrences"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 sentence_id | bigint  | not null
 disease_id  | integer | not null
 puid        | integer | not null
Indexes: disease_occ_uni unique btree (sentence_id, disease_id),
         disease_occ_id_i btree (disease_id)

sentence_id and gene/disease_id are connected in an n:m relation. As
sentence_id is the primary key of a table with more than 50 million
rows, we decided not to use a serial as primary key but a unique
combination of two existing values. As this combination is too long for
an ordinary int, we have to use bigint as the type. Is the join
therefore so expensive?

We had a primary key occurrence_id on the occurrences tables, but we
noticed that we don't use it, so we didn't recreate it in the new
database. Is it possible that Postgres could work with it internally?
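Coming back to the DISTINCT question: before dropping it I want to
re-check whether the plain join can really produce duplicate rows. A
quick sketch of such a check ('some_gene' is just a placeholder name):

-- any group with count(*) > 1 means the join without DISTINCT
-- returns duplicate (disease_id, sentence_id) pairs
SELECT d.disease_id, d.sentence_id, count(*)
  FROM gene_occurrences g, disease_occurrences d
 WHERE g.sentence_id = d.sentence_id
   AND g.gene_id = get_gene_id('some_gene')   -- placeholder gene name
 GROUP BY d.disease_id, d.sentence_id
HAVING count(*) > 1;

If it returns no rows (which the unique indexes gene_occ_uni and
disease_occ_uni should guarantee), the DISTINCT and the sub-select
should not be needed for this form of the query.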
> Play with enable_xxxx to find out which join method provides the best
> performance for various gene_ids. Then we can start to fiddle with
> run-time parameters to help the optimizer choose the right plan.

This would be VERY helpful! :-)

I played around and this is the result:

EXPLAIN ANALYZE
  SELECT d.disease_id, d.sentence_id
    FROM gene_occurrences g, disease_occurrences d
   WHERE g.sentence_id = d.sentence_id
     AND g.gene_id = get_gene_id([different very frequent gene names]);

choice of the planner:           Nested Loop  Total runtime:  53508.86 msec
set enable_nestloop to false;    Merge Join   Total runtime: 113066.81 msec
set enable_mergejoin to false;   Hash Join    Total runtime: 439344.44 msec

Disabling the hash join as well results again in a Nested Loop with
very high cost but low runtime - I'm not sure whether the latter is a
consequence of caching. I changed the gene name at every run to avoid
caching.

So the Nested Loop is obviously the best way to go?

For comparison, a less frequent gene (it occurs 6717 times in
gene_occurrences) produces the following query plan:

                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..41658.69 rows=12119 width=20) (actual time=87.01..19076.62 rows=1371 loops=1)
   ->  Index Scan using gene_occ_id_i on gene_occurrences g  (cost=0.00..10754.08 rows=7514 width=8) (actual time=35.89..10149.14 rows=6717 loops=1)
         Index Cond: (gene_id = 16338)
   ->  Index Scan using disease_occ_uni on disease_occurrences d  (cost=0.00..4.09 rows=2 width=12) (actual time=1.32..1.32 rows=0 loops=6717)
         Index Cond: ("outer".sentence_id = d.sentence_id)
 Total runtime: 19078.48 msec

> Usually you set a low random_page_cost value (the default is 4) if you
> want to favour index scans where the optimizer tends to use sequential
> scans. Was this your intention?

No, not really. I found a posting in the archives that suggested
reducing this parameter, so I tried it. I don't think it had any
perceptible effect.

>> cpu_tuple_cost = 0.01          # (same), default 0.01
>> cpu_index_tuple_cost = 0.00001 # (same), default 0.001
>> cpu_operator_cost = 0.005      # (same), default 0.0025
>
> Just out of curiosity: Are these settings based on prior experience?

Nope, same as above. I changed these variables only two days ago, as
far as I recall. Until then I had them at their defaults.

Regards,
Chantal
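P.S. For the record, the three runs above were of this form (only a
sketch; 'some_gene' stands for the varying frequent gene names I used,
and only the SET lines changed between runs):

-- run 1: planner's own choice (Nested Loop)
EXPLAIN ANALYZE
  SELECT d.disease_id, d.sentence_id
    FROM gene_occurrences g, disease_occurrences d
   WHERE g.sentence_id = d.sentence_id
     AND g.gene_id = get_gene_id('some_gene');  -- placeholder gene name

-- run 2: same statement after  SET enable_nestloop TO false;   (Merge Join)
-- run 3: additionally          SET enable_mergejoin TO false;  (Hash Join)
-- afterwards:  RESET enable_nestloop;  RESET enable_mergejoin;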