Re: cost and actual time - Mailing list pgsql-performance
From | Manfred Koizar |
---|---|
Subject | Re: cost and actual time |
Date | |
Msg-id | ajo45v4o0op0f3qnqh1le2rs79008fqp5u@4ax.com Whole thread Raw |
In response to | Re: cost and actual time (Chantal Ackermann <chantal.ackermann@biomax.de>) |
Responses |
Re: cost and actual time
|
List | pgsql-performance |
On Tue, 18 Feb 2003 11:28:40 +0100, Chantal Ackermann <chantal.ackermann@biomax.de> wrote: >1. the old query, leaving out the table gene and setting >gene_occurrences.gene_id to a certain gene_id, or the function >get_gene_id, respectively. (This is the query you suggested, Manfred.) This was Tom's suggestion. I might have ended up there in a day or two :-) >What takes up most of the runtime the Nested Loop (for the join of >disease and disease_occurrences, or rather for joining both occurrences >tables? I'm not sure which rows belong together in the explain output). ... for joining both occurrences: The "-> Nested Loop" takes two tables (the "-> Index Scans") as input and produces one table as output which is again used as input for the "-> Hash Join" above it. >2. The new query, same changes (gene left out, subselect replaced with >get_gene_id): > >EXPLAIN ANALYZE > SELECT disease.disease_name, count(disease.disease_name) AS cnt > FROM > ((SELECT gene_occurrences.sentence_id > FROM gene_occurrences > WHERE gene_occurrences.gene_id=get_gene_id('csf')) AS tmp > JOIN disease_occurrences USING (sentence_id)) as tmp2 > NATURAL JOIN disease >GROUP BY disease.disease_name >ORDER BY cnt DESC; There is no DISTINCT here. This is equvalent to your first query, iff the following unique constraints are true: (gene_id, sentence_id) in gene_occurrences (disease_id, sentence_id) in disease_occurrences (disease_id) in disease If they are, you don't need a sub-select (unless I'm missing something, please double-check): EXPLAIN ANALYZE SELECT disease.disease_name, count(*) AS cnt FROM disease, gene_occurrences, disease_occurrences WHERE gene_occurrences.sentence_id=disease_occurrences.sentence_id AND gene_occurrences.gene_id=get_gene_id('igm') AND disease.disease_id=disease_occurrences.disease_id GROUP BY tmp.disease_name ORDER BY cnt DESC; Anyway, your problem boils down to EXPLAIN ANALYZE SELECT d.disease_id, d.sentence_id FROM gene_occurrences g, disease_occurrences d WHERE g.sentence_id = d.sentence_id AND g.gene_id = 'some constant value'; Play with enable_xxxx to find out which join method provides the best performance for various gene_ids. Then we can start to fiddle with run-time parameters to help the optimizer choose the right plan. >Most of the runtime is used up by the index scan to join the occurrences >tables [...] > >At the moment my settings concering the query planner are: > >effective_cache_size = 80000 # typically 8KB each, default 1000 >random_page_cost = 1.5 # units are one sequential page fetch cost Usually you set a low random_page_cost value (the default is 4) if you want to favour index scans where the optimizer tends to use sequential scans. Was this your intention? >cpu_tuple_cost = 0.01 # (same), default 0.01 >cpu_index_tuple_cost = 0.00001 # (same), default 0.001 >cpu_operator_cost = 0.005 # (same), default 0.0025 Just out of curiosity: Are these settings based on prior experience? Servus Manfred
pgsql-performance by date: