Re: cost and actual time - Mailing list pgsql-performance

From Manfred Koizar
Subject Re: cost and actual time
Date
Msg-id ajo45v4o0op0f3qnqh1le2rs79008fqp5u@4ax.com
Whole thread Raw
In response to Re: cost and actual time  (Chantal Ackermann <chantal.ackermann@biomax.de>)
Responses Re: cost and actual time
List pgsql-performance
On Tue, 18 Feb 2003 11:28:40 +0100, Chantal Ackermann
<chantal.ackermann@biomax.de> wrote:
>1. the old query, leaving out the table gene and setting
>gene_occurrences.gene_id to a certain gene_id, or the function
>get_gene_id, respectively. (This is the query you suggested, Manfred.)

This was Tom's suggestion.  I might have ended up there in a day or
two :-)

>What takes up most of the runtime the Nested Loop (for the join of
>disease and disease_occurrences, or rather for joining both occurrences
>tables? I'm not sure which rows belong together in the explain output).

... for joining both occurrences:  The "-> Nested Loop" takes two
tables (the "-> Index Scans") as input and produces one table as
output which is again used as input for the "-> Hash Join" above it.

>2. The new query, same changes (gene left out, subselect replaced with
>get_gene_id):
>
>EXPLAIN ANALYZE
>   SELECT disease.disease_name, count(disease.disease_name) AS cnt
>   FROM
>     ((SELECT gene_occurrences.sentence_id
>         FROM gene_occurrences
>           WHERE gene_occurrences.gene_id=get_gene_id('csf')) AS tmp
>       JOIN disease_occurrences USING (sentence_id)) as tmp2
>     NATURAL JOIN disease
>GROUP BY disease.disease_name
>ORDER BY cnt DESC;

There is no DISTINCT here.  This is equvalent to your first query, iff
the following unique constraints are true:
    (gene_id, sentence_id) in gene_occurrences
    (disease_id, sentence_id) in disease_occurrences
    (disease_id) in disease

If they are, you don't need a sub-select (unless I'm missing
something, please double-check):

EXPLAIN ANALYZE
   SELECT disease.disease_name, count(*) AS cnt
     FROM disease, gene_occurrences, disease_occurrences
    WHERE gene_occurrences.sentence_id=disease_occurrences.sentence_id
      AND gene_occurrences.gene_id=get_gene_id('igm')
      AND disease.disease_id=disease_occurrences.disease_id
    GROUP BY tmp.disease_name
    ORDER BY cnt DESC;

Anyway, your problem boils down to

EXPLAIN ANALYZE
    SELECT d.disease_id, d.sentence_id
      FROM gene_occurrences g, disease_occurrences d
     WHERE g.sentence_id = d.sentence_id
       AND g.gene_id = 'some constant value';

Play with enable_xxxx to find out which join method provides the best
performance for various gene_ids.  Then we can start to fiddle with
run-time parameters to help the optimizer choose the right plan.

>Most of the runtime is used up by the index scan to join the occurrences
>tables [...]
>
>At the moment my settings concering the query planner are:
>
>effective_cache_size = 80000    # typically 8KB each, default 1000
>random_page_cost = 1.5          # units are one sequential page fetch cost

Usually you set a low random_page_cost value (the default is 4) if you
want to favour index scans where the optimizer tends to use sequential
scans.  Was this your intention?

>cpu_tuple_cost = 0.01           # (same), default 0.01
>cpu_index_tuple_cost = 0.00001  # (same), default 0.001
>cpu_operator_cost = 0.005       # (same), default 0.0025

Just out of curiosity:  Are these settings based on prior experience?

Servus
 Manfred

pgsql-performance by date:

Previous
From: Chantal Ackermann
Date:
Subject: Re: cost and actual time
Next
From: Josh Berkus
Date:
Subject: Re: Questions about indexes?