Re: cost and actual time - Mailing list pgsql-performance

From Chantal Ackermann
Subject Re: cost and actual time
Date
Msg-id 3E5350AE.7080408@biomax.de
In response to Re: cost and actual time  (Manfred Koizar <mkoi-pg@aon.at>)
List pgsql-performance
Hello Manfred,

> ... for joining both occurrences:  The "-> Nested Loop" takes two
> tables (the "-> Index Scans") as input and produces one table as
> output which is again used as input for the "-> Hash Join" above it.

As I am testing with the most frequent gene names (i.e. the gene_ids
that occur most often in the occurrences tables), this is a very
expensive join. Whenever I try a less frequent gene_id, the runtime is
shorter (though I haven't tested specifically with less frequent
gene_ids yet; my focus is on making the searches for the most frequent
genes faster, as these are probably the ones that are searched for
most often).
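
For reference, a query along these lines would list the most frequent
gene_ids (just a sketch; the LIMIT of 10 is arbitrary):

SELECT gene_id, count(*) AS n_occurrences
   FROM gene_occurrences
  GROUP BY gene_id
  ORDER BY n_occurrences DESC
  LIMIT 10;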

> There is no DISTINCT here.  This is equivalent to your first query, iff
> the following unique constraints are true:
>     (gene_id, sentence_id) in gene_occurrences
>     (disease_id, sentence_id) in disease_occurrences
>     (disease_id) in disease
>
> If they are, you don't need a sub-select (unless I'm missing
> something, please double-check):

Yeah, I noticed the difference between the two queries. Actually, I am
afraid of dropping the DISTINCT because I had results with duplicate
rows (though I shall recheck whether this is really the case; a check
is sketched below, after the declarations). These are the table
declarations and constraints:

relate=# \d gene
        Table "public.gene"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 gene_id     | integer | not null
 gene_name   | text    | not null
 gene_syn_id | integer | not null
Indexes: gene_pkey primary key btree (gene_id),
         gene_name_uni unique btree (gene_name),
         gene_uni unique btree (gene_name, gene_syn_id),
         gene_syn_idx btree (gene_syn_id)

(disease looks the same)

relate_01=# \d gene_occurrences
  Table "public.gene_occurrences"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 sentence_id | bigint  | not null
 gene_id     | integer | not null
 puid        | integer | not null
Indexes: gene_occ_uni unique btree (sentence_id, gene_id),
         gene_occ_id_i btree (gene_id)

relate_01=# \d disease_occurrences
 Table "public.disease_occurrences"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 sentence_id | bigint  | not null
 disease_id  | integer | not null
 puid        | integer | not null
Indexes: disease_occ_uni unique btree (sentence_id, disease_id),
         disease_occ_id_i btree (disease_id)
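
To recheck the duplicates question: a query like this should return no
rows if the DISTINCT is indeed redundant (just a sketch; 16338 is the
example gene_id from further below):

SELECT d.disease_id, d.sentence_id, count(*)
   FROM gene_occurrences g, disease_occurrences d
  WHERE g.sentence_id = d.sentence_id
    AND g.gene_id = 16338
  GROUP BY d.disease_id, d.sentence_id
 HAVING count(*) > 1;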

sentence_id and gene/disease_id are connected in an n:m relation.
As sentence_id is the primary key of a table with more than 50 million
rows, we decided not to use a serial as the primary key but a unique
combination of two existing values. As this combination is too long for
an ordinary int, we have to use bigint as the type. Is that why the
join is so expensive?
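
Just to illustrate what I mean by combining two values (hypothetical
numbers; the real composition of our sentence_id is different):

SELECT (12345::bigint * 4294967296::bigint) + 67890 AS sentence_id;
-- that is: high part * 2^32 + low part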

We had a primary key occurrence_id on the occurrences tables, but we
noticed that we don't use it, so we didn't recreate it in the new
database. Is it possible that Postgres could make use of it internally?

> Play with enable_xxxx to find out which join method provides the best
> performance for various gene_ids.  Then we can start to fiddle with
> run-time parameters to help the optimizer choose the right plan.

This would be VERY helpful! :-)

I played around and this is the result:

EXPLAIN ANALYZE
SELECT d.disease_id, d.sentence_id
    FROM gene_occurrences g, disease_occurrences d
   WHERE g.sentence_id = d.sentence_id
     AND g.gene_id = get_gene_id([different very frequent gene names]);
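
Each run looked roughly like this (a sketch; the gene name is only a
stand-in, and the RESET keeps one run from influencing the next):

SET enable_nestloop TO false;  -- rule out the nested loop for this run
EXPLAIN ANALYZE
SELECT d.disease_id, d.sentence_id
   FROM gene_occurrences g, disease_occurrences d
  WHERE g.sentence_id = d.sentence_id
    AND g.gene_id = get_gene_id('some frequent gene name');
RESET enable_nestloop;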

choice of the planner: Nested Loop
  Total runtime: 53508.86 msec

set enable_nestloop to false;
Merge Join:  Total runtime: 113066.81 msec

set enable_mergejoin to false;
Hash Join: Total runtime: 439344.44 msec

Disabling the hash join results again in a Nested Loop with a very high
cost but low runtime - I'm not sure whether the latter is a consequence
of caching. I changed the gene name on every run to avoid caching
effects.

So the Nested Loop is obviously the best way to go?

For comparison, a less frequent gene (it occurs 6717 times in
gene_occurrences) produces the following query plan:

                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..41658.69 rows=12119 width=20) (actual time=87.01..19076.62 rows=1371 loops=1)
   ->  Index Scan using gene_occ_id_i on gene_occurrences g  (cost=0.00..10754.08 rows=7514 width=8) (actual time=35.89..10149.14 rows=6717 loops=1)
         Index Cond: (gene_id = 16338)
   ->  Index Scan using disease_occ_uni on disease_occurrences d  (cost=0.00..4.09 rows=2 width=12) (actual time=1.32..1.32 rows=0 loops=6717)
         Index Cond: ("outer".sentence_id = d.sentence_id)
 Total runtime: 19078.48 msec

> Usually you set a low random_page_cost value (the default is 4) if you
> want to favour index scans where the optimizer tends to use sequential
> scans.  Was this your intention?

No, not really. I found a posting in the archives where someone
suggested reducing this parameter, so I tried it. I don't think it had
any perceptible effect.
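
Next time I would try it per session first instead of changing
postgresql.conf globally (just a sketch; the value 2 is an arbitrary
pick):

SET random_page_cost = 2;  -- default is 4; lower values favour index scans
SHOW random_page_cost;     -- confirm the session setting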

>>cpu_tuple_cost = 0.01           # (same), default 0.01
>>cpu_index_tuple_cost = 0.00001  # (same), default 0.001
>>cpu_operator_cost = 0.005       # (same), default 0.0025
>
>
> Just out of curiosity:  Are these settings based on prior experience?

Nope, same as above. I changed these variables only two days ago, as
far as I recall. Until then I had them at their defaults.
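
For a quick test I could set them back to their defaults per session
(a sketch, using the default values quoted above; RESET alone would
only fall back to what is in postgresql.conf, so I set the values
explicitly):

SET cpu_index_tuple_cost = 0.001;  -- back to the default
SET cpu_operator_cost = 0.0025;    -- back to the default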

Regards,
Chantal

