Thread: cost and actual time

cost and actual time

From: Chantal Ackermann

hello all,

I am still fiddling around with my "big" database.

System:
RAM: 2GB
CPU: 1.6 GHz (cache: 256 KB)
single disc: 120 GB :-(

I have a query that joins two relatively large tables (10-15 million rows),
or rather parts of these tables (explain analyze: rows=46849).

long story short:

The overall cost estimated by explain is:
cost=6926.59..6926.60

The actual time from explain analyze is:
actual time=275461.91..275462.44

Most of it is consumed by a nested loop (surprise!).
This is the relevant part of the output:

Sort Key: disease.disease_name, disease_occurrences.sentence_id
->  Nested Loop  (cost=0.00..6922.38 rows=98 width=64) (actual
time=61.49..275047.46 rows=18910 loops=1)
   ->  Nested Loop  (cost=0.00..6333.23 rows=98 width=28) (actual
time=61.42..274313.87 rows=18910 loops=1)
     ->  Nested Loop  (cost=0.00..5894.04 rows=64 width=16) (actual
time=32.00..120617.26 rows=46849 loops=1)

I tried to tweak the conf settings, but I think I have already reached
quite good values for shared_buffers and sort_mem. The database has been
VACUUM FULL ANALYZEd. The indexes seem fine.

Could one of you smart guys point me in a direction I might not have
considered? I know the hardware is the bare minimum. Nevertheless, if
you have suggestions on what exactly to add to the hardware to speed the
database up (more RAM or more disks, maybe even a RAID), that would be a
good argument for my boss.

Thank you a lot
Chantal


Re: cost and actual time

From: Josh Berkus

Chantal,

> Sort Key: disease.disease_name, disease_occurrences.sentence_id
> ->  Nested Loop  (cost=0.00..6922.38 rows=98 width=64) (actual
> time=61.49..275047.46 rows=18910 loops=1)
>   ->  Nested Loop  (cost=0.00..6333.23 rows=98 width=28) (actual
> time=61.42..274313.87 rows=18910 loops=1)
>     ->  Nested Loop  (cost=0.00..5894.04 rows=64 width=16) (actual
> time=32.00..120617.26 rows=46849 loops=1)
>
> I tried to tweak the conf settings, but I think I have already reached
> quite good values for shared_buffers and sort_mem. The database has been
> VACUUM FULL ANALYZEd. The indexes seem fine.

You *sure* that you've vacuum analyzed recently?   The planner above is
choosing a bad plan because its row estimates are way off ... if the
subquery was actually returning 98 rows, the plan above would make
sense ... but with 18,000 rows being returned, a Nested Loop is
suicidal.

Perhaps you could post the full text of the query?  If some of your
criteria are coming from volatile functions, then that could explain
why the planner is so far off ...

-Josh Berkus

Re: cost and actual time

From: Chantal Ackermann

hello Josh,

thank you for your fast answer. (I had some days off.)

This posting is quite long, I apologize. But I wanted to provide enough
information to outline the problem.

I ran VACUUM ANALYZE on all 4 tables concerned (gene, disease,
gene_occurrences, disease_occurrences) to be sure the planner is up to
date - but that did not reduce the difference between the estimated row
counts and the actual results.

I changed default_statistics_target to 1000 (default: 10). The estimate
goes up to 102 rows, which is a little more than before and still far
away from the actual result.
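
(For reference, roughly the steps - the setting either in postgresql.conf
or per session, followed by a fresh analyze of the tables involved:

     SET default_statistics_target = 1000;
     VACUUM ANALYZE gene;
     VACUUM ANALYZE disease;
     VACUUM ANALYZE gene_occurrences;
     VACUUM ANALYZE disease_occurrences;
)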

The effective_cache_size is at 80000.

To be sure I hadn't made things worse, I checked again with
default_statistics_target set back to 10 and an effective_cache_size of
1000 (and ran VACUUM ANALYZE afterwards). The estimate is then 92 rows,
so there is not really a big difference.

I wonder if I can set some GEQO or planner settings in postgresql.conf
to make the planner estimate better. The database is exclusively for
reading, so it's ok if the time for analyzing the tables increases.

The query I am testing with is:

EXPLAIN ANALYZE
SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
    FROM (SELECT DISTINCT disease.disease_name,
disease_occurrences.sentence_id FROM gene, disease, gene_occurrences,
disease_occurrences
        WHERE gene.gene_name='igg'
        AND gene_occurrences.sentence_id=disease_occurrences.sentence_id
        AND gene.gene_id=gene_occurrences.gene_id
        AND disease.disease_id=disease_occurrences.disease_id) AS tmp
GROUP BY tmp.disease_name
ORDER BY cnt DESC;

Row counts are:
'gene': 164657
'disease': 129923
'gene_occurrences': 10484141
'disease_occurrences': 15079045

The gene_id for 'igg' occurs 110637 times in gene_occurrences; it is
the most frequent one.

The indexes on gene_occurrences(sentence_id) and
disease_occurrences(sentence_id) are clustered.
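
(i.e. something along these lines, assuming the unique indexes on
(sentence_id, ...) are the ones clustered on:

     CLUSTER gene_occ_uni ON gene_occurrences;
     CLUSTER disease_occ_uni ON disease_occurrences;
     ANALYZE gene_occurrences;
     ANALYZE disease_occurrences;
)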

I have an alternative query which I am testing to see whether it is
better than the first one:


explain analyze SELECT disease.disease_name, count(disease.disease_name)
AS cnt FROM
    ((SELECT gene_occurrences.sentence_id FROM gene_occurrences
         WHERE gene_occurrences.gene_id=(SELECT gene.gene_id FROM gene
WHERE gene.gene_name='igg')) AS tmp
    JOIN disease_occurrences USING (sentence_id)) as tmp2
    NATURAL JOIN disease
GROUP BY disease.disease_name
ORDER BY cnt DESC;

The cost estimated by the planner is much higher, so I assumed this
query is worse than the first one. However, maybe the estimate is just
more accurate?

This is its explain output (with the default setting of
default_statistics_target during VACUUM ANALYZE):


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=126690.02..126691.47 rows=581 width=57) (actual
time=8066.05..8067.05 rows=3364 loops=1)
    Sort Key: count(disease.disease_name)
    InitPlan
      ->  Index Scan using gene_uni on gene  (cost=0.00..5.26 rows=1
width=4) (actual time=0.19..0.20 rows=1 loops=1)
            Index Cond: (gene_name = 'igg'::text)
    ->  Aggregate  (cost=126619.79..126663.35 rows=581 width=57) (actual
time=7894.33..8055.43 rows=3364 loops=1)
          ->  Group  (cost=126619.79..126648.83 rows=5809 width=57)
(actual time=7894.31..8020.00 rows=30513 loops=1)
                ->  Sort  (cost=126619.79..126634.31 rows=5809 width=57)
(actual time=7894.30..7910.08 rows=30513 loops=1)
                      Sort Key: disease.disease_name
                      ->  Merge Join  (cost=119314.93..126256.64
rows=5809 width=57) (actual time=6723.92..7732.94 rows=30513 loops=1)
                            Merge Cond: ("outer".disease_id =
"inner".disease_id)
                            ->  Index Scan using disease_pkey on disease
  (cost=0.00..6519.14 rows=129923 width=37) (actual time=0.04..742.20
rows=129872 loops=1)
                            ->  Sort  (cost=119314.93..119329.45
rows=5809 width=20) (actual time=6723.74..6740.24 rows=30513 loops=1)
                                  Sort Key: disease_occurrences.disease_id
                                  ->  Nested Loop  (cost=0.00..118951.78
rows=5809 width=20) (actual time=1.19..6558.67 rows=30513 loops=1)
                                        ->  Index Scan using
gene_occ_id_i on gene_occurrences  (cost=0.00..15700.31 rows=4039
width=8) (actual time=0.36..1404.64 rows=110637 loops=1)
                                              Index Cond: (gene_id = $0)
                                        ->  Index Scan using
disease_occ_uni on disease_occurrences  (cost=0.00..25.47 rows=8
width=12) (actual time=0.04..0.04 rows=0 loops=110637)
                                              Index Cond:
("outer".sentence_id = disease_occurrences.sentence_id)
  Total runtime: 8086.87 msec
(20 rows)


Strangely, the estimate is far worse after running VACUUM ANALYZE again
with a default_statistics_target of 1000:


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=12521.37..12521.61 rows=96 width=57) (actual
time=124967.47..124968.47 rows=3364 loops=1)
    Sort Key: count(disease.disease_name)
    InitPlan
      ->  Index Scan using gene_uni on gene  (cost=0.00..5.26 rows=1
width=4) (actual time=20.27..20.28 rows=1 loops=1)
            Index Cond: (gene_name = 'igg'::text)
    ->  Aggregate  (cost=12510.99..12518.20 rows=96 width=57) (actual
time=124788.71..124956.77 rows=3364 loops=1)
          ->  Group  (cost=12510.99..12515.80 rows=961 width=57) (actual
time=124788.68..124920.10 rows=30513 loops=1)
                ->  Sort  (cost=12510.99..12513.39 rows=961 width=57)
(actual time=124788.66..124804.74 rows=30513 loops=1)
                      Sort Key: disease.disease_name
                      ->  Nested Loop  (cost=0.00..12463.35 rows=961
width=57) (actual time=164.11..124529.76 rows=30513 loops=1)
                            ->  Nested Loop  (cost=0.00..6671.06
rows=961 width=20) (actual time=148.34..120295.52 rows=30513 loops=1)
                                  ->  Index Scan using gene_occ_id_i on
gene_occurrences  (cost=0.00..2407.09 rows=602 width=8) (actual
time=20.63..1613.99 rows=110637 loops=1)
                                        Index Cond: (gene_id = $0)
                                  ->  Index Scan using disease_occ_uni
on disease_occurrences  (cost=0.00..7.06 rows=2 width=12) (actual
time=1.07..1.07 rows=0 loops=110637)
                                        Index Cond: ("outer".sentence_id
= disease_occurrences.sentence_id)
                            ->  Index Scan using disease_pkey on disease
  (cost=0.00..6.01 rows=1 width=37) (actual time=0.13..0.13 rows=1
loops=30513)
                                  Index Cond: ("outer".disease_id =
disease.disease_id)
  Total runtime: 124981.15 msec
(18 rows)

Again there is the estimate of 961 rows and the decision to choose a
Nested Loop, while the actual result contains 30513 rows.


Thank you for taking the time to read my postings!
Chantal




Re: cost and actual time

From: Tom Lane

Chantal Ackermann <chantal.ackermann@biomax.de> writes:
> The gene_id for 'igg' occurs 110637 times in gene_occurrences; it is
> the most frequent one.

I think the problem here is that the planner doesn't know that (and
probably can't without some kind of cross-table statistics apparatus).
It's generating a plan based on the average frequency of gene_ids, which
is a loser for this outlier.

Probably the most convenient way to do better is to structure things so
that the reduction from gene name to gene_id is done before the planner
starts to develop a plan.  Instead of joining to gene, consider this:

create function get_gene_id (text) returns int as -- adjust types as needed
'select gene_id from gene where gene_name = $1' language sql
immutable strict;  -- in 7.2, instead say "with (isCachable, isStrict)"

EXPLAIN ANALYZE
SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
    FROM (SELECT DISTINCT disease.disease_name,
disease_occurrences.sentence_id FROM disease, gene_occurrences,
disease_occurrences
        WHERE
        gene_occurrences.sentence_id=disease_occurrences.sentence_id
        AND get_gene_id('igg')=gene_occurrences.gene_id
        AND disease.disease_id=disease_occurrences.disease_id) AS tmp
GROUP BY tmp.disease_name
ORDER BY cnt DESC;

Now get_gene_id() isn't really immutable (unless you never change the
gene table) but you have to lie and pretend that it is, so that the
function call will be constant-folded during planner startup.  The
planner will then see something like gene_occurrences.gene_id = 42
and it will have a much better shot at determining the number of rows
this matches.
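
A quick way to verify that the folding happened is to EXPLAIN any query
that compares against the function, e.g.

    EXPLAIN SELECT * FROM gene_occurrences
     WHERE gene_id = get_gene_id('igg');

The index (or filter) condition in the resulting plan should show the
numeric gene_id, not the function call.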

            regards, tom lane

Re: cost and actual time

From: Manfred Koizar

On Mon, 17 Feb 2003 11:51:56 +0100, Chantal Ackermann
<chantal.ackermann@biomax.de> wrote:
>The gene_id for 'igg' occurs 110637 times in gene_occurrences; it is
>the most frequent one.

Chantal, could you try

EXPLAIN ANALYZE
SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
  FROM (SELECT DISTINCT d.disease_name, d_o.sentence_id
          FROM disease d,
               gene_occurrences g_o,
               disease_occurrences d_o
         WHERE g_o.sentence_id=d_o.sentence_id
           AND g_o.gene_id=4711
           AND d.disease_id=d_o.disease_id) AS tmp
 GROUP BY tmp.disease_name
 ORDER BY cnt DESC;

replacing 4711 by the result of
    SELECT gene_id FROM gene WHERE gene_name='igg'

and show us the plan you get?

Servus
 Manfred

Re: cost and actual time

From: Chantal Ackermann

hello Manfred, hello Josh, hello Tom,

I followed your advice. Using the new query with explicit joins,
combined with the function that retrieves the gene_id, the estimated row
count is now far more realistic. Still, the same query using different
gene names sometimes takes less than a second and sometimes several
minutes, obviously due to (not) caching. In the resulting query plan,
there are still a Seq Scan, a Nested Loop and a Hash Join that take up
most of the cost.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In detail:

I have created the following function:
CREATE OR REPLACE FUNCTION get_gene_id(TEXT) RETURNS INT AS
         'SELECT gene_id FROM gene WHERE gene_name = $1'
LANGUAGE SQL
IMMUTABLE STRICT;

Then I ran some queries with explain/explain analyze. For example:

1. the old query, leaving out the table gene and setting
gene_occurrences.gene_id to a certain gene_id, or the function
get_gene_id, respectively. (This is the query you suggested, Manfred.)

EXPLAIN ANALYZE
   SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
     FROM
       (SELECT DISTINCT disease.disease_name,
                        disease_occurrences.sentence_id
        FROM disease, gene_occurrences, disease_occurrences
        WHERE gene_occurrences.sentence_id=disease_occurrences.sentence_id
          AND gene_occurrences.gene_id=get_gene_id('igm')
          AND disease.disease_id=disease_occurrences.disease_id) AS tmp
GROUP BY tmp.disease_name ORDER BY cnt DESC;

'igm' occurs about 54,000 times in gene_occurrences.

                       QUERY PLAN


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=308065.26..308069.67 rows=882 width=57) (actual
time=53107.46..53108.13 rows=2326 loops=1)
    Sort Key: count(disease_name)
    ->  Aggregate  (cost=307846.61..307978.94 rows=882 width=57) (actual
time=53011.97..53100.58 rows=2326 loops=1)
          ->  Group  (cost=307846.61..307934.83 rows=8822 width=57)
(actual time=53011.94..53079.74 rows=16711 loops=1)
                ->  Sort  (cost=307846.61..307890.72 rows=8822 width=57)
(actual time=53011.93..53020.32 rows=16711 loops=1)
                      Sort Key: disease_name
                      ->  Subquery Scan tmp  (cost=305367.08..306690.35
rows=8822 width=57) (actual time=52877.87..52958.72 rows=16711 loops=1)
                            ->  Unique  (cost=305367.08..306690.35
rows=8822 width=57) (actual time=52877.85..52915.20 rows=16711 loops=1)
                                  ->  Sort  (cost=305367.08..305808.17
rows=88218 width=57) (actual time=52877.85..52886.47 rows=16711 loops=1)
                                        Sort Key: disease.disease_name,
disease_occurrences.sentence_id
                                        ->  Hash Join
(cost=4610.17..290873.90 rows=88218 width=57) (actual
time=388.53..52752.92 rows=16711 loops=1)
                                              Hash Cond:
("outer".disease_id = "inner".disease_id)
                                              ->  Nested Loop
(cost=0.00..282735.01 rows=88218 width=20) (actual time=0.25..52184.26
rows=16711 loops=1)
                                                    ->  Index Scan using
gene_occ_id_i on gene_occurrences  (cost=0.00..57778.26 rows=54692
width=8) (actual time=0.07..17455.52 rows=54677 loops=1)
                                                          Index Cond:
(gene_id = 70764)
                                                    ->  Index Scan using
disease_occ_uni on disease_occurrences  (cost=0.00..4.09 rows=2
width=12) (actual time=0.63..0.63 rows=0 loops=54677)
                                                          Index Cond:
("outer".sentence_id = disease_occurrences.sentence_id)
                                              ->  Hash
(cost=2500.23..2500.23 rows=129923 width=37) (actual time=387.45..387.45
rows=0 loops=1)
                                                    ->  Seq Scan on
disease  (cost=0.00..2500.23 rows=129923 width=37) (actual
time=0.02..207.71 rows=129923 loops=1)
  Total runtime: 53118.81 msec
(20 rows)

What takes up most of the runtime is the Nested Loop (for the join of
disease and disease_occurrences, or rather for joining the two occurrences
tables? I'm not sure which rows belong together in the explain output).

The cost for 'igg' is higher:
cost estimated by explain: 584729.76
actual runtime: 693210.99 msec
The query plan is the same. The Nested Loop takes up most of the runtime
(->  Nested Loop  (cost=0.00..538119.44 rows=176211 width=20) (actual
time=0.28..691474.74 rows=30513 loops=1))



2. The new query, same changes (gene left out, subselect replaced with
get_gene_id):

EXPLAIN ANALYZE
   SELECT disease.disease_name, count(disease.disease_name) AS cnt
   FROM
     ((SELECT gene_occurrences.sentence_id
         FROM gene_occurrences
           WHERE gene_occurrences.gene_id=get_gene_id('csf')) AS tmp
       JOIN disease_occurrences USING (sentence_id)) as tmp2
     NATURAL JOIN disease
GROUP BY disease.disease_name
ORDER BY cnt DESC;

'csf' occurs about 55,000 times in gene_occurrences.

              QUERY PLAN


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=323834.95..323881.54 rows=9318 width=57) (actual
time=146975.89..146976.60 rows=2383 loops=1)
    Sort Key: count(disease.disease_name)
    ->  Aggregate  (cost=321208.63..322606.31 rows=9318 width=57)
(actual time=146840.89..146968.58 rows=2383 loops=1)
          ->  Group  (cost=321208.63..322140.42 rows=93179 width=57)
(actual time=146840.87..146941.60 rows=24059 loops=1)
                ->  Sort  (cost=321208.63..321674.53 rows=93179
width=57) (actual time=146840.85..146852.92 rows=24059 loops=1)
                      Sort Key: disease.disease_name
                      ->  Hash Join  (cost=4485.78..305826.96 rows=93179
width=57) (actual time=544.79..146651.05 rows=24059 loops=1)
                            Hash Cond: ("outer".disease_id =
"inner".disease_id)
                            ->  Nested Loop  (cost=0.00..297614.04
rows=93179 width=20) (actual time=105.85..145936.47 rows=24059 loops=1)
                                  ->  Index Scan using gene_occ_id_i on
gene_occurrences  (cost=0.00..60007.96 rows=57768 width=8) (actual
time=41.86..47116.68 rows=55752 loops=1)
                                        Index Cond: (gene_id = 29877)
                                  ->  Index Scan using disease_occ_uni
on disease_occurrences  (cost=0.00..4.09 rows=2 width=12) (actual
time=1.76..1.77 rows=0 loops=55752)
                                        Index Cond: ("outer".sentence_id
= disease_occurrences.sentence_id)
                            ->  Hash  (cost=2500.23..2500.23 rows=129923
width=37) (actual time=438.16..438.16 rows=0 loops=1)
                                  ->  Seq Scan on disease
(cost=0.00..2500.23 rows=129923 width=37) (actual time=0.02..236.78
rows=129923 loops=1)
  Total runtime: 146986.12 msec
(16 rows)

This query is obviously not as good as the old one, though I don't
understand why the explicit joins are worse than what the optimizer
chooses on its own. There is still the Nested Loop that takes up the
biggest part of the runtime.

When I set enable_nestloop to false, explain outputs this plan:

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=2146887.90..2146934.49 rows=9318 width=57)
    Sort Key: count(disease.disease_name)
    ->  Aggregate  (cost=2144261.59..2145659.27 rows=9318 width=57)
          ->  Group  (cost=2144261.59..2145193.37 rows=93179 width=57)
                ->  Sort  (cost=2144261.59..2144727.48 rows=93179 width=57)
                      Sort Key: disease.disease_name
                      ->  Merge Join  (cost=2122513.18..2128879.92
rows=93179 width=57)
                            Merge Cond: ("outer".disease_id =
"inner".disease_id)
                            ->  Index Scan using disease_pkey on disease
  (cost=0.00..3388.03 rows=129923 width=37)
                            ->  Sort  (cost=2122513.18..2122979.08
rows=93179 width=20)
                                  Sort Key: disease_occurrences.disease_id
                                  ->  Merge Join
(cost=69145.63..2107131.52 rows=93179 width=20)
                                        Merge Cond: ("outer".sentence_id
= "inner".sentence_id)
                                        ->  Index Scan using
disease_occ_uni on disease_occurrences  (cost=0.00..1960817.45
rows=15079045 width=12)
                                        ->  Sort
(cost=69145.63..69434.47 rows=57768 width=8)
                                              Sort Key:
gene_occurrences.sentence_id
                                              ->  Index Scan using
gene_occ_id_i on gene_occurrences  (cost=0.00..60007.96 rows=57768 width=8)
                                                    Index Cond: (gene_id
= 29877)
(18 rows)

Most of the runtime is used up by the index scan to join the occurrences
tables, while the index scan for joining disease and
disease_occurrences is fast.




At the moment my settings concerning the query planner are:

effective_cache_size = 80000    # typically 8KB each, default 1000
random_page_cost = 1.5          # units are one sequential page fetch cost
cpu_tuple_cost = 0.01           # (same), default 0.01
cpu_index_tuple_cost = 0.00001  # (same), default 0.001
cpu_operator_cost = 0.005       # (same), default 0.0025



I am still having problems reading the output of explain, especially
knowing which rows belong together and which strategy applies to which
join. Is there some documentation that describes the format of explain,
other than the one in the main manual that comes with the PostgreSQL
installation? Just a short explanation or example of how to interpret
the indentation and arrows.

Thank you for your help!
Chantal


Re: cost and actual time

From: Manfred Koizar

On Tue, 18 Feb 2003 11:28:40 +0100, Chantal Ackermann
<chantal.ackermann@biomax.de> wrote:
>1. the old query, leaving out the table gene and setting
>gene_occurrences.gene_id to a certain gene_id, or the function
>get_gene_id, respectively. (This is the query you suggested, Manfred.)

This was Tom's suggestion.  I might have ended up there in a day or
two :-)

>What takes up most of the runtime is the Nested Loop (for the join of
>disease and disease_occurrences, or rather for joining the two occurrences
>tables? I'm not sure which rows belong together in the explain output).

... for joining both occurrences:  The "-> Nested Loop" takes two
tables (the "-> Index Scans") as input and produces one table as
output which is again used as input for the "-> Hash Join" above it.
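
Schematically, for the plan you posted (children are indented under the
node that consumes their output):

    ->  Hash Join                 <- joins the Nested Loop result with the Hash
          ->  Nested Loop         <- joins the two Index Scans below
                ->  Index Scan using gene_occ_id_i on gene_occurrences
                ->  Index Scan using disease_occ_uni on disease_occurrences
          ->  Hash
                ->  Seq Scan on disease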

>2. The new query, same changes (gene left out, subselect replaced with
>get_gene_id):
>
>EXPLAIN ANALYZE
>   SELECT disease.disease_name, count(disease.disease_name) AS cnt
>   FROM
>     ((SELECT gene_occurrences.sentence_id
>         FROM gene_occurrences
>           WHERE gene_occurrences.gene_id=get_gene_id('csf')) AS tmp
>       JOIN disease_occurrences USING (sentence_id)) as tmp2
>     NATURAL JOIN disease
>GROUP BY disease.disease_name
>ORDER BY cnt DESC;

There is no DISTINCT here.  This is equivalent to your first query, iff
the following unique constraints are true:
    (gene_id, sentence_id) in gene_occurrences
    (disease_id, sentence_id) in disease_occurrences
    (disease_id) in disease

If they are, you don't need a sub-select (unless I'm missing
something, please double-check):

EXPLAIN ANALYZE
   SELECT disease.disease_name, count(*) AS cnt
     FROM disease, gene_occurrences, disease_occurrences
    WHERE gene_occurrences.sentence_id=disease_occurrences.sentence_id
      AND gene_occurrences.gene_id=get_gene_id('igm')
      AND disease.disease_id=disease_occurrences.disease_id
    GROUP BY disease.disease_name
    ORDER BY cnt DESC;

Anyway, your problem boils down to

EXPLAIN ANALYZE
    SELECT d.disease_id, d.sentence_id
      FROM gene_occurrences g, disease_occurrences d
     WHERE g.sentence_id = d.sentence_id
       AND g.gene_id = 'some constant value';

Play with enable_xxxx to find out which join method provides the best
performance for various gene_ids.  Then we can start to fiddle with
run-time parameters to help the optimizer choose the right plan.
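
For example, in a psql session (and RESET afterwards, so later tests
start from the defaults again):

    SET enable_nestloop = off;     -- planner now prefers merge or hash join
    -- rerun the EXPLAIN ANALYZE from above
    SET enable_mergejoin = off;    -- only hash join is left
    -- rerun the EXPLAIN ANALYZE from above
    RESET enable_nestloop;
    RESET enable_mergejoin;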

>Most of the runtime is used up by the index scan to join the occurrences
>tables [...]
>
>At the moment my settings concerning the query planner are:
>
>effective_cache_size = 80000    # typically 8KB each, default 1000
>random_page_cost = 1.5          # units are one sequential page fetch cost

Usually you set a low random_page_cost value (the default is 4) if you
want to favour index scans where the optimizer tends to use sequential
scans.  Was this your intention?

>cpu_tuple_cost = 0.01           # (same), default 0.01
>cpu_index_tuple_cost = 0.00001  # (same), default 0.001
>cpu_operator_cost = 0.005       # (same), default 0.0025

Just out of curiosity:  Are these settings based on prior experience?

Servus
 Manfred

Re: cost and actual time

From: Chantal Ackermann

hello Manfred,

> ... for joining both occurrences:  The "-> Nested Loop" takes two
> tables (the "-> Index Scans") as input and produces one table as
> output which is again used as input for the "-> Hash Join" above it.

As I am testing with the most frequent gene names (i.e. the gene_ids
that occur most often in the occurrences tables), this is a very
expensive join. Whenever I try a less frequent gene_id the runtime is
shorter (though I haven't tested systematically with less frequent
gene_ids yet; my focus is on making the searches for the most frequent
genes faster, as these are probably the ones that are searched for a
lot).

> There is no DISTINCT here.  This is equivalent to your first query, iff
> the following unique constraints are true:
>     (gene_id, sentence_id) in gene_occurrences
>     (disease_id, sentence_id) in disease_occurrences
>     (disease_id) in disease
>
> If they are, you don't need a sub-select (unless I'm missing
> something, please double-check):

Yes, I noticed the difference between the two queries. Actually, I am
afraid of dropping the DISTINCT because I have had results with duplicate
rows (though I shall recheck whether that is really the case). These are
the table declarations and constraints:

relate=# \d gene
         Table "public.gene"
    Column    |  Type   | Modifiers
-------------+---------+-----------
  gene_id     | integer | not null
  gene_name   | text    | not null
  gene_syn_id | integer | not null
Indexes: gene_pkey primary key btree (gene_id),
          gene_name_uni unique btree (gene_name),
          gene_uni unique btree (gene_name, gene_syn_id),
          gene_syn_idx btree (gene_syn_id)

(disease looks the same)

relate_01=# \d gene_occurrences
   Table "public.gene_occurrences"
    Column    |  Type   | Modifiers
-------------+---------+-----------
  sentence_id | bigint  | not null
  gene_id     | integer | not null
  puid        | integer | not null
Indexes: gene_occ_uni unique btree (sentence_id, gene_id),
          gene_occ_id_i btree (gene_id)

relate_01=# \d disease_occurrences
Table "public.disease_occurrences"
    Column    |  Type   | Modifiers
-------------+---------+-----------
  sentence_id | bigint  | not null
  disease_id  | integer | not null
  puid        | integer | not null
Indexes: disease_occ_uni unique btree (sentence_id, disease_id),
          disease_occ_id_i btree (disease_id)

sentence_id and gene_id/disease_id are connected in an n:m relation.
As sentence_id is the primary key of a table with more than 50 million
rows, we decided not to use a serial as primary key but a unique
combination of two existing values. As this combination is too long for
an ordinary int, we have to use bigint as the type. Does that make the
join so expensive?

We had a primary key occurrence_id on the occurrences tables, but we
noticed that we don't use it, so we didn't recreate it in the new
database. Could PostgreSQL have made use of it internally?

> Play with enable_xxxx to find out which join method provides the best
> performance for various gene_ids.  Then we can start to fiddle with
> run-time parameters to help the optimizer choose the right plan.

this would be VERY helpful! :-)

I played around and this is the result:

EXPLAIN ANALYZE
SELECT d.disease_id, d.sentence_id
    FROM gene_occurrences g, disease_occurrences d
   WHERE g.sentence_id = d.sentence_id
     AND g.gene_id = get_gene_id([different very frequent gene names]);

choice of the planner: Nested Loop
  Total runtime: 53508.86 msec

set enable_nestloop to false;
Merge Join:  Total runtime: 113066.81 msec

set enable_mergejoin to false;
Hash Join: Total runtime: 439344.44 msec

Disabling the hash join too results again in a Nested Loop with a very
high estimated cost but low runtime - I'm not sure if the latter is a
consequence of caching. I changed the gene name at every run to avoid
the caching.

So the Nested Loop is obviously the best way to go?

For comparison, a less frequent gene (it occurs 6717 times in
gene_occurrences) produces the following query plan:


QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------

Nested Loop  (cost=0.00..41658.69 rows=12119 width=20) (actual
time=87.01..19076.62 rows=1371 loops=1)
    ->  Index Scan using gene_occ_id_i on gene_occurrences g
(cost=0.00..10754.08 rows=7514 width=8) (actual time=35.89..10149.14
rows=6717 loops=1)
          Index Cond: (gene_id = 16338)
    ->  Index Scan using disease_occ_uni on disease_occurrences d
(cost=0.00..4.09 rows=2 width=12) (actual time=1.32..1.32 rows=0 loops=6717)
          Index Cond: ("outer".sentence_id = d.sentence_id)
  Total runtime: 19078.48 msec

> Usually you set a low random_page_cost value (the default is 4) if you
> want to favour index scans where the optimizer tends to use sequential
> scans.  Was this your intention?

No, not really. I found a posting in the archives that suggested
reducing this parameter, so I tried it. I don't think it had any
perceptible effect.

>>cpu_tuple_cost = 0.01           # (same), default 0.01
>>cpu_index_tuple_cost = 0.00001  # (same), default 0.001
>>cpu_operator_cost = 0.005       # (same), default 0.0025
>
>
> Just out of curiosity:  Are these settings based on prior experience?

Nope, same as above. I changed these variables only two days ago, as
far as I recall. Until then I had them at their defaults.

Regards,
Chantal


Re: cost and actual time

From: Manfred Koizar

Chantal,

I'm short of time right now.  So just a few quick notes and a request
for more information.  Next round tomorrow ...

On Wed, 19 Feb 2003 10:38:54 +0100, Chantal Ackermann
<chantal.ackermann@biomax.de> wrote:
>Yes, I noticed the difference between the two queries. Actually, I am
>afraid of dropping the DISTINCT because I have had results with duplicate
>rows (though I shall recheck whether that is really the case). These are
>the table declarations and constraints: [...]

AFAICS there's no way to get duplicates, so no need for DISTINCT.

> we have to use bigint as the type. Does that make the join
>so expensive?

No.

>We had a primary key occurrence_id on the occurrences tables, but we
>noticed that we don't use it, so we didn't recreate it in the new
>database. Could PostgreSQL have made use of it internally?

No.

> Nested Loop Total runtime:  53508.86 msec
>Merge Join:  Total runtime: 113066.81 msec
>Hash Join:   Total runtime: 439344.44 msec
>I changed the gene name at every run to avoid the caching.

You can't compare the runtimes unless you query for the same data.
Either run each query twice to make sure everything is cached or do
something like
    tar cf /dev/null /some/big/directory
before each query to empty your disk cache.  BTW, what is your
shared_buffers setting?
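
(If you are not sure, you can check the current value from psql:

    SHOW shared_buffers;
)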

>Disabling the hash join too results again in a Nested Loop with a very
>high estimated cost but low runtime - I'm not sure if the latter is a
>consequence of caching.

Please send EXPLAIN ANALYZE results for all queries.  Send it to me
off-list if you think it's too long.

Servus
 Manfred

Re: cost and actual time

From: Manfred Koizar

On Wed, 19 Feb 2003 10:38:54 +0100, Chantal Ackermann
<chantal.ackermann@biomax.de> wrote:
>Nested Loop: 53508.86 msec
>Merge Join: 113066.81 msec
>Hash Join:  439344.44 msec

Chantal,

you might have reached the limit of what Postgres (or any other
database?) can do for you with these data structures.  Time for
something completely different:  Try calculating the counts in
advance.

    CREATE TABLE occ_stat (
        did INT NOT NULL,
        gid INT NOT NULL,
        cnt INT NOT NULL
    ) WITHOUT OIDS;

    CREATE INDEX occ_stat_dg ON occ_stat(did, gid);
    CREATE INDEX occ_stat_gd ON occ_stat(gid, did);

There is *no* UNIQUE constraint on (did, gid).  You get the numbers
you're after by
    SELECT did, sum(cnt) AS cnt
      FROM occ_stat
     WHERE gid = 'whatever'
     GROUP BY did
     ORDER BY cnt DESC;

occ_stat is initially loaded by

    INSERT INTO occ_stat
    SELECT did, gid, count(*)
      FROM g_o INNER JOIN d_o ON (g_o.sid = d_o.sid)
     GROUP BY did, gid;

Doing it in chunks
    WHERE sid BETWEEN a::bigint AND b::bigint
might be faster.
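
For example, one chunk could look like this (the range boundaries are
made up):

    INSERT INTO occ_stat
    SELECT did, gid, count(*)
      FROM g_o INNER JOIN d_o ON (g_o.sid = d_o.sid)
     WHERE g_o.sid BETWEEN 0::bigint AND 9999999::bigint
     GROUP BY did, gid;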

You have to block any INSERT/UPDATE/DELETE activity on d_o and g_o
while you do the initial load.  If it takes too long, see below for
how to do it in the background; hopefully the load task will catch up
some day :-)

Keeping occ_stat current:

    CREATE RULE d_o_i AS ON INSERT
        TO d_o DO (
            INSERT INTO occ_stat
            SELECT NEW.did, g_o.gid, 1
              FROM g_o
             WHERE g_o.sid = NEW.sid);

    CREATE RULE d_o_d AS ON DELETE
        TO d_o DO (
            INSERT INTO occ_stat
            SELECT OLD.did, g_o.gid, -1
              FROM g_o
             WHERE g_o.sid = OLD.sid);

On UPDATE do both.  Create a set of similar rules for g_o.

These rules will create a lot of duplicates on (did, gid) in occ_stat.
Updating existing rows and inserting only new combinations might seem
obvious, but this method has concurrency problems (cf. the thread
"Hard problem with concurrency" on -hackers).  So occ_stat calls for
reorganisation from time to time:

    BEGIN;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    CREATE TEMP TABLE t (did INT, gid INT, cnt INT) WITHOUT OIDS;

    INSERT INTO t
    SELECT did, gid, sum(cnt)
      FROM occ_stat
     GROUP BY did, gid
    HAVING count(*) > 1;

    DELETE FROM occ_stat
     WHERE t.did = occ_stat.did
       AND t.gid = occ_stat.gid;

    INSERT INTO occ_stat SELECT * FROM t;

    DROP TABLE t;
    COMMIT;
    VACUUM ANALYZE occ_stat;  -- very important!!

Now this should work, but the rules could kill INSERT/UPDATE/DELETE
performance.  Depending on your rate of modifications you might be
forced to push the statistics calculation to the background.

    CREATE TABLE d_o_change (
        sid BIGINT NOT NULL,
        did INT NOT NULL,
        cnt INT NOT NULL
    ) WITHOUT OIDS;

    ... ON INSERT TO d_o DO (
        INSERT INTO d_o_change VALUES (NEW.sid, NEW.did, 1));

    ... ON DELETE TO d_o DO (
        INSERT INTO d_o_change VALUES (OLD.sid, OLD.did, -1));

    ... ON UPDATE TO d_o
        WHERE OLD.sid != NEW.sid OR OLD.did != NEW.did
        DO both

And the same for g_o.

You need a task that periodically scans [dg]_o_change and does ...

    BEGIN;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT <any row (or some rows) from x_o_change>;
    INSERT INTO occ_stat <see above>;
    DELETE <the selected row(s) from x_o_change>;
    COMMIT;

Don't forget to VACUUM!

If you invest a little more work, I guess you can combine the
reorganisation into the loader task ...

I have no idea whether this approach is better than what you have now.
With a high INSERT/UPDATE/DELETE rate it may lead to a complete
performance disaster.  You have to try ...

Servus
 Manfred

Re: cost and actual time

From: Christopher Kings-Lynne

I nominate Manfred for support response award of the week!

Chris
