Re: cost and actual time - Mailing list pgsql-performance
From:           Chantal Ackermann
Subject:        Re: cost and actual time
Date:
Msg-id:         3E50BECC.9090107@biomax.de
In response to: Re: cost and actual time ("Josh Berkus" <josh@agliodbs.com>)
Responses:      Re: cost and actual time
                Re: cost and actual time
List:           pgsql-performance
Hello Josh,

thank you for your fast answer. (I had some days off.) This posting is quite long, I apologize, but I wanted to provide enough information to outline the problem.

I ran VACUUM ANALYZE on all four tables concerned (gene, disease, gene_occurrences, disease_occurrences) to be sure the planner is up to date, but that did not reduce the difference between the estimated and the actual number of rows.

I changed default_statistics_target to 1000 (the default is 10). The estimate goes up to 102 rows, which is a little more than before but still far away from the actual result. effective_cache_size is at 80000. To be sure I had not made things worse, I checked again with default_statistics_target set back to 10 and a cache size of 1000 (and ran VACUUM ANALYZE afterwards): the estimate is 92 rows, so there is not really a big difference.

I wonder whether I can set some GEQO or planner settings in postgresql.conf to make the planner estimate better? The database is used exclusively for reading, so it is OK if the time for analyzing the tables increases.

The query I am testing with is:

  EXPLAIN ANALYZE
  SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
    FROM (SELECT DISTINCT disease.disease_name, disease_occurrences.sentence_id
            FROM gene, disease, gene_occurrences, disease_occurrences
           WHERE gene.gene_name = 'igg'
             AND gene_occurrences.sentence_id = disease_occurrences.sentence_id
             AND gene.gene_id = gene_occurrences.gene_id
             AND disease.disease_id = disease_occurrences.disease_id) AS tmp
   GROUP BY tmp.disease_name
   ORDER BY cnt DESC;

Row counts are:

  gene:                   164657
  disease:                129923
  gene_occurrences:     10484141
  disease_occurrences:  15079045

The gene_id for 'igg' occurs 110637 times in gene_occurrences; it is the most frequent one. The indexes on gene_occurrences(sentence_id) and disease_occurrences(sentence_id) are clustered.

I have an alternative query which I am testing to see whether it is better than the first one:

  EXPLAIN ANALYZE
  SELECT disease.disease_name, count(disease.disease_name) AS cnt
    FROM ((SELECT gene_occurrences.sentence_id
             FROM gene_occurrences
            WHERE gene_occurrences.gene_id =
                  (SELECT gene.gene_id FROM gene WHERE gene.gene_name = 'igg')) AS tmp
          JOIN disease_occurrences USING (sentence_id)) AS tmp2
         NATURAL JOIN disease
   GROUP BY disease.disease_name
   ORDER BY cnt DESC;

The cost estimated by the planner is much higher, so I thought this query was worse than the first one. However, maybe it is just more accurate?
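A side note on the statistics settings above: if I read the documentation correctly, the statistics target can also be raised per column instead of globally, which would keep ANALYZE cheap for the columns that do not matter. This is only a sketch, I have not tried it yet (the columns are the join columns from the query above, and 1000 is simply the value I used for the global setting):

  -- raise the statistics target only for the join columns, then re-analyze
  ALTER TABLE gene_occurrences    ALTER COLUMN gene_id     SET STATISTICS 1000;
  ALTER TABLE gene_occurrences    ALTER COLUMN sentence_id SET STATISTICS 1000;
  ALTER TABLE disease_occurrences ALTER COLUMN sentence_id SET STATISTICS 1000;
  ANALYZE gene_occurrences;
  ANALYZE disease_occurrences;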
This is the EXPLAIN ANALYZE output of the alternative query (default_statistics_target was at its default during VACUUM ANALYZE):

 Sort  (cost=126690.02..126691.47 rows=581 width=57) (actual time=8066.05..8067.05 rows=3364 loops=1)
   Sort Key: count(disease.disease_name)
   InitPlan
     ->  Index Scan using gene_uni on gene  (cost=0.00..5.26 rows=1 width=4) (actual time=0.19..0.20 rows=1 loops=1)
           Index Cond: (gene_name = 'igg'::text)
   ->  Aggregate  (cost=126619.79..126663.35 rows=581 width=57) (actual time=7894.33..8055.43 rows=3364 loops=1)
         ->  Group  (cost=126619.79..126648.83 rows=5809 width=57) (actual time=7894.31..8020.00 rows=30513 loops=1)
               ->  Sort  (cost=126619.79..126634.31 rows=5809 width=57) (actual time=7894.30..7910.08 rows=30513 loops=1)
                     Sort Key: disease.disease_name
                     ->  Merge Join  (cost=119314.93..126256.64 rows=5809 width=57) (actual time=6723.92..7732.94 rows=30513 loops=1)
                           Merge Cond: ("outer".disease_id = "inner".disease_id)
                           ->  Index Scan using disease_pkey on disease  (cost=0.00..6519.14 rows=129923 width=37) (actual time=0.04..742.20 rows=129872 loops=1)
                           ->  Sort  (cost=119314.93..119329.45 rows=5809 width=20) (actual time=6723.74..6740.24 rows=30513 loops=1)
                                 Sort Key: disease_occurrences.disease_id
                                 ->  Nested Loop  (cost=0.00..118951.78 rows=5809 width=20) (actual time=1.19..6558.67 rows=30513 loops=1)
                                       ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..15700.31 rows=4039 width=8) (actual time=0.36..1404.64 rows=110637 loops=1)
                                             Index Cond: (gene_id = $0)
                                       ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..25.47 rows=8 width=12) (actual time=0.04..0.04 rows=0 loops=110637)
                                             Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
 Total runtime: 8086.87 msec
(20 rows)

Strangely, the estimate is far worse after running VACUUM ANALYZE again with a default_statistics_target of 1000:

 Sort  (cost=12521.37..12521.61 rows=96 width=57) (actual time=124967.47..124968.47 rows=3364 loops=1)
   Sort Key: count(disease.disease_name)
   InitPlan
     ->  Index Scan using gene_uni on gene  (cost=0.00..5.26 rows=1 width=4) (actual time=20.27..20.28 rows=1 loops=1)
           Index Cond: (gene_name = 'igg'::text)
   ->  Aggregate  (cost=12510.99..12518.20 rows=96 width=57) (actual time=124788.71..124956.77 rows=3364 loops=1)
         ->  Group  (cost=12510.99..12515.80 rows=961 width=57) (actual time=124788.68..124920.10 rows=30513 loops=1)
               ->  Sort  (cost=12510.99..12513.39 rows=961 width=57) (actual time=124788.66..124804.74 rows=30513 loops=1)
                     Sort Key: disease.disease_name
                     ->  Nested Loop  (cost=0.00..12463.35 rows=961 width=57) (actual time=164.11..124529.76 rows=30513 loops=1)
                           ->  Nested Loop  (cost=0.00..6671.06 rows=961 width=20) (actual time=148.34..120295.52 rows=30513 loops=1)
                                 ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..2407.09 rows=602 width=8) (actual time=20.63..1613.99 rows=110637 loops=1)
                                       Index Cond: (gene_id = $0)
                                 ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..7.06 rows=2 width=12) (actual time=1.07..1.07 rows=0 loops=110637)
                                       Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
                           ->  Index Scan using disease_pkey on disease  (cost=0.00..6.01 rows=1 width=37) (actual time=0.13..0.13 rows=1 loops=30513)
                                 Index Cond: ("outer".disease_id = disease.disease_id)
("outer".disease_id = disease.disease_id) Total runtime: 124981.15 msec (18 rows) There again is the estimation of 961 rows and the decision to choose a Nested Loop while the actual result includes 30513 rows. Thank you for taking the time to read my postings! Chantal Josh Berkus wrote: > Chantal, > > >>Sort Key: disease.disease_name, disease_occurrences.sentence_id >>-> Nested Loop (cost=0.00..6922.38 rows=98 width=64) (actual >>time=61.49..275047.46 rows=18910 loops=1) >> -> Nested Loop (cost=0.00..6333.23 rows=98 width=28) (actual >>time=61.42..274313.87 rows=18910 loops=1) >> -> Nested Loop (cost=0.00..5894.04 rows=64 width=16) (actual >>time=32.00..120617.26 rows=46849 loops=1) >> >>I tried to tweak the conf settings, but I think I already reached >>quite a good value concerning shared buffers and sort mem. the >>database is vacuum full analyzed. indexes seem fine. > > > You *sure* that you've vacuum analyzed recently? The planner above is > choosing a bad plan because its row estimates are way off ... if the > subquery was actually returning 98 rows, the plan above would make > sense ... but with 18,000 rows being returned, a Nested Loop is > suicidal. > > Perhaps you could post the full text of the query? If some of your > criteria are coming from volatile functions, then that could explain > why the planner is so far off ... > > -Josh Berkus > >