Re: Abysmal hash join - Mailing list pgsql-performance
From:            Florian Weimer <fweimer@bfk.de>
Subject:         Re: Abysmal hash join
Date:
Msg-id:          82wt8awi21.fsf@mid.bfk.de
In response to:  Re: Abysmal hash join (Tom Lane <tgl@sss.pgh.pa.us>)
Responses:       Re: Abysmal hash join
List:            pgsql-performance
* Tom Lane:

> Florian Weimer <fweimer@bfk.de> writes:
>>  ->  Bitmap Index Scan on large_rel_1_field_1  (cost=0.00..2003.09 rows=193739 width=0) (actual time=0.148..0.148 rows=12 loops=1)
>>        Index Cond: (n.field_1 = "outer".field_2)
>
> What you need to look into is why that rowcount estimate is off by four
> orders of magnitude.

Ah, thanks.

> The estimate on the smaller table is only off by a factor of 75 but
> that's still pretty darn awful.  Are the statistics up to date?

Seems so.  Running ANALYZE only increased the row estimate, instead of
decreasing it. 8-(

> Maybe larger stats targets would help.

I've set default_statistics_target to 100 and rerun ANALYZE on that
table.  The estimate went down to 43108 (and the hash join is still
the preferred plan).  ANALYZE with default_statistics_target = 200
(which seems pretty large to me) brought it down to 26050, and the
bitmap scan plan is chosen.

PostgreSQL seems to think that there are only very few distinct values
for that column (with default_statistics_target = 100 and 200,
respectively):

EXPLAIN SELECT DISTINCT field_1 FROM large_rel;

 Unique  (cost=82841534.37..84400982.21 rows=7235 width=24)
   ->  Sort  (cost=82841534.37..83621258.29 rows=311889568 width=24)
         Sort Key: field_1
         ->  Seq Scan on large_rel  (cost=0.00..6863066.68 rows=311889568 width=24)

 Unique  (cost=82733282.28..84290654.92 rows=11957 width=24)
   ->  Sort  (cost=82733282.28..83511968.60 rows=311474528 width=24)
         Sort Key: field_1
         ->  Seq Scan on large_rel  (cost=0.00..6858916.28 rows=311474528 width=24)

I don't know the exact value, but it's closer to a few million.  The
distribution is quite odd.  A large sample of the column (10 million
rows) looks like this:

SELECT cnt, COUNT(*) FROM
  (SELECT COUNT(*) AS cnt FROM
    (SELECT field_1 FROM large_rel LIMIT 10000000) x
   GROUP BY field_1) y
GROUP BY cnt ORDER BY cnt;

  cnt   | count
--------+--------
      1 | 258724
      2 |  85685
      3 |  46215
      4 |  29333
      5 |  20512
      6 |  15276
      7 |  11444
      8 |   9021
[...]
  59379 |      1
  59850 |      1
 111514 |      1
 111783 |      1
 111854 |      1
 112259 |      1
 112377 |      1
 116379 |      1
 116473 |      1
 116681 |      1

Maybe I'm just screwed with such a distribution, but it's still rather
unfortunate.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Durlacher Allee 47            tel: +49-721-96201-1
D-76131 Karlsruhe             fax: +49-721-96201-99
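The distinct-value estimate that drives those row counts can also be
read straight out of pg_stats instead of being inferred from EXPLAIN.
A minimal sketch, reusing the table and column names from the plans
above:

    -- n_distinct holds the planner's estimate of distinct field_1 values;
    -- a negative value means "this fraction of the total row count".
    SELECT n_distinct, null_frac, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'large_rel' AND attname = 'field_1';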
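The statistics target can also be raised for the one problematic
column rather than globally through default_statistics_target.  A
sketch of that variant, with the target value chosen arbitrarily:

    -- Sample field_1 more heavily without affecting other columns.
    ALTER TABLE large_rel ALTER COLUMN field_1 SET STATISTICS 200;
    ANALYZE large_rel;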
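Later PostgreSQL releases (9.0 and up, presumably newer than the
server discussed in this thread) allow pinning the distinct-value
estimate outright when sampling keeps underestimating it.  A hedged
sketch; the 2000000 merely stands in for the "few million" figure
mentioned above:

    -- Override the sampled n_distinct; takes effect at the next ANALYZE.
    ALTER TABLE large_rel ALTER COLUMN field_1 SET (n_distinct = 2000000);
    ANALYZE large_rel;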