From: Florian Weimer
Subject: Re: Abysmal hash join
Msg-id: 82wt8awi21.fsf@mid.bfk.de
In response to: Re: Abysmal hash join (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Abysmal hash join (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-performance
* Tom Lane:

> Florian Weimer <fweimer@bfk.de> writes:
>>          ->  Bitmap Index Scan on large_rel_1_field_1  (cost=0.00..2003.09 rows=193739 width=0) (actual time=0.148..0.148 rows=12 loops=1)
>>                Index Cond: (n.field_1 = "outer".field_2)
>
> What you need to look into is why that rowcount estimate is off by four
> orders of magnitude.

Ah, thanks.

> The estimate on the smaller table is only off by a factor of 75 but
> that's still pretty darn awful.  Are the statistics up to date?

Seems so.  Running ANALYZE only increased the row estimate, instead of
decreasing it. 8-(
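
One way to double-check when statistics were last gathered, assuming
a release whose pg_stat_user_tables view carries the last_analyze
timestamps:

-- Shows the most recent manual and automatic ANALYZE runs.
SELECT relname, last_analyze, last_autoanalyze
  FROM pg_stat_user_tables
 WHERE relname = 'large_rel';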

> Maybe larger stats targets would help.

I've set default_statistics_target to 100 and rerun ANALYZE on that
table.  The estimate went down to 43108 (and the hash join is still
the preferred plan).  ANALYZE with default_statistics_target = 200
(which seems pretty large to me) brings the estimate down to 26050,
and the bitmap scan plan is chosen.
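
For what it's worth, the larger target can also be applied to just
this one column rather than globally, which keeps ANALYZE cheap on
the rest of the schema.  A sketch, using the table and column names
from this thread:

-- Raise the sample size for the problem column only, then
-- re-gather statistics; 200 matches the value tried above.
ALTER TABLE large_rel ALTER COLUMN field_1 SET STATISTICS 200;
ANALYZE large_rel;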

PostgreSQL seems to think that there are only a few thousand distinct
values in that column (the two plans below are with
default_statistics_target = 100 and 200, respectively):

EXPLAIN SELECT DISTINCT field_1 FROM large_rel;

 Unique  (cost=82841534.37..84400982.21 rows=7235 width=24)
   ->  Sort  (cost=82841534.37..83621258.29 rows=311889568 width=24)
         Sort Key: field_1
         ->  Seq Scan on large_rel  (cost=0.00..6863066.68 rows=311889568 width=24)

 Unique  (cost=82733282.28..84290654.92 rows=11957 width=24)
   ->  Sort  (cost=82733282.28..83511968.60 rows=311474528 width=24)
         Sort Key: field_1
         ->  Seq Scan on large_rel  (cost=0.00..6858916.28 rows=311474528 width=24)
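
The distinct-count estimate the planner is working from can also be
read directly out of the standard pg_stats view, instead of being
inferred from EXPLAIN output.  A sketch:

-- n_distinct is the estimator's output; a negative value means a
-- fraction of the total row count rather than an absolute count.
SELECT n_distinct, null_frac
  FROM pg_stats
 WHERE tablename = 'large_rel'
   AND attname = 'field_1';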

I don't know the exact value, but it's closer to a few million.  The
distribution is quite odd.  A large sample of the column (10 million
rows) looks like this:

SELECT cnt, COUNT(*) FROM
  (SELECT COUNT(*) AS cnt FROM
     (SELECT field_1 FROM large_rel LIMIT 10000000) x GROUP BY field_1) y
  GROUP BY cnt ORDER BY cnt;

  cnt   | count
--------+--------
      1 | 258724
      2 |  85685
      3 |  46215
      4 |  29333
      5 |  20512
      6 |  15276
      7 |  11444
      8 |   9021
[...]
  59379 |      1
  59850 |      1
 111514 |      1
 111783 |      1
 111854 |      1
 112259 |      1
 112377 |      1
 116379 |      1
 116473 |      1
 116681 |      1

Maybe I'm just screwed with such a distribution, but it's still rather
unfortunate.
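
Later PostgreSQL releases (9.0 and up) allow overriding the estimate
directly when ANALYZE's sampling gets it this wrong on a skewed
distribution.  A sketch, with the target value assumed from the "few
million" figure above:

-- Pin n_distinct for this column; takes effect at the next ANALYZE.
-- A negative value would mean a fraction of the row count instead.
ALTER TABLE large_rel ALTER COLUMN field_1 SET (n_distinct = 3000000);
ANALYZE large_rel;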

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Durlacher Allee 47            tel: +49-721-96201-1
D-76131 Karlsruhe             fax: +49-721-96201-99
