Thread: Abysmal hash join
Hi,

For this simple join of two tables:

SELECT * FROM large_rel n, smaller_rel a
 WHERE n.field_1 = a.field_2 AND a.key = '127.0.0.1';

PostgreSQL 8.1.4 chooses an extremely bad query plan:

 Hash Join  (cost=283.45..8269374.38 rows=14137 width=94)
   Hash Cond: ("outer".field_1 = "inner".field_2)
   ->  Seq Scan on large_rel n  (cost=0.00..6760690.04 rows=301651904 width=52)
   ->  Hash  (cost=283.27..283.27 rows=74 width=42)
         ->  Bitmap Heap Scan on smaller_rel a  (cost=2.26..283.27 rows=74 width=42)
               Recheck Cond: (key = '127.0.0.1'::inet)
               ->  Bitmap Index Scan on smaller_rel_1_key  (cost=0.00..2.26 rows=74 width=0)
                     Index Cond: (key = '127.0.0.1'::inet)

Note the sequential scan over the whole large_rel table (the corresponding row estimate is roughly correct).

If I turn off hash joins, I get this plan, which actually completes in finite time:

 Nested Loop  (cost=2005.35..46955689.59 rows=14137 width=94) (actual time=0.325..0.678 rows=12 loops=1)
   ->  Bitmap Heap Scan on smaller_rel a  (cost=2.26..283.27 rows=74 width=42) (actual time=0.132..0.133 rows=1 loops=1)
         Recheck Cond: (key = '127.0.0.1'::inet)
         ->  Bitmap Index Scan on smaller_rel_1_key  (cost=0.00..2.26 rows=74 width=0) (actual time=0.095..0.095 rows=1 loops=1)
               Index Cond: (key = '127.0.0.1'::inet)
   ->  Bitmap Heap Scan on large_rel n  (cost=2003.09..632110.78 rows=193739 width=52) (actual time=0.182..0.501 rows=12 loops=1)
         Recheck Cond: (n.field_1 = "outer".field_2)
         ->  Bitmap Index Scan on large_rel_1_field_1  (cost=0.00..2003.09 rows=193739 width=0) (actual time=0.148..0.148 rows=12 loops=1)
               Index Cond: (n.field_1 = "outer".field_2)

The row estimate for

SELECT * FROM smaller_rel a WHERE a.key = '127.0.0.1';

is somewhat off:

 Bitmap Heap Scan on smaller_rel a  (cost=2.26..283.27 rows=74 width=42) (actual time=0.134..0.135 rows=1 loops=1)
   Recheck Cond: (key = '127.0.0.1'::inet)
   ->  Bitmap Index Scan on smaller_rel_1_key  (cost=0.00..2.26 rows=74 width=0) (actual time=0.108..0.108 rows=1 loops=1)
         Index Cond: (key = '127.0.0.1'::inet)

However, I can't believe that the hash join would be faster even if there were 74 matching rows in smaller_rel instead of just one.

The estimate decreases when I increase the portion of smaller_rel which is scanned by ANALYZE (to something like 10% of the table), but this doesn't look like a solution.

Any suggestions?

(The queries have been pseudonymized and may contain typos.)

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Durlacher Allee 47            tel: +49-721-96201-1
D-76131 Karlsruhe             fax: +49-721-96201-99
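(For the record, the nested-loop plan above was obtained by disabling hash joins for the session; the exact statements aren't quoted, but it amounts to something like this:

SET enable_hashjoin = off;
EXPLAIN ANALYZE
SELECT * FROM large_rel n, smaller_rel a
 WHERE n.field_1 = a.field_2 AND a.key = '127.0.0.1';

The setting can be restored afterwards with RESET enable_hashjoin.)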
Florian Weimer <fweimer@bfk.de> writes:
>          ->  Bitmap Index Scan on large_rel_1_field_1  (cost=0.00..2003.09 rows=193739 width=0) (actual time=0.148..0.148 rows=12 loops=1)
>                Index Cond: (n.field_1 = "outer".field_2)

What you need to look into is why that rowcount estimate is off by four
orders of magnitude.  The estimate on the smaller table is only off by a
factor of 75 but that's still pretty darn awful.  Are the statistics up
to date?  Maybe larger stats targets would help.

			regards, tom lane
* Tom Lane:

> Florian Weimer <fweimer@bfk.de> writes:
>>          ->  Bitmap Index Scan on large_rel_1_field_1  (cost=0.00..2003.09 rows=193739 width=0) (actual time=0.148..0.148 rows=12 loops=1)
>>                Index Cond: (n.field_1 = "outer".field_2)
>
> What you need to look into is why that rowcount estimate is off by four
> orders of magnitude.

Ah, thanks.

> The estimate on the smaller table is only off by a factor of 75 but
> that's still pretty darn awful.  Are the statistics up to date?

Seems so.  Running ANALYZE only increased the row estimate, instead of
decreasing it. 8-(

> Maybe larger stats targets would help.

I've set default_statistics_target to 100 and rerun ANALYZE on that
table.  The estimate went down to 43108 (and the hash join is still the
preferred plan).  With default_statistics_target = 200 (which seems
pretty large to me), ANALYZE brings it down to 26050, and the bitmap
scan plan is chosen.

PostgreSQL seems to think that there are only very few distinct values
for that column (with default_statistics_target = 100 and 200,
respectively):

EXPLAIN SELECT DISTINCT field_1 FROM large_rel;

 Unique  (cost=82841534.37..84400982.21 rows=7235 width=24)
   ->  Sort  (cost=82841534.37..83621258.29 rows=311889568 width=24)
         Sort Key: field_1
         ->  Seq Scan on large_rel  (cost=0.00..6863066.68 rows=311889568 width=24)

 Unique  (cost=82733282.28..84290654.92 rows=11957 width=24)
   ->  Sort  (cost=82733282.28..83511968.60 rows=311474528 width=24)
         Sort Key: field_1
         ->  Seq Scan on large_rel  (cost=0.00..6858916.28 rows=311474528 width=24)

I don't know the exact number of distinct values, but it's closer to a
few million.  The distribution is quite odd.  A large sample of the
column (10 million rows) looks like this:

SELECT cnt, COUNT(*) FROM
  (SELECT COUNT(*) AS cnt FROM
    (SELECT field_1 FROM large_rel LIMIT 10000000) x
   GROUP BY field_1) y
GROUP BY cnt ORDER BY cnt;

  cnt   | count
--------+--------
      1 | 258724
      2 |  85685
      3 |  46215
      4 |  29333
      5 |  20512
      6 |  15276
      7 |  11444
      8 |   9021
[...]
  59379 |      1
  59850 |      1
 111514 |      1
 111783 |      1
 111854 |      1
 112259 |      1
 112377 |      1
 116379 |      1
 116473 |      1
 116681 |      1

Maybe I'm just screwed with such a distribution, but it's still rather
unfortunate.

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Durlacher Allee 47            tel: +49-721-96201-1
D-76131 Karlsruhe             fax: +49-721-96201-99
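(As a cross-check on what the planner believes, the distinct-value estimate can also be read straight out of the statistics view; a sketch, using the table and column names from above:

SELECT n_distinct, null_frac, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'large_rel' AND attname = 'field_1';

A small positive n_distinct there would confirm that the planner thinks the column has only a few thousand distinct values.)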
Florian Weimer <fweimer@bfk.de> writes:
>> Maybe larger stats targets would help.

> I've set default_statistics_target to 100 and rerun ANALYZE on that
> table.  The estimate went down to 43108 (and the hash join is still
> the preferred plan).  ANALYZE with default_statistics_target = 200
> (which seems pretty large to me) is down to 26050 and the bitmap scan
> plan is chosen.

> PostgreSQL seems to think that there are only very few distinct values
> for that column (with default_statistics_target = 100 and 200):

Yeah, n_distinct estimation from a sample is inherently hard :-(.  Given
that you have such a long tail on the distribution, it might be worth
your while to crank the stats target for that column all the way to the
maximum (1000).  Also you need to experiment with extending the stats
for the smaller table.

I believe what's happening here is that the smaller table joins only to
less-frequent entries in the big table (correct?).  The hash join would
be appropriate if there were many rows joining to the very-frequent
entries, and the problem for the planner is to determine that that's
not so.  Given enough stats on the two joining columns, it should be
able to determine that.

Of course, large stats targets will slow down planning to some extent,
so you should also keep an eye on how long it takes to plan the query.

			regards, tom lane
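(Concretely, raising the target for just the two join columns rather than the global default would look like this; the column name on the smaller table is assumed from the join condition:

ALTER TABLE large_rel   ALTER COLUMN field_1 SET STATISTICS 1000;
ALTER TABLE smaller_rel ALTER COLUMN field_2 SET STATISTICS 1000;
ANALYZE large_rel;
ANALYZE smaller_rel;

This keeps the extra statistics overhead confined to the columns that matter for this join.)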
* Tom Lane:

> Yeah, n_distinct estimation from a sample is inherently hard :-(.  Given
> that you have such a long tail on the distribution, it might be worth
> your while to crank the stats target for that column all the way to the
> maximum (1000).

I've done that.  Fortunately, ANALYZE time didn't increase by that much
compared to the default (only by a factor of 10).  The bitmap scan
estimate is still way off (around 8000), but let's hope that it won't
matter in practice.

> Also you need to experiment with extending the stats for the smaller
> table.

Yeah, the situation there is quite similar, but on a much smaller scale.

> I believe what's happening here is that the smaller table joins only to
> less-frequent entries in the big table (correct?).

Almost.  We won't select rows based on the very-frequent values, at
least not in queries of that type, simply because the result set would
be too large to be useful.

> Of course, large stats targets will slow down planning to some extent,
> so you should also keep an eye on how long it takes to plan the query.

These queries are mostly ad-hoc, so a delay of a couple of seconds
doesn't matter; it would only be a problem if I had to wait five
minutes.

It seems that the situation is under control now.  Thanks.

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Durlacher Allee 47            tel: +49-721-96201-1
D-76131 Karlsruhe             fax: +49-721-96201-99
Florian Weimer <fweimer@bfk.de> writes:
> I've done that.  Fortunately, ANALYZE time didn't increase by that
> much, compared to the default (by just a factor of 10).

With really high stats targets you also have to keep an eye on planning
time.  The extra data in the stats table can cause planning to take
longer.

-- 
greg
PG 8.0.3 is choosing a bad plan for a query.  I'm going to force the plan (by making one join into a function; a sketch of that follows the schema below).  I'd like to know if this is unexpected; in general, can PG see that a join on a grouped-by field can be pushed down into the query as an indexable filter?

The query below joins a table "message" to an aggregate of "message_recipient" joined to "recipient".  The joins are all on indexed PK-FK columns.  "message_recipient" is an intersect table:

    message :<: message_recipient :>: recipient

In the query plan below, the right side of the join returns one row of "message", and PG knows it.  The left side of the join computes the entire aggregate of "message_recipient" (est. 700K rows), then does a merge join against the single message row.  I would have hoped for a nested-loop join, where the message "id" field would be used to index-scan "message_recipient", which in turn would index-scan "recipient" by recipient "id".

This is PG 8.0.3.  All tables have been (very) recently analyzed.  The query plan's estimated rowcounts all look bang-on.  "message" and "message_recipient" are tables of about 3M rows each.  As usual, this is on a system to which I only have restricted access, but I'd be happy to expand on the info below with anything short of the pg_dump.

-----------------------------------
EXPLAIN
SELECT message.id AS m_db_id, message.m_global_id AS id, m_global_id,
       m_queue_id, h_message_id, m_date AS c_date_iso, m_date,
       c_subject_utf8, message.reason_id AS reason_id,
       m_reason.name AS m_reason, m_spam_probability, m_spam_level,
       h_to, m_message_size, m_header_size,
       date_part('epoch', message.m_date) AS c_qdate_time,
       h_from_local || '@' || h_from_domain AS h_from,
       env_from_local || '@' || env_from_domain AS env_from,
       env_from_local || '@' || env_from_domain AS m_envelope_from,
       location_name AS location, m_milter_host, m_relay,
       virus_name AS m_virus_name, m_all_recipients
FROM message
JOIN m_reason ON message.reason_id = m_reason.reason_id
JOIN message_all_recipients ON message.id = message_all_recipients.m_id
WHERE message.m_global_id = '2211000-1';

                                   QUERY PLAN
-------------------------------------------------------------------------------------------
 Nested Loop  (cost=254538.42..283378.44 rows=1 width=425)
   Join Filter: ("outer".reason_id = "inner".reason_id)
   ->  Merge Join  (cost=254538.42..283377.33 rows=1 width=416)
         Merge Cond: ("outer".m_id = "inner".id)
         ->  Subquery Scan message_all_recipients  (cost=254535.40..281604.95 rows=707735 width=40)
               ->  GroupAggregate  (cost=254535.40..274527.60 rows=707735 width=36)
                     ->  Sort  (cost=254535.40..258250.57 rows=1486069 width=36)
                           Sort Key: message_recipient.message_id
                           ->  Merge Join  (cost=0.00..78970.52 rows=1486069 width=36)
                                 Merge Cond: ("outer".id = "inner".recipient_id)
                                 ->  Index Scan using pk_recipient on recipient  (cost=0.00..5150.65 rows=204514 width=36)
                                 ->  Index Scan using pk_message_recipient on message_recipient  (cost=0.00..56818.25 rows=1486069 width=16)
                                       Filter: (is_mapped = 1)
         ->  Sort  (cost=3.02..3.03 rows=1 width=384)
               Sort Key: message.id
               ->  Index Scan using unq_message_m_global_id on message  (cost=0.00..3.01 rows=1 width=384)
                     Index Cond: ((m_global_id)::text = '2211000-1'::text)
   ->  Seq Scan on m_reason  (cost=0.00..1.04 rows=4 width=13)

-----------------------------------
Relevant tables and view:

# \d message
                          Table "public.message"
       Column       |            Type             |          Modifiers
--------------------+-----------------------------+----------------------------------------------------------
 id                 | bigint                      | not null default nextval('public.message_id_seq'::text)
 m_global_id        | character varying(255)      | not null
 reason_id          | smallint                    | not null
 location_name      | character varying(255)      | not null
 m_date             | timestamp without time zone |
 m_queue_id         | character varying(255)      |
 h_message_id       | character varying(255)      |
 c_subject_utf8     | character varying(255)      |
 env_from_local     | character varying(255)      |
 env_from_domain    | character varying(255)      |
 h_from_local       | character varying(255)      |
 h_from_domain      | character varying(255)      |
 h_from             | character varying(255)      |
 h_to               | character varying(255)      |
 m_milter_host      | character varying(255)      |
 m_relay            | character varying(255)      |
 m_spam_probability | double precision            |
 m_message_size     | integer                     |
 m_header_size      | integer                     |
 m_spam_level       | character varying(255)      |
 virus_name         | text                        |
Indexes:
    "pk_message" PRIMARY KEY, btree (id)
    "unq_message_m_global_id" UNIQUE, btree (m_global_id)
    "message_h_message_id_index" btree (h_message_id)
    "message_m_date_index" btree (m_date)
    "message_m_queue_id_index" btree (m_queue_id)

# \d message_recipient
     Table "public.message_recipient"
    Column     |   Type   |     Modifiers
---------------+----------+--------------------
 recipient_id  | bigint   | not null
 message_id    | bigint   | not null
 is_mapped     | smallint | not null default 0
 is_calculated | smallint | not null default 0
 is_envelope   | smallint | not null default 0
 reason_id     | smallint | not null
 action        | smallint |
Indexes:
    "pk_message_recipient" PRIMARY KEY, btree (recipient_id, message_id)
    "message_recipient_message_id_index" btree (message_id)
Foreign-key constraints:
    "rc_rcpnt_map_msg_id" FOREIGN KEY (message_id) REFERENCES message(id) ON DELETE CASCADE

CREATE AGGREGATE catenate (
    BASETYPE = text,
    SFUNC = textcat,
    STYPE = text,
    INITCOND = ''
);

CREATE OR REPLACE VIEW message_all_recipients AS
SELECT message_id AS m_id,
       substr(catenate(',' || local || '@' || domain), 2) AS m_all_recipients
FROM message_recipient
JOIN recipient ON id = recipient_id
WHERE is_mapped = 1
GROUP BY message_id;

-----------------------------------
pg_statistics info, probably not of much interest:

Object                                 DiskIO    CacheIO    Ins      Upd  Del      SeqScan  TupRead       IdxScan  IdxFetch
m_reason                               308       599679     1        0    0        599985   2399935       0        0
message                                4658766   14977816   2210967  0    933643   7299     81428503      5855900  8833404
message.pk_~                           227834    31683671   0        0    0        0        3897054       5850229  3897054
message.unq_~_m_global_id              252753    8591251    0        0    0        0        5552          5564     5552
message.~_h_~_id_index                 1879172   8496722    0        0    0        0        0             0        0
message.~_m_date_index                 245405    8526765    0        0    0        0        4930798       107      4930798
message.~_m_queue_id_index             245719    8598360    0        0    0        0        0             0        0
message_recipient                      41862572  81546465   2703260  104  1144977  0        0             2648101  117192003
message_recipient.pk_~                 4541776   16430539   0        0    0        0        116042206     1710555  116042206
message_recipient.~_message_id_index   243379    14235956   0        0    0        0        1149797       937546   1149797
recipient                              55288623  955926871  223057   0    112158   584592   103499192990  5726999  62036712
recipient.pk_~                         180080    1125073    0        0    0        0        7440446       117045   7440446
recipient.unq_~                        2205366   21513447   0        0    0        0        54166857      5609472  54166857
recipient.~_domain_index               191722    734683     0        0    0        0        429409        482      429409

-----------------------------------
Output of "pgdisk", showing actual disk space vs pg_class info:

DISK-KB   DATA-KB   EST-KB   EST-ROWS  OID    NAME
1625360   1021104   979000   1315620   17261  public.message
369208    159200    159032   1558240   17272  public.message_recipient
45752     16408     14552    181646    17293  public.recipient
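The plan-forcing function mentioned at the top would be along these lines (a sketch only; the function name is made up and the volatility marking is a guess):

CREATE OR REPLACE FUNCTION recipients_for_message(bigint) RETURNS text AS $$
    -- Aggregate the recipient list for a single message id, so the
    -- outer query can call this once per matching message row instead
    -- of aggregating the whole message_recipient table.
    SELECT substr(catenate(',' || local || '@' || domain), 2)
    FROM message_recipient
    JOIN recipient ON id = recipient_id
    WHERE is_mapped = 1 AND message_id = $1
    GROUP BY message_id;
$$ LANGUAGE SQL STABLE;

The main query would then select recipients_for_message(message.id) instead of joining against message_all_recipients, so the recipient aggregate is only computed for the one matching message.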
Mischa Sandberg <mischa@ca.sophos.com> writes:
> can PG see that a join on a grouped-by field
> can be pushed down into the query as an indexable filter?

No.  The GROUP BY serves as a partial optimization fence.  If you're
concerned about the speed of this query, I recommend making a different
view in which 'message' is joined inside the GROUP BY.

			regards, tom lane
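For what it's worth, a view along those lines might look roughly like this (a sketch; the view name and the choice of message columns exposed are illustrative, and on 8.0 every non-aggregated output column has to appear in the GROUP BY):

CREATE OR REPLACE VIEW message_with_recipients AS
SELECT message.id AS m_db_id,
       message.m_global_id,
       substr(catenate(',' || local || '@' || domain), 2) AS m_all_recipients
FROM message
JOIN message_recipient ON message_recipient.message_id = message.id
JOIN recipient ON recipient.id = message_recipient.recipient_id
WHERE is_mapped = 1
GROUP BY message.id, message.m_global_id;

The idea is that a restriction like m_global_id = '2211000-1' then lands on a grouping column of the view, so it can be applied to "message" before the aggregation rather than after it.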
Tom Lane wrote:
> Mischa Sandberg <mischa@ca.sophos.com> writes:
>> can PG see that a join on a grouped-by field
>> can be pushed down into the query as an indexable filter?
>
> No.  The GROUP BY serves as a partial optimization fence.  If you're
> concerned about the speed of this query, I recommend making a different
> view in which 'message' is joined inside the GROUP BY.

Thanks.