Re: bad estimation together with large work_mem generates terrible slow hash joins - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: bad estimation together with large work_mem generates terrible slow hash joins
Date
Msg-id 0d6b59f90fa1d653dfac3b5b4516d465@fuzzy.cz
In response to bad estimation together with large work_mem generates terrible slow hash joins  (Pavel Stehule <pavel.stehule@gmail.com>)
Responses Re: bad estimation together with large work_mem generates terrible slow hash joins  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-hackers
Hi,

On 2014-06-26 14:10, Pavel Stehule wrote:
> Hello all,
>
> today I had to work with one slow query - when I checked different
> scenarios I found a dependency on work_mem size, but an atypical one:
> with a bigger work_mem the query ran slower - 31 minutes with 600MB
> work_mem vs. 1 minute with 1MB.

The problem described in Pavel's emails (illustrated by the awful
explain plans) is in this part:

  ->  Hash  (cost=881801.58..881801.58 rows=61735 width=8) (actual
time=9076.153..9076.153 rows=3310768 loops=1)

That is, the estimated number of rows is ~60k, but in reality it's ~3.3M.
This leads to a hash table with a small number of buckets (8192), each
holding a long linked list of tuples (3.3M / 8192 is roughly 400 tuples
per bucket), which significantly slows down the lookups during the hash
join.
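
To illustrate why that hurts, here is a rough sketch of what a probe
from the outer side has to do (only an illustration of the general
mechanism, with made-up names - not the actual executor code in
nodeHash.c):

typedef struct HashJoinTupleSketch
{
    struct HashJoinTupleSketch *next;   /* next tuple in this bucket's list */
    unsigned int hashvalue;             /* hash of the join key */
    /* ... the tuple data itself would follow here ... */
} HashJoinTupleSketch;

static int
count_hash_matches(HashJoinTupleSketch **buckets, int nbuckets,
                   unsigned int hashvalue)
{
    int bucketno = hashvalue & (nbuckets - 1);  /* nbuckets is a power of two */
    int nmatches = 0;
    HashJoinTupleSketch *tuple;

    /* walk the whole chain - the work grows with the chain length */
    for (tuple = buckets[bucketno]; tuple != NULL; tuple = tuple->next)
    {
        if (tuple->hashvalue == hashvalue)
            nmatches++;   /* real code would also recheck the join quals */
    }

    return nmatches;
}

The cost of each probe is proportional to the chain length, so with ~400
tuples per bucket instead of the NTUP_PER_BUCKET target of 10, every
outer row does roughly 40x the expected number of comparisons.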

This issue is actually quite common - all it takes is a significant
underestimation of the hashed relation, either because it's a complex
join (and thus inherently difficult to estimate) or because the WHERE
conditions are not independent. The example below is of the latter kind:
val_a and val_b are perfectly correlated, so the planner multiplies two
~1% selectivities and expects ~0.01% of table_b, while the real
selectivity is ~1%.

The attached patch (early WIP, after ~30 minutes of hacking) addresses
this by increasing the number of buckets whenever the average number of
tuples per bucket exceeds 1.5x NTUP_PER_BUCKET.
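
Roughly, the idea looks like this (a simplified sketch only, not the
patch itself - the real code also has to worry about memory contexts,
multiple batches and so on, and the names below are made up):

#include <stdlib.h>

#define NTUP_PER_BUCKET 10   /* target tuples per bucket (10 at the time) */

typedef struct HashJoinTupleSketch
{
    struct HashJoinTupleSketch *next;   /* next tuple in this bucket's list */
    unsigned int hashvalue;             /* hash of the join key */
} HashJoinTupleSketch;

typedef struct
{
    int nbuckets;                       /* current bucket count (power of two) */
    long ntuples;                       /* tuples inserted so far */
    HashJoinTupleSketch **buckets;      /* array of bucket chains */
} HashTableSketch;

static void
maybe_increase_nbuckets(HashTableSketch *ht)
{
    int i;
    int oldnbuckets = ht->nbuckets;
    int newnbuckets = oldnbuckets * 2;
    HashJoinTupleSketch **newbuckets;

    /* grow only once the average chain length exceeds 1.5x the target */
    if (ht->ntuples <= (long) (1.5 * NTUP_PER_BUCKET) * ht->nbuckets)
        return;

    newbuckets = calloc(newnbuckets, sizeof(HashJoinTupleSketch *));  /* OOM check omitted */

    /* rehash: move every tuple into its bucket under the wider mask */
    for (i = 0; i < oldnbuckets; i++)
    {
        HashJoinTupleSketch *tuple = ht->buckets[i];

        while (tuple != NULL)
        {
            HashJoinTupleSketch *next = tuple->next;
            int bucketno = tuple->hashvalue & (newnbuckets - 1);

            tuple->next = newbuckets[bucketno];
            newbuckets[bucketno] = tuple;
            tuple = next;
        }
    }

    free(ht->buckets);
    ht->buckets = newbuckets;
    ht->nbuckets = newnbuckets;
}

The point of doing this at execution time, rather than trusting the
planner's estimate when sizing the table, is that the hash table can
adapt even when the estimate is badly off.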


======= Example ========

create table table_a (id int, fk_id int);
create table table_b (fk_id int, val_a int, val_b int);

insert into table_a select i, i from generate_series(1,10000000) s(i);
insert into table_b select i, mod(i,1000), mod(i,1000) from
generate_series(1,10000000) s(i);

analyze table_a;
analyze table_b;

explain analyze select count(*) from table_a join table_b on (table_a.id
= table_b.fk_id) where val_a < 10 and val_b < 10;


without the patch:

                                                             QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=385834.56..385834.57 rows=1 width=0) (actual
time=49543.263..49543.264 rows=1 loops=1)
    ->  Hash Join  (cost=204069.89..385831.16 rows=1359 width=0) (actual
time=923.751..49531.554 rows=100000 loops=1)
          Hash Cond: (table_a.id = table_b.fk_id)
          ->  Seq Scan on table_a  (cost=0.00..144247.77 rows=9999977
width=4) (actual time=0.104..967.090 rows=10000000 loops=1)
          ->  Hash  (cost=204052.90..204052.90 rows=1359 width=4) (actual
time=923.615..923.615 rows=100000 loops=1)
                Buckets: 1024  Batches: 1  Memory Usage: 3516kB
                ->  Seq Scan on table_b  (cost=0.00..204052.90 rows=1359
width=4) (actual time=0.086..910.656 rows=100000 loops=1)
                      Filter: ((val_a < 10) AND (val_b < 10))
                      Rows Removed by Filter: 9900000
  Planning time: 0.271 ms
  Execution time: 49545.053 ms
(11 rows)


with the patch:

                                                             QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=385834.56..385834.57 rows=1 width=0) (actual
time=9780.346..9780.347 rows=1 loops=1)
    ->  Hash Join  (cost=204069.89..385831.16 rows=1359 width=0) (actual
time=939.297..9772.256 rows=100000 loops=1)
          Hash Cond: (table_a.id = table_b.fk_id)
          ->  Seq Scan on table_a  (cost=0.00..144247.77 rows=9999977
width=4) (actual time=0.103..962.446 rows=10000000 loops=1)
          ->  Hash  (cost=204052.90..204052.90 rows=1359 width=4) (actual
time=939.172..939.172 rows=100000 loops=1)
                Buckets: 8192  Batches: 1  Memory Usage: 3516kB
                ->  Seq Scan on table_b  (cost=0.00..204052.90 rows=1359
width=4) (actual time=0.064..909.191 rows=100000 loops=1)
                      Filter: ((val_a < 10) AND (val_b < 10))
                      Rows Removed by Filter: 9900000
  Planning time: 0.276 ms
  Execution time: 9782.295 ms
(11 rows)

Time: 9784.392 ms

So the duration improved significantly - from ~52 seconds to ~10
seconds.

The patch is not perfect - it needs a bit more love, as illustrated by
the FIXME/TODO items. Feel free to comment.

regards
Tomas


