Thread: bad estimation together with large work_mem generates terrible slow hash joins
bad estimation together with large work_mem generates terrible slow hash joins
From
Pavel Stehule
Date:
Hello all,
today I had to work with one slow query - when I checked different scenarios I found a dependency on work_mem size - but it is atypical: a bigger work_mem makes the query slower - 31 minutes of execution with 600MB work_mem versus 1 minute with 1MB.

db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '600MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT "f_users_aax8qg0e23asx5h"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN "f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS "c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS "c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS "def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( ( "f_emails_aaekn159p5aw3t8" INNER JOIN "f_users_aax8qg0e23asx5h" ON ( "f_emails_aaekn159p5aw3t8"."userid_id" = "f_users_aax8qg0e23asx5h"."id" ) ) INNER JOIN "dwh_dm_abnblk9j5oagorw" ON ( "f_users_aax8qg0e23asx5h"."dt_registrationdatetime_id" = "dwh_dm_abnblk9j5oagorw"."id" ) ) WHERE ( 2014 = "dwh_dm_abnblk9j5oagorw"."id_year" ) GROUP BY 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2145957.37..2145957.96 rows=59 width=12) (actual time=1864088.145..1864088.155 rows=31 loops=1)
-> Hash Join (cost=882573.27..2142753.08 rows=213619 width=12) (actual time=9083.326..1859069.171 rows=7070141 loops=1)
Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = f_users_aax8qg0e23asx5h.id)
-> Seq Scan on f_emails_aaekn159p5aw3t8 (cost=0.00..854559.95 rows=32278695 width=12) (actual time=0.006..14271.239 rows=32211565 loops=1)
-> Hash (cost=881801.58..881801.58 rows=61735 width=8) (actual time=9076.153..9076.153 rows=3310768 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 129327kB
-> Hash Join (cost=782.15..881801.58 rows=61735 width=8) (actual time=1.423..8086.228 rows=3310768 loops=1)
Hash Cond: (f_users_aax8qg0e23asx5h.dt_registrationdatetime_id = dwh_dm_abnblk9j5oagorw.id)
-> Seq Scan on f_users_aax8qg0e23asx5h (cost=0.00..845420.42 rows=9328442 width=12) (actual time=0.015..4098.915 rows=9322096 loops=1)
-> Hash (cost=777.59..777.59 rows=365 width=4) (actual time=1.400..1.400 rows=365 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Bitmap Heap Scan on dwh_dm_abnblk9j5oagorw (cost=11.08..777.59 rows=365 width=4) (actual time=0.111..1.218 rows=365 loops=1)
Recheck Cond: (2014 = id_year)
-> Bitmap Index Scan on dwh_dm_abnblk9j5oagorw_id_year_idx (cost=0.00..10.99 rows=365 width=0) (actual time=0.068..0.068 rows=365 loops=1)
Index Cond: (2014 = id_year)
Total runtime: 1864107.373 ms
(16 rows)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '1MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT "f_users_aax8qg0e23asx5h"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN "f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS "c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS "c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS "def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( ( "f_emails_aaekn159p5aw3t8" INNER JOIN "f_users_aax8qg0e23asx5h" ON ( "f_emails_aaekn159p5aw3t8"."userid_id" = "f_users_aax8qg0e23asx5h"."id" ) ) INNER JOIN "dwh_dm_abnblk9j5oagorw" ON ( "f_users_aax8qg0e23asx5h"."dt_registrationdatetime_id" = "dwh_dm_abnblk9j5oagorw"."id" ) ) WHERE ( 2014 = "dwh_dm_abnblk9j5oagorw"."id_year" ) GROUP BY 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2275675.45..2275675.95 rows=50 width=12) (actual time=48438.407..48438.416 rows=31 loops=1)
-> Hash Join (cost=882772.88..2272526.68 rows=209918 width=12) (actual time=9384.610..45211.963 rows=6988804 loops=1)
Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = f_users_aax8qg0e23asx5h.id)
-> Seq Scan on f_emails_aaekn159p5aw3t8 (cost=0.00..839754.64 rows=31719464 width=12) (actual time=0.034..14299.338 rows=31653392 loops=1)
-> Hash (cost=881759.44..881759.44 rows=61715 width=8) (actual time=9368.371..9368.371 rows=3310768 loops=1)
Buckets: 4096 Batches: 256 (originally 4) Memory Usage: 1025kB
-> Hash Join (cost=782.15..881759.44 rows=61715 width=8) (actual time=3.839..8223.097 rows=3310768 loops=1)
Hash Cond: (f_users_aax8qg0e23asx5h.dt_registrationdatetime_id = dwh_dm_abnblk9j5oagorw.id)
-> Seq Scan on f_users_aax8qg0e23asx5h (cost=0.00..845389.92 rows=9325392 width=12) (actual time=0.017..4118.598 rows=9322096 loops=1)
-> Hash (cost=777.59..777.59 rows=365 width=4) (actual time=3.804..3.804 rows=365 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Bitmap Heap Scan on dwh_dm_abnblk9j5oagorw (cost=11.08..777.59 rows=365 width=4) (actual time=0.188..3.641 rows=365 loops=1)
Recheck Cond: (2014 = id_year)
-> Bitmap Index Scan on dwh_dm_abnblk9j5oagorw_id_year_idx (cost=0.00..10.99 rows=365 width=0) (actual time=0.115..0.115 rows=365 loops=1)
Index Cond: (2014 = id_year)
Total runtime: 48439.958 ms
(16 rows)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '10MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT "f_users_aax8qg0e23asx5h"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN "f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS "c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS "c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS "def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( ( "f_emails_aaekn159p5aw3t8" INNER JOIN "f_users_aax8qg0e23asx5h" ON ( "f_emails_aaekn159p5aw3t8"."userid_id" = "f_users_aax8qg0e23asx5h"."id" ) ) INNER JOIN "dwh_dm_abnblk9j5oagorw" ON ( "f_users_aax8qg0e23asx5h"."dt_registrationdatetime_id" = "dwh_dm_abnblk9j5oagorw"."id" ) ) WHERE ( 2014 = "dwh_dm_abnblk9j5oagorw"."id_year" ) GROUP BY 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2124026.77..2124027.27 rows=50 width=12) (actual time=100739.984..100739.998 rows=31 loops=1)
-> Hash Join (cost=882530.88..2120878.00 rows=209918 width=12) (actual time=9467.702..97238.959 rows=6988804 loops=1)
Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = f_users_aax8qg0e23asx5h.id)
-> Seq Scan on f_emails_aaekn159p5aw3t8 (cost=0.00..839754.64 rows=31719464 width=12) (actual time=0.015..9185.978 rows=31653392 loops=1)
-> Hash (cost=881759.44..881759.44 rows=61715 width=8) (actual time=9466.440..9466.440 rows=3310768 loops=1)
Buckets: 8192 Batches: 16 (originally 1) Memory Usage: 10241kB
-> Hash Join (cost=782.15..881759.44 rows=61715 width=8) (actual time=3.681..8183.423 rows=3310768 loops=1)
Hash Cond: (f_users_aax8qg0e23asx5h.dt_registrationdatetime_id = dwh_dm_abnblk9j5oagorw.id)
-> Seq Scan on f_users_aax8qg0e23asx5h (cost=0.00..845389.92 rows=9325392 width=12) (actual time=0.012..4064.044 rows=9322096 loops=1)
-> Hash (cost=777.59..777.59 rows=365 width=4) (actual time=3.654..3.654 rows=365 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Bitmap Heap Scan on dwh_dm_abnblk9j5oagorw (cost=11.08..777.59 rows=365 width=4) (actual time=0.129..3.449 rows=365 loops=1)
Recheck Cond: (2014 = id_year)
-> Bitmap Index Scan on dwh_dm_abnblk9j5oagorw_id_year_idx (cost=0.00..10.99 rows=365 width=0) (actual time=0.077..0.077 rows=365 loops=1)
Index Cond: (2014 = id_year)
Total runtime: 100740.552 ms
(16 rows)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT "stehule"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN "f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS "c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS "c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS "def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( ( "f_emails_aaekn159p5aw3t8" INNER JOIN stehule on "f_emails_aaekn159p5aw3t8"."userid_id" = stehule."id" ) ) GROUP BY 1 ;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2073456.55..2073457.23 rows=68 width=12) (actual time=58724.094..58724.106 rows=31 loops=1)
-> Hash Join (cost=89142.28..1589276.13 rows=32278695 width=12) (actual time=2191.019..55499.331 rows=7070141 loops=1)
Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = stehule.id)
-> Seq Scan on f_emails_aaekn159p5aw3t8 (cost=0.00..854559.95 rows=32278695 width=12) (actual time=0.018..8364.554 rows=32211565 loops=1)
-> Hash (cost=47757.68..47757.68 rows=3310768 width=8) (actual time=2188.858..2188.858 rows=3310768 loops=1)
Buckets: 524288 Batches: 1 Memory Usage: 129327kB
-> Seq Scan on stehule (cost=0.00..47757.68 rows=3310768 width=8) (actual time=0.026..927.235 rows=3310768 loops=1)
Total runtime: 58741.224 ms
(8 rows)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '1MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT "stehule"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN "f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS "c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS "c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS "def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( ( "f_emails_aaekn159p5aw3t8" INNER JOIN stehule on "f_emails_aaekn159p5aw3t8"."userid_id" = stehule."id" ) ) GROUP BY 1 ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2293499.45..2293500.13 rows=68 width=12) (actual time=37357.967..37357.976 rows=31 loops=1)
-> Hash Join (cost=102075.28..1809319.02 rows=32278695 width=12) (actual time=1814.232..34290.818 rows=7070141 loops=1)
Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = stehule.id)
-> Seq Scan on f_emails_aaekn159p5aw3t8 (cost=0.00..854559.95 rows=32278695 width=12) (actual time=0.007..14066.754 rows=32211565 loops=1)
-> Hash (cost=47757.68..47757.68 rows=3310768 width=8) (actual time=1806.453..1806.453 rows=3310768 loops=1)
Buckets: 4096 Batches: 256 Memory Usage: 518kB
-> Seq Scan on stehule (cost=0.00..47757.68 rows=3310768 width=8) (actual time=0.007..781.672 rows=3310768 loops=1)
Total runtime: 37359.264 ms
(8 rows)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# select version();
version
---------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.2.8 on x86_64-gdc-linux-gnu, with GoodData patches, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit
(1 row)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# \dt+ stehule
List of relations
Schema | Name | Type | Owner | Size | Description
--------+---------+-------+----------+--------+-------------
public | stehule | table | postgres | 114 MB |
(1 row)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# \dt+ f_users_aax8qg0e23asx5h
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-------------------------+-------+-------+---------+-------------
public | f_users_aax8qg0e23asx5h | table | beard | 5878 MB |
(1 row)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# \dt+ f_emails_aaekn159p5aw3t8
List of relations
Schema | Name | Type | Owner | Size | Description
--------+--------------------------+-------+-------+---------+-------------
public | f_emails_aaekn159p5aw3t8 | table | beard | 4156 MB |
(1 row)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT "f_users_aax8qg0e23asx5h"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN "f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS "c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS "c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS "def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( ( "f_emails_aaekn159p5aw3t8" INNER JOIN "f_users_aax8qg0e23asx5h" ON ( "f_emails_aaekn159p5aw3t8"."userid_id" = "f_users_aax8qg0e23asx5h"."id" ) ) INNER JOIN "dwh_dm_abnblk9j5oagorw" ON ( "f_users_aax8qg0e23asx5h"."dt_registrationdatetime_id" = "dwh_dm_abnblk9j5oagorw"."id" ) ) WHERE ( 2014 = "dwh_dm_abnblk9j5oagorw"."id_year" ) GROUP BY 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=40076389.70..40076448.70 rows=59 width=12) (actual time=31124.643..31124.658 rows=31 loops=1)
-> Nested Loop (cost=2773.92..40073185.42 rows=213619 width=12) (actual time=37.945..27670.473 rows=7070141 loops=1)
-> Hash Join (cost=2773.92..10180068.58 rows=61735 width=8) (actual time=0.951..8934.170 rows=3310768 loops=1)
Hash Cond: (f_users_aax8qg0e23asx5h.dt_registrationdatetime_id = dwh_dm_abnblk9j5oagorw.id)
-> Seq Scan on f_users_aax8qg0e23asx5h (cost=0.00..10080578.00 rows=9328442 width=12) (actual time=0.004..4297.579 rows=9322096 loops=1)
-> Hash (cost=2408.01..2408.01 rows=365 width=4) (actual time=0.919..0.919 rows=365 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Bitmap Heap Scan on dwh_dm_abnblk9j5oagorw (cost=386.27..2408.01 rows=365 width=4) (actual time=0.104..0.733 rows=365 loops=1)
Recheck Cond: (2014 = id_year)
-> Bitmap Index Scan on dwh_dm_abnblk9j5oagorw_id_year_idx (cost=0.00..386.18 rows=365 width=0) (actual time=0.059..0.059 rows=365 loops=1)
Index Cond: (2014 = id_year)
-> Index Scan using f_emails_aaekn159p5aw3t8_userid_id_idx on f_emails_aaekn159p5aw3t8 (cost=0.00..396.22 rows=88 width=12) (actual time=0.002..0.004 rows=2 loops=3310768)
Index Cond: (userid_id = f_users_aax8qg0e23asx5h.id)
Total runtime: 31125.975 ms
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
Hi,

On 2014-06-26 14:10, Pavel Stehule wrote:
> Hello all,
>
> today I had to work with one slow query - when I checked different scenarios I found a dependency on work_mem size - but it is atypical - bigger work_mem increased query execution to 31 minutes (600MB work_mem) versus 1 minute (1MB).

The problem described in Pavel's emails (illustrated by the awful explain plans) is in this part:

  -> Hash (cost=881801.58..881801.58 rows=61735 width=8) (actual time=9076.153..9076.153 rows=3310768 loops=1)

That is, the estimated number of rows is ~60k, but in reality it's ~3.3M. This then leads to a hash table with a small number of buckets (8192) containing a large number of tuples (~400 per bucket in this case) in a linked list, which significantly slows down the lookups during the hash join.

This issue is actually quite common - all it takes is a significant underestimation of the hashed relation, either because it's a complex join (thus inherently difficult to estimate) or because the WHERE conditions are not independent (see the example below).

The attached patch (early WIP, after ~30 minutes of hacking) addresses this by increasing the number of buckets whenever the average number of tuples per bucket exceeds 1.5x NTUP_PER_BUCKET.

======= Example ========

create table table_a (id int, fk_id int);
create table table_b (fk_id int, val_a int, val_b int);

insert into table_a select i, i from generate_series(1,10000000) s(i);
insert into table_b select i, mod(i,1000), mod(i,1000) from generate_series(1,10000000) s(i);

analyze table_a;
analyze table_b;

explain analyze select count(*) from table_a join table_b on (table_a.id = table_b.fk_id) where val_a < 10 and val_b < 10;

without the patch:

                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=385834.56..385834.57 rows=1 width=0) (actual time=49543.263..49543.264 rows=1 loops=1)
   ->  Hash Join  (cost=204069.89..385831.16 rows=1359 width=0) (actual time=923.751..49531.554 rows=100000 loops=1)
         Hash Cond: (table_a.id = table_b.fk_id)
         ->  Seq Scan on table_a  (cost=0.00..144247.77 rows=9999977 width=4) (actual time=0.104..967.090 rows=10000000 loops=1)
         ->  Hash  (cost=204052.90..204052.90 rows=1359 width=4) (actual time=923.615..923.615 rows=100000 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 3516kB
               ->  Seq Scan on table_b  (cost=0.00..204052.90 rows=1359 width=4) (actual time=0.086..910.656 rows=100000 loops=1)
                     Filter: ((val_a < 10) AND (val_b < 10))
                     Rows Removed by Filter: 9900000
 Planning time: 0.271 ms
 Execution time: 49545.053 ms
(11 rows)

with the patch:

                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=385834.56..385834.57 rows=1 width=0) (actual time=9780.346..9780.347 rows=1 loops=1)
   ->  Hash Join  (cost=204069.89..385831.16 rows=1359 width=0) (actual time=939.297..9772.256 rows=100000 loops=1)
         Hash Cond: (table_a.id = table_b.fk_id)
         ->  Seq Scan on table_a  (cost=0.00..144247.77 rows=9999977 width=4) (actual time=0.103..962.446 rows=10000000 loops=1)
         ->  Hash  (cost=204052.90..204052.90 rows=1359 width=4) (actual time=939.172..939.172 rows=100000 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 3516kB
               ->  Seq Scan on table_b  (cost=0.00..204052.90 rows=1359 width=4) (actual time=0.064..909.191 rows=100000 loops=1)
                     Filter: ((val_a < 10) AND (val_b < 10))
                     Rows Removed by Filter: 9900000
 Planning time: 0.276 ms
 Execution time: 9782.295 ms
(11 rows)

Time: 9784.392 ms

So the duration improved significantly - from ~52 seconds to ~10 seconds. The patch is not perfect, it needs a bit more love, as illustrated by the FIXME/TODO items. Feel free to comment.

regards
Tomas
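[A minimal standalone sketch of the growth trigger described above - this is not the actual nodeHash.c patch code; NTUP_PER_BUCKET and the 1.5x threshold come from the message, while the struct and function names are invented for illustration.]

/* sketch_bucket_growth.c - illustration only, not the patch */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NTUP_PER_BUCKET 10              /* historical default in nodeHash.c */

typedef struct SketchHashTable
{
    uint64_t    nbuckets;               /* current number of buckets (power of 2) */
    uint64_t    ntuples;                /* tuples inserted so far */
} SketchHashTable;

/* grow once the average chain length exceeds 1.5 * NTUP_PER_BUCKET */
static bool
needs_bucket_growth(const SketchHashTable *ht)
{
    return ht->ntuples > (uint64_t) (1.5 * NTUP_PER_BUCKET) * ht->nbuckets;
}

int
main(void)
{
    SketchHashTable ht = {8192, 0};     /* 8192 buckets, as in the bad plan above */

    /* simulate inserting the ~3.3M tuples from the underestimated plan */
    for (uint64_t i = 0; i < 3310768; i++)
    {
        ht.ntuples++;
        if (needs_bucket_growth(&ht))
            ht.nbuckets *= 2;           /* the real patch also relinks the tuples */
    }
    printf("final nbuckets: %llu\n", (unsigned long long) ht.nbuckets);
    return 0;
}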
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
Attached is v2 of the patch, with some cleanups / minor improvements:

* improved comments, whitespace fixes, TODOs etc.

* tracking the initial # of buckets (similar to the initial # of batches)

* adding info about buckets to EXPLAIN ANALYZE, similar to batches - I didn't want to make it overly complex, so the info about the initial bucket/batch count is added only if at least one of them is modified

* modified threshold triggering the growth, to get NTUP_PER_BUCKET on average (see the NTUP_GROW_THRESHOLD comment in nodeHash.c)

* there's a single FIXME, related to counting tuples in the batch

One thing that's important to note is the difference between the # of batches and the # of buckets. While the # of batches is "global", the # of buckets is "within a batch". So theoretically each batch can use a different number of buckets. However the value is reused between batches, so it only grows. That means this is possible:

initial: 1024 buckets (before 1st batch)
batch 1: 1024 buckets
batch 2: 1024 buckets
batch 3: 4096 buckets
batch 4: 8192 buckets

while this is not:

initial: 1024 buckets (before 1st batch)
batch 1: 1024 buckets
batch 2: 4096 buckets
batch 3: 1024 buckets
batch 4: 8192 buckets

However, in practice I expect the first batch to do all the work, and the following batches will just reuse the same number of buckets. This of course assumes the batches have similar tuple sizes etc. So the first batch will do all the reshuffling of the table, and the following batches will reuse the 'right' number of buckets from the start.

regards
Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 26.6.2014 20:43, Tomas Vondra wrote:
> Attached is v2 of the patch, with some cleanups / minor improvements:
>
> * there's a single FIXME, related to counting tuples in the batch

Meh, I couldn't resist resolving this FIXME, so attached is v3 of the patch. This just adds a proper 'batch tuples' counter to the hash table.

All comments, measurements on different queries etc. welcome. We'll certainly do a lot of testing, because this was a big issue for us.

regards
Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 26.6.2014 23:48, Tomas Vondra wrote:
> On 26.6.2014 20:43, Tomas Vondra wrote:
>> Attached is v2 of the patch, with some cleanups / minor improvements:
>>
>> * there's a single FIXME, related to counting tuples in the batch
>
> Meh, I couldn't resist resolving this FIXME, so attached is v3 of the patch. This just adds a proper 'batch tuples' counter to the hash table.
>
> All comments, measurements on different queries etc. welcome. We'll certainly do a lot of testing, because this was a big issue for us.

Attached is v4 of the patch, with a few minor improvements. The only thing worth mentioning is overflow protection, similar to what's done in the ExecChooseHashTableSize() function. Otherwise it's mostly about improving comments.

Also attached is a v4 with GUC, making it easier to compare the effect of the patch, by simply setting "enable_hashjoin_bucket" to "off" (original behaviour) or "on" (new behaviour).

And finally there's an SQL script demonstrating the effect of the patch with various work_mem settings. For example what I see on my desktop is this (averages from 3 runs):

===== SMALL WORK MEM (2MB) =====

            no dynamic buckets    dynamic buckets
 query A         5945 ms              5969 ms
 query B         6080 ms              5943 ms
 query C         6531 ms              6822 ms
 query D         6962 ms              6618 ms

===== MEDIUM WORK MEM (16MB) =====

            no dynamic buckets    dynamic buckets
 query A         7955 ms              7944 ms
 query B         9970 ms              7569 ms
 query C         8643 ms              8560 ms
 query D        33908 ms              7700 ms

===== LARGE WORK MEM (64MB) =====

            no dynamic buckets    dynamic buckets
 query A        10235 ms             10233 ms
 query B        32229 ms              9318 ms
 query C        14500 ms             10554 ms
 query D       213344 ms              9145 ms

Where "A" is "exactly estimated" and the other queries suffer from various underestimates.

My observations from this are:

(1) For small work_mem values it does not really matter, thanks to caching effects (the whole hash table fits into the L2 CPU cache).

(2) For medium work_mem values (not really huge, but exceeding CPU caches), the differences are negligible, except for the last query with the most severe underestimate. In that case the new behaviour is much faster.

(3) For large work_mem values, the speedup is pretty obvious and dependent on the underestimate.

The question is why to choose large work_mem values when the smaller values actually perform better. Well, the example tables are not perfectly representative. In case the outer table is much larger and does not fit into RAM that easily (which is the case of large fact tables or joins), the rescans (because of more batches) are more expensive and outweigh the caching benefits. Also, work_mem is shared with other nodes, e.g. aggregates, and decreasing it because of hash joins would hurt them.

regards
Tomas
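[As a rough usage sketch: the queries A-D live in the attached SQL script, which is not reproduced here, so the session below reuses the table_a/table_b query from the first message; the GUC name is the one mentioned above and exists only in the GUC build of the WIP patch.]

SET work_mem = '64MB';

SET enable_hashjoin_bucket = off;   -- original behaviour
EXPLAIN ANALYZE SELECT count(*)
  FROM table_a JOIN table_b ON (table_a.id = table_b.fk_id)
 WHERE val_a < 10 AND val_b < 10;

SET enable_hashjoin_bucket = on;    -- new behaviour (dynamic bucket count)
EXPLAIN ANALYZE SELECT count(*)
  FROM table_a JOIN table_b ON (table_a.id = table_b.fk_id)
 WHERE val_a < 10 AND val_b < 10;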
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
Hi,

attached is v5 of the patch. The main change is that scaling the number of buckets is done only once, after the initial hash table is built. The main advantage of this is a lower cost. This also allowed some cleanup of unnecessary code.

However, this new patch causes warnings like this:

  WARNING: temporary file leak: File 231 still referenced

I've been eyeballing the code for a while now, but I still fail to see where this comes from :-( Any ideas?

regards
Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 30.6.2014 23:12, Tomas Vondra wrote:
> Hi,
>
> attached is v5 of the patch. The main change is that scaling the number of buckets is done only once, after the initial hash table is built. The main advantage of this is a lower cost. This also allowed some cleanup of unnecessary code.
>
> However, this new patch causes warnings like this:
>
> WARNING: temporary file leak: File 231 still referenced
>
> I've been eyeballing the code for a while now, but I still fail to see where this comes from :-( Any ideas?

Meh, the patches are wrong - I hadn't realized the tight coupling between buckets/batches in ExecHashGetBucketAndBatch:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

The previous patches worked mostly by pure luck (the nbuckets was set only in the first batch), but once I moved the code to the end, it started failing. And by "worked" I mean "didn't throw an error, but probably returned bogus results with (nbatch>1)".

However, ISTM this nbuckets-nbatch coupling is not really necessary; it's only constructed this way to assign independent batch/bucket. So I went and changed the code like this:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

I.e. using the top bits for batchno, low bits for bucketno (as before).

Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far).

regards
Tomas
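[To make the change concrete, here is a small self-contained program comparing the two variants. The two pairs of assignments are the ones quoted above; the function names, the nbatch==1 guard and the hard-coded nbuckets/nbatch values are only scaffolding for the illustration.]

/* bucket_batch_split.c - illustration of the old vs. new bit split */
#include <stdint.h>
#include <stdio.h>

static void
split_old(uint32_t hashvalue, uint32_t nbuckets, int log2_nbuckets,
          uint32_t nbatch, uint32_t *bucketno, uint32_t *batchno)
{
    /* HEAD: low bits select the bucket, the bits right above them the batch */
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
}

static void
split_new(uint32_t hashvalue, uint32_t nbuckets, int log2_nbatch,
          uint32_t *bucketno, uint32_t *batchno)
{
    /* patched: low bits select the bucket, the topmost bits the batch
     * (guard added here so the single-batch case avoids a 32-bit shift) */
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (log2_nbatch == 0) ? 0 : (hashvalue >> (32 - log2_nbatch));
}

int
main(void)
{
    uint32_t    bucketno, batchno;
    uint32_t    hash = 0xDEADBEEF;

    split_old(hash, 1024, 10, 16, &bucketno, &batchno);
    printf("old: bucket=%u batch=%u\n", bucketno, batchno);

    split_new(hash, 1024, 4, &bucketno, &batchno);
    printf("new: bucket=%u batch=%u\n", bucketno, batchno);
    return 0;
}

[With the old formula the batch bits sit directly above the bucket bits, so letting nbuckets grow would change which bits feed the batch number; with the new split the bucket bits (bottom of the hash) and batch bits (top of the hash) stay disjoint as long as log2_nbuckets + log2_nbatch <= 32.]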
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Atri Sharma
Date:
On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 30.6.2014 23:12, Tomas Vondra wrote:
> > Hi,
> >
> > attached is v5 of the patch. The main change is that scaling the number of buckets is done only once, after the initial hash table is built. The main advantage of this is a lower cost. This also allowed some cleanup of unnecessary code.
> >
> > However, this new patch causes warnings like this:
> >
> > WARNING: temporary file leak: File 231 still referenced
> >
> > I've been eyeballing the code for a while now, but I still fail to see where this comes from :-( Any ideas?
>
> Meh, the patches are wrong - I hadn't realized the tight coupling between buckets/batches in ExecHashGetBucketAndBatch:
>
>     *bucketno = hashvalue & (nbuckets - 1);
>     *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
>
> The previous patches worked mostly by pure luck (the nbuckets was set only in the first batch), but once I moved the code to the end, it started failing. And by "worked" I mean "didn't throw an error, but probably returned bogus results with (nbatch>1)".
>
> However, ISTM this nbuckets-nbatch coupling is not really necessary; it's only constructed this way to assign independent batch/bucket. So I went and changed the code like this:
>
>     *bucketno = hashvalue & (nbuckets - 1);
>     *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
>
> I.e. using the top bits for batchno, low bits for bucketno (as before).
>
> Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far).
Hi,
I had a look at the patch and I was wondering if there is a way we can set a cap on the resized size, and increase the number of batches instead? Since this patch evaluates the growth of tuples vs the increase of space, we could look at increasing the number of batches itself instead of the number of buckets, if the resized number of buckets exceeds a threshold. It shouldn't be too hard; AIUI it would involve a call in MultiExecHash right after the new cost evaluation the patch adds.
It isn't a target of the current patch, but I think it is a logical extension of it.
Thoughts/Comments?
Regards,
Atri
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 1.7.2014 01:24, Tomas Vondra wrote:
> On 30.6.2014 23:12, Tomas Vondra wrote:
>> Hi,
>
> Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far).

Attached v7, fixing nbatch/ntuples in an assert.

regards
Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 3.7.2014 15:42, Atri Sharma wrote:
> On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 30.6.2014 23:12, Tomas Vondra wrote:
>>> Hi,
>>>
>>> attached is v5 of the patch. The main change is that scaling the number of buckets is done only once, after the initial hash table is built. The main advantage of this is a lower cost. This also allowed some cleanup of unnecessary code.
>>>
>>> However, this new patch causes warnings like this:
>>>
>>> WARNING: temporary file leak: File 231 still referenced
>>>
>>> I've been eyeballing the code for a while now, but I still fail to see where this comes from :-( Any ideas?
>>
>> Meh, the patches are wrong - I hadn't realized the tight coupling between buckets/batches in ExecHashGetBucketAndBatch:
>>
>>     *bucketno = hashvalue & (nbuckets - 1);
>>     *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
>>
>> The previous patches worked mostly by pure luck (the nbuckets was set only in the first batch), but once I moved the code to the end, it started failing. And by "worked" I mean "didn't throw an error, but probably returned bogus results with (nbatch>1)".
>>
>> However, ISTM this nbuckets-nbatch coupling is not really necessary; it's only constructed this way to assign independent batch/bucket. So I went and changed the code like this:
>>
>>     *bucketno = hashvalue & (nbuckets - 1);
>>     *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
>>
>> I.e. using the top bits for batchno, low bits for bucketno (as before).
>>
>> Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far).
>
> Hi,
>
> I had a look at the patch and I was wondering if there is a way we can set a cap on the resized size, and increase the number of batches instead? Since this patch evaluates the growth of tuples vs the increase of space, we could look at increasing the number of batches itself instead of the number of buckets, if the resized number of buckets exceeds a threshold. It shouldn't be too hard; AIUI it would involve a call in MultiExecHash right after the new cost evaluation the patch adds.

So you propose to fill the hash table, and when hitting NTUP_PER_BUCKET (i.e. after adding 'nbuckets * NTUP_PER_BUCKET' tuples) increase the number of batches?

Hmm, that'd be easy to implement. I don't think it's possible to do that in MultiExecHash (that's too late). But it's a matter of changing this condition in ExecHashTableInsert:

    if (hashtable->spaceUsed > hashtable->spaceAllowed)
        ExecHashIncreaseNumBatches(hashtable);

to something like this:

    if ((hashtable->spaceUsed > hashtable->spaceAllowed) ||
        (hashtable->totalTuples / hashtable->nbatch
             == NTUP_PER_BUCKET * hashtable->nbuckets))
        ExecHashIncreaseNumBatches(hashtable);

But I don't think increasing the number of batches is a good approach, as it means more rescans. I think using memory is more efficient (after all, that's what work_mem is for, right?).

regards
Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 3.7.2014 19:36, Tomas Vondra wrote:
> On 1.7.2014 01:24, Tomas Vondra wrote:
>> On 30.6.2014 23:12, Tomas Vondra wrote:
>>> Hi,
>>
>> Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far).
>
> Attached v7, fixing nbatch/ntuples in an assert.

Attached is v8 of this patch, significantly reworked based on the related discussion (mostly in 'tweaking NTUP_PER_BUCKET').

With the patch applied, the hash table works like this:

initial sizing (ExecChooseHashTableSize)
----------------------------------------
- size nbuckets for NTUP_PER_BUCKET=1
- account for memory allocated for buckets

building the hash table
-----------------------
- while adding tuples, keep track of the optimal number of buckets (for the current number of tuples per batch)
- don't resize until all the tuples are added (by increasing nbatch the number of buckets may decrease)
- after adding all tuples, consider resizing (optimal nbuckets != initial nbuckets)
- do the resize

This means that for well estimated plans (reasonably exact tuple count for the inner relation), the hash table is properly sized and no resize is needed. For badly estimated plans, this works about the same as the previous patches (we're accounting for memory needed for buckets etc.).

More batches or fewer buckets?
------------------------------
In the related threads, I repeatedly mentioned that I don't know a good answer to "Should I add more batches or decrease the number of buckets?" while sizing the hash table. The current code (as in HEAD) does not face this dilemma because (a) it does not count buckets into work_mem and (b) does not allow changing nbuckets.

I still don't have a good answer to this question, but I think the patch can stand even without it. If you want more/fewer batches, use smaller/larger work_mem - same as with the current code. Also, with the 'dense allocation' patch [1], it's possible to use larger work_mem values (and thus compensate for counting buckets into work_mem).

Changes since v7
----------------
(a) remove NTUP_GROW_THRESHOLD (just use NTUP_PER_BUCKET directly)
(b) set NTUP_PER_BUCKET=1
(c) include buckets (optimal) when considering nbatch increase
(d) track optimal number of buckets (for NTUP_PER_BUCKET)
(e) after loading all the tuples, resize the hash table to get nbuckets_optimal (if needed)
(f) renamed GUC to enable_hash_resize (makes a bit more sense)

Comments
--------
(a) NTUP_GROW_THRESHOLD was overcomplicated and mostly a result of misunderstanding how NTUP_PER_BUCKET works (it's an upper threshold, not an average) and is not really needed.

(b) The value "1" gives the best performance, but also requires more memory for the buckets. This should be more than compensated by densely allocating the tuples (see hashjoin-alloc-v3.patch in the "tweaking NTUP_PER_BUCKET" thread [1]).

(c,d) Originally, the memory needed for buckets was not included in 'spaceUsed', but because of the NTUP_PER_BUCKET=1 change this is not really acceptable (it needs much more memory than before). So now it's included both in the initial sizing of the hash table, and when adding more tuples (nbuckets_optimal). Still, spaceUsed does not include palloc overhead (which may be a significant amount of memory). This is tackled by [1].

(e) No change here.

(f) A bit better GUC name. Anyway, this is for easier development only, and won't be included in the final patch.

regards
Tomas

[1] http://www.postgresql.org/message-id/53C2DECC.2010408@fuzzy.cz
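[A rough sketch of the accounting and tracking described in (c,d), with invented helper names and an assumed tuple size - the real logic lives in ExecChooseHashTableSize() and ExecHashTableInsert(). With NTUP_PER_BUCKET=1 the bucket-pointer array becomes a noticeable fraction of the budget, so it is charged against work_mem together with the tuples, and the optimal bucket count is tracked as tuples arrive while the actual resize is deferred.]

/* sketch of v8-style accounting; names and structure are illustrative only */
#include <stdint.h>
#include <stdio.h>

#define NTUP_PER_BUCKET 1               /* v8 lowers this from 10 to 1 */

/* smallest power of two >= n (stand-in for the my_log2-based code) */
static uint64_t
next_pow2(uint64_t n)
{
    uint64_t    p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* memory charged against work_mem: tuples plus the bucket-pointer array */
static uint64_t
space_used(uint64_t ntuples, uint64_t tupsize, uint64_t nbuckets)
{
    return ntuples * tupsize + nbuckets * sizeof(void *);
}

/* track the optimal bucket count per inserted tuple; the resize itself is
 * deferred until the build finishes (or is folded into the first rebatch) */
static void
track_optimal_buckets(uint64_t ntuples_in_batch, uint64_t *nbuckets_optimal)
{
    uint64_t    opt = next_pow2(ntuples_in_batch / NTUP_PER_BUCKET + 1);

    if (opt > *nbuckets_optimal)
        *nbuckets_optimal = opt;
}

int
main(void)
{
    uint64_t    nbuckets_optimal = 1024;    /* initial, estimate-based sizing */

    /* feed in the ~3.3M tuples from the plans above, one batch assumed */
    for (uint64_t n = 1; n <= 3310768; n++)
        track_optimal_buckets(n, &nbuckets_optimal);

    printf("optimal buckets: %llu, charged memory: %llu bytes\n",
           (unsigned long long) nbuckets_optimal,
           (unsigned long long) space_used(3310768, 48, nbuckets_optimal));
    return 0;
}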
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
Attached v9 of the patch. Aside from a few minor fixes, the main change is that this is assumed to be combined with the "dense allocation" patch.

It also rewrites ExecHashIncreaseNumBuckets to follow the same pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly, instead of buckets). It's cleaner and more consistent.

hashjoin-nbuckets-growth-v9.patch contains only this patch, so you need to grab hashjoin-alloc-v4.patch from a different thread and apply it first.

hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined.

regards
Tomas Vondra
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 20.7.2014 18:29, Tomas Vondra wrote:
> Attached v9 of the patch. Aside from a few minor fixes, the main change is that this is assumed to be combined with the "dense allocation" patch.
>
> It also rewrites ExecHashIncreaseNumBuckets to follow the same pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly, instead of buckets). It's cleaner and more consistent.
>
> hashjoin-nbuckets-growth-v9.patch contains only this patch, so you need to grab hashjoin-alloc-v4.patch from a different thread and apply it first.
>
> hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined.

I just noticed that the changes made to ExecHashGetBucketAndBatch may not be perfectly fine - it does not "break" the hash join (i.e. the results are still correct), but I believe it may cause more batches. But I'm not entirely sure about it, or how to fix that.

First, an intro into ExecHashGetBucketAndBatch internals, and how the patch changes that. If not interested, skip to the "problem" section below.

The "old" ExecHashGetBucketAndBatch did this:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

i.e. given the 32-bit hash value, the lowest log2_nbuckets bits are used to determine the bucket number. The rest of the hash (after removing the bits used for buckets) is used for the batch. I.e. something like this:

    31               23               15               7                0
    |----------------|----------------|----------------|----------------|
    |              <- batches         |             buckets             |

This worked fine, because the number of bits for buckets was fixed, and only the number of batches was growing. This was done by adding most-significant bits (as illustrated by the tiny arrow). So when there were 4 bits for the batch number, after adding another bit (doubling nbatches) batch '0000' would be split either into '00000' or '10000'.

With a dynamic bucket count this does not work, because the batches and buckets need to be independent (i.e. derived from non-overlapping parts of the hash). The simplest approach of 'moving batches around' does not work, because that would break this assert:

    Assert(batchno > curbatch);

In other words "don't move tuples to already-processed batches". So the batch number for a tuple needs to be sufficiently stable, and only ever increase (never decrease). So what I did is this:

    31               23               15               7                0
    |----------------|----------------|----------------|----------------|
    | batches ->     |                                  |  <- buckets   |

and this is what happens in ExecHashGetBucketAndBatch:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

So the "bucketno" expression is exactly the same (but now the nbuckets may change dynamically), but "batchno" is now computed from the other end of the hash (highest bits), and grows by adding a least-significant bit.

Problem
-------

The original implementation had the nice feature that each increase of the number of batches (nbatches *= 2) split each batch in half. Half the tuples stayed in the batch, half the tuples moved to one of the newly created batches, thanks to adding a most-significant bit. The tuples getting a 0 bit stayed in the same batch, tuples getting a 1 moved to the new one (and it was guaranteed to be one of the new ones). It's slightly more complicated because of the lazy behavior when repeatedly incrementing the number of batches, but this is the principle. Keep 1/2 the tuples, move 1/2 to another batch.

The new implementation changes this, because the batch number grows in the opposite direction - it adds a least-significant bit, so for example batch '1111' gets split into '11111' and '11110'. It's roughly equal to (batchno << 1) + K where K is either 0 or 1, which is always >= the old batchno. So it does not violate the Assert (which is why I haven't noticed this earlier). But it breaks the desirable behavior that 1/2 the tuples remain in the current batch, and instead moves a lot of them to batches further down the road, and needlessly increases the number of batches. At least that's how I understand it ...

Possible solutions
------------------

Adding a least-significant bit does not work, we need to get back to adding the most-significant one. Not sure what's the least complex way to do that, though.

I'm thinking about computing the nbuckets limit (how many buckets may fit into work_mem):

    tupsize = HJTUPLE_OVERHEAD +
              MAXALIGN(sizeof(MinimalTupleData)) +
              MAXALIGN(tupwidth);

    nbuckets_bits_max = my_log2(work_mem / tupsize)

and then start with nbatches from this bit, like this:

    31               23      max      15               7                0
    |----------------|----------------|----------------|----------------|
    |  <- batches    |      free      |           <- buckets            |

That might work, I guess. But maybe there's something better (using some secret bitwise-operator-fu ...).

Further concerns
----------------

I'm wondering whether we should deploy some safe-guards against exceeding the 32 bits in the hash. It probably was not a big issue before, but with large work_mem values and NTUP_PER_BUCKET=1 this might become a problem.

With work_mem=1GB, and 50B tuples (including overhead):

    buckets = 21474836

which means ~25 bits reserved for the bucket number. That leaves only 7 bits for the batch number. Yeah, that's ~128 batches (so a ~128GB hash table), but maybe it's not that far off.

Also, what if the tupwidth estimate is too low? In that case the nbuckets_bits_max would be artificially high (and we'd have to do more batches).

regards
Tomas
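[The arithmetic from the 'possible solutions' and 'further concerns' sections can be reproduced with a few lines of C; the tupsize/nbuckets_bits_max formulas are the ones above, while the hard-coded 1GB work_mem and 50-byte tuple size are just the example values used there.]

/* nbuckets_bits_max.c - reproduces the arithmetic from the message above */
#include <stdint.h>
#include <stdio.h>

/* smallest k such that 2^k >= n, mirroring my_log2() (which computes a ceiling) */
static int
log2_ceil(uint64_t n)
{
    int         bits = 0;
    uint64_t    p = 1;

    while (p < n)
    {
        p <<= 1;
        bits++;
    }
    return bits;
}

int
main(void)
{
    uint64_t    work_mem = 1024ULL * 1024 * 1024;   /* 1GB, as in the example */
    uint64_t    tupsize = 50;                       /* ~50B per tuple incl. overhead */
    uint64_t    max_buckets = work_mem / tupsize;   /* = 21474836 */
    int         nbuckets_bits_max = log2_ceil(max_buckets);    /* = 25 */
    int         batch_bits = 32 - nbuckets_bits_max;           /* = 7  */

    printf("max buckets: %llu (%d bits), %d bits left for batches (~%u batches)\n",
           (unsigned long long) max_buckets, nbuckets_bits_max,
           batch_bits, 1u << batch_bits);
    return 0;
}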
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Sat, Aug 9, 2014 at 9:13 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Adding a least-significant bit does not work, we need to get back to adding the most-significant one. Not sure what's the least complex way to do that, though.
>
> I'm thinking about computing the nbuckets limit (how many buckets may fit into work_mem):
>
>     tupsize = HJTUPLE_OVERHEAD +
>               MAXALIGN(sizeof(MinimalTupleData)) +
>               MAXALIGN(tupwidth);
>
>     nbuckets_bits_max = my_log2(work_mem / tupsize)
>
> and then start with nbatches from this bit, like this:
>
>     31               23      max      15               7                0
>     |----------------|----------------|----------------|----------------|
>     |  <- batches    |      free      |           <- buckets            |

That doesn't seem unreasonable provided there's enough bit space, but, as you point out, the day might not be far off when there isn't.

It also strikes me that when there's only 1 batch, the set of bits that map onto the batch number is zero-width, and one zero-width bit range is as good as another. In other words, if you're only planning to do one batch, you can easily grow the number of buckets on the fly. Growing the number of buckets only becomes difficult once you have more than one batch.

And I mention that because, in the scenario mentioned in your original post ("a hash table with small number of buckets (8192) containing large number of tuples [~3.3M]"), presumably what happens is you start out thinking you are going to have one batch with 8K buckets. Then, as more tuples continue to pour in, you see the load factor rising and respond by bumping up the size of the hash table. Now eventually you realize, gosh, this isn't even going to fit into work_mem, so you need to start batching it. But by that point you've presumably done as much as you're going to do in terms of increasing the number of buckets; after that, you're just going to add more batches as necessary. So not being able to further increase the number of buckets once the number of batches is no longer 1 wouldn't be a problem in that case.

Now if you start out planning for multiple batches, and then you discover you'd like to keep the same number of batches but have more buckets per batch, that's gonna be hard. It's not impossible, because you could steal some of the unused high-order bits above the number of batches, and then concatenate them with the low-order bits that you already had, but that seems like kind of an ugly kludge, and AFAICS it only helps if the new tuples are narrower by enough to justify the cost of splitting all the buckets. I won't say that couldn't happen, but I think it would be better to keep that complexity out of the patch unless we're absolutely certain it's justified.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 11.8.2014 20:25, Robert Haas wrote:
> On Sat, Aug 9, 2014 at 9:13 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Adding a least-significant bit does not work, we need to get back to adding the most-significant one. Not sure what's the least complex way to do that, though.
>>
>> I'm thinking about computing the nbuckets limit (how many buckets may fit into work_mem):
>>
>>     tupsize = HJTUPLE_OVERHEAD +
>>               MAXALIGN(sizeof(MinimalTupleData)) +
>>               MAXALIGN(tupwidth);
>>
>>     nbuckets_bits_max = my_log2(work_mem / tupsize)
>>
>> and then start with nbatches from this bit, like this:
>>
>>     31               23      max      15               7                0
>>     |----------------|----------------|----------------|----------------|
>>     |  <- batches    |      free      |           <- buckets            |
>
> That doesn't seem unreasonable provided there's enough bit space, but, as you point out, the day might not be far off when there isn't.

Thinking about this a bit more, I believe the danger is not that imminent. 32 bits mean the Hash node should handle 2^32 tuples in total (possibly split into multiple batches). It used to be "2^32 buckets" thanks to NTUP_PER_BUCKET=10, but the patch lowers this to 1, so it's "tuples" now.

But while we're a bit closer to exhausting the 32 bits, I think it's not really that big an issue. Not only is hashing a table with ~4e9 rows going to be a hellish experience, but if we really want to support it, we should probably switch to 64-bit hashes. Adding some checks is not going to help - it may detect an issue, but it won't fix it. And actually, even if the two bit ranges overlap, it does not mean the hashjoin will stop working. There'll be a dependency between batches and buckets, causing non-uniform usage of the buckets, but it won't blow up.

So I don't think we need to worry about exhausting the 32 bits for now.

> It also strikes me that when there's only 1 batch, the set of bits that map onto the batch number is zero-width, and one zero-width bit range is as good as another. In other words, if you're only planning to do one batch, you can easily grow the number of buckets on the fly. Growing the number of buckets only becomes difficult once you have more than one batch.

Yes, that's true. It's also roughly what the patch does IMHO. If you know you'll need more batches, you can start with the maximum number of buckets right away (because you expect there's more data than work_mem), so shoot for nbuckets = my_log2(work_mem/tuple_size) or something like that.

It's also true that if you're starting with a single batch, you can do whatever you want with nbuckets until you need to do (nbatches *= 2). It's also true that once you're done with the first batch, all the other batches will be just fine with the number of buckets, because the batches will be about equally sized.

But I think this is pretty much what the patch does, in the end.

> And I mention that because, in the scenario mentioned in your original post ("a hash table with small number of buckets (8192) containing large number of tuples [~3.3M]"), presumably what happens is you start out thinking you are going to have one batch with 8K buckets. Then, as more tuples continue to pour in, you see the load factor rising and respond by bumping up the size of the hash table. Now eventually you realize, gosh, this isn't even going to fit into work_mem, so you need to start batching it. But by that point you've presumably done as much as you're going to do in terms of increasing the number of buckets; after that, you're just going to add more batches as necessary. So not being able to further increase the number of buckets once the number of batches is no longer 1 wouldn't be a problem in that case.
>
> Now if you start out planning for multiple batches, and then you discover you'd like to keep the same number of batches but have more buckets per batch, that's gonna be hard. It's not impossible, because you could steal some of the unused high-order bits above the number of batches, and then concatenate them with the low-order bits that you already had, but that seems like kind of an ugly kludge, and AFAICS it only helps if the new tuples are narrower by enough to justify the cost of splitting all the buckets. I won't say that couldn't happen, but I think it would be better to keep that complexity out of the patch unless we're absolutely certain it's justified.

I'm definitely in favor of keeping the patch as simple as possible.

I was considering reversing the bits of the hash, because that's pretty much the simplest solution. But I think you're right it might actually work like this:

* Are more batches needed?

  (yes) => just use nbuckets = my_log2(work_mem / tuple_size)

  (no) => go ahead, until processing all tuples or hitting work_mem

         (work_mem) => meh, use the same nbuckets above

         (all tuples) => compute optimal nbuckets / resize

But I need to think about this a bit. So far it seems to me there's no way additional batches might benefit from increasing nbuckets further. Each batch should contain about the same number of tuples, or more correctly, the same number of distinct hash values. But if there are multiple tuples with the same hash value, increasing the number of buckets is not going to help (they're still going to end up in the same bucket). And the number of tuples in a batch may only get lower while adding batches (e.g. if a batch somewhere in the middle gets a lot of tuples with the same hash value, exceeding work_mem).

I need to think about this a bit more, but so far it seems like a good clean solution to me.

regards
Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 12.8.2014 00:30, Tomas Vondra wrote:
> On 11.8.2014 20:25, Robert Haas wrote:
>> It also strikes me that when there's only 1 batch, the set of bits that map onto the batch number is zero-width, and one zero-width bit range is as good as another. In other words, if you're only planning to do one batch, you can easily grow the number of buckets on the fly. Growing the number of buckets only becomes difficult once you have more than one batch.
>
> ...
>
> I was considering reversing the bits of the hash, because that's pretty much the simplest solution. But I think you're right it might actually work like this:
>
> * Are more batches needed?
>
>   (yes) => just use nbuckets = my_log2(work_mem / tuple_size)
>
>   (no) => go ahead, until processing all tuples or hitting work_mem
>
>          (work_mem) => meh, use the same nbuckets above
>
>          (all tuples) => compute optimal nbuckets / resize
>
> But I need to think about this a bit. So far it seems to me there's no way additional batches might benefit from increasing nbuckets further.

I think this is a simple and solid solution, solving the batchno computation issues quite nicely. Attached is the v10 patch (bare and combined with the dense allocation), which does this:

1) when we know we'll need batching, buckets are sized for full work_mem (using the estimated tuple width, etc.)

2) without the batching, we estimate the 'right number of buckets' for the estimated number of tuples, and keep track of the optimal number as tuples are added to the hash table

   - if we discover we need to start batching, we keep the current optimal value (which should be the same as the max number of buckets) and don't mess with it anymore (making it possible to compute batch IDs just like before)

   - also, on the first rebatch (nbatch=1 => nbatch=2) the hash table is resized as part of the rebatch

   - if the hash build completes without batching, we do the resize

I believe the patch is pretty much perfect. I plan to do more thorough testing on a wide range of queries in the next few days.

I also removed the 'enable_hash_resize' GUC, because it would be more complex to implement this properly after doing the resize as part of the rebatch etc. So either it would make the patch more complex, or it wouldn't do what the name promises.

regards
Tomas
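[The two cases in 1) and 2) boil down to a short decision sketch; the helper names, the boolean flag and the example numbers are invented for illustration, only the decision itself follows the description above (61735 is the estimated row count from the plans earlier in the thread).]

/* sketch of the v10 sizing decision; illustrative names, not the patch code */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* smallest power of two >= n */
static uint64_t
next_pow2(uint64_t n)
{
    uint64_t    p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

static uint64_t
initial_nbuckets(uint64_t est_tuples, uint64_t tupsize,
                 uint64_t work_mem, bool batching_expected)
{
    if (batching_expected)
    {
        /* case 1): size the buckets for a full work_mem worth of tuples */
        return next_pow2(work_mem / tupsize);
    }

    /* case 2): size for the estimate; the optimum is tracked during the
     * build and the table is resized at the end (or on the first rebatch) */
    return next_pow2(est_tuples);
}

int
main(void)
{
    uint64_t    work_mem = 16ULL * 1024 * 1024;     /* assumed 16MB budget */
    uint64_t    tupsize = 48;                       /* assumed tuple size */

    printf("batching expected: %llu buckets\n",
           (unsigned long long) initial_nbuckets(61735, tupsize, work_mem, true));
    printf("single-batch est.: %llu buckets\n",
           (unsigned long long) initial_nbuckets(61735, tupsize, work_mem, false));
    return 0;
}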
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 12.8.2014 00:30, Tomas Vondra wrote:
>> On 11.8.2014 20:25, Robert Haas wrote:
>>> It also strikes me that when there's only 1 batch, the set of bits that map onto the batch number is zero-width, and one zero-width bit range is as good as another. In other words, if you're only planning to do one batch, you can easily grow the number of buckets on the fly. Growing the number of buckets only becomes difficult once you have more than one batch.
>>
>> ...
>>
>> I was considering reversing the bits of the hash, because that's pretty much the simplest solution. But I think you're right it might actually work like this:
>>
>> * Are more batches needed?
>>
>>   (yes) => just use nbuckets = my_log2(work_mem / tuple_size)
>>
>>   (no) => go ahead, until processing all tuples or hitting work_mem
>>
>>          (work_mem) => meh, use the same nbuckets above
>>
>>          (all tuples) => compute optimal nbuckets / resize
>>
>> But I need to think about this a bit. So far it seems to me there's no way additional batches might benefit from increasing nbuckets further.
>
> I think this is a simple and solid solution, solving the batchno computation issues quite nicely. Attached is the v10 patch (bare and combined with the dense allocation), which does this:
>
> 1) when we know we'll need batching, buckets are sized for full work_mem (using the estimated tuple width, etc.)
>
> 2) without the batching, we estimate the 'right number of buckets' for the estimated number of tuples, and keep track of the optimal number as tuples are added to the hash table
>
>    - if we discover we need to start batching, we keep the current optimal value (which should be the same as the max number of buckets) and don't mess with it anymore (making it possible to compute batch IDs just like before)
>
>    - also, on the first rebatch (nbatch=1 => nbatch=2) the hash table is resized as part of the rebatch
>
>    - if the hash build completes without batching, we do the resize
>
> I believe the patch is pretty much perfect. I plan to do more thorough testing on a wide range of queries in the next few days.
>
> I also removed the 'enable_hash_resize' GUC, because it would be more complex to implement this properly after doing the resize as part of the rebatch etc. So either it would make the patch more complex, or it wouldn't do what the name promises.

A variety of trivial comments on this:

PostgreSQL style is un-cuddled curly braces. Also, multi-line comments need to start with a line containing only /* and end with a line containing only */. In one place you've added curly braces around a single-line block that is otherwise unmodified; please don't do that. In one place, you have "becase" instead of "because". In another place, you write "add if after it" but it should say "add it after it" or maybe better "add the new one after it". Avoid using punctuation like "=>" in comments to illustrate the connection between sentences; instead, use a connecting word like "then" or "therefore" or whatever is appropriate; in this instance, a period followed by the start of a new sentence seems sufficient. Revert the removal of a single line of whitespace near the top of nodeHash.c.

There are too many things marked XXX in this patch. They should either be fixed, if they are real problems, or they should be commented in a way that doesn't give rise to the idea that they're problems if they aren't.

OK, now on to some more substantive stuff:

1. It's not clear to me what the overall effect of this patch on memory utilization is. Reducing NTUP_PER_BUCKET from 10 to 1 is going to use, on the average, 10x as much bucket-header memory per tuple. Specifically, I think it means we'll use about 8 bytes of bucket-header memory per tuple instead of 0.8 bytes per tuple. If the tuples are narrow, that could be significant; concerns have been expressed about that here in the past. Increasing the number of buckets could also increase memory usage. On the other hand, the dense allocation stuff probably saves a ton of memory, so maybe we end up ahead overall, but I'm not sure. Your thoughts, and maybe some test results with narrow and wide tuples, would be appreciated.

2. But, on the positive side, modulo the memory utilization questions mentioned above, I would expect the impact on hash join performance to be positive. Going from 10 tuples per bucket to just 1 should help, and on cases where the actual load factor would have ended up much higher because of poor estimation, increasing the number of buckets on the fly should help even more. I haven't tested this, though.

I haven't had a chance to completely go through this yet, so these are just some initial thoughts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 19.8.2014 19:05, Robert Haas wrote: > On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra <tv@fuzzy.cz> wrote: >> On 12.8.2014 00:30, Tomas Vondra wrote: >>> On 11.8.2014 20:25, Robert Haas wrote: >>>> It also strikes me that when there's only 1 batch, the set of bits >>>> that map onto the batch number is zero-width, and one zero-width bit >>>> range is as good as another. In other words, if you're only planning >>>> to do one batch, you can easily grow the number of buckets on the fly. >>>> Growing the number of buckets only becomes difficult once you have >>>> more than one batch. >> >> ... >> >>> I was considering using reversing the bits of the hash, because that's >>> pretty much the simplest solution. But I think you're right it might >>> actually work like this: >>> >>> * Are more batches needed? >>> >>> (yes) => just use nbuckets = my_log2(work_mem / tuple_size) >>> >>> (no) => go ahead, until processing all tuples or hitting work_mem >>> >>> (work_mem) => meh, use the same nbuckets above >>> >>> (all tuples) => compute optimal nbuckets / resize >>> >>> >>> But I need to think about this a bit. So far it seems to me there's no >>> way additional batches might benefit from increasing nbuckets further. >> >> I think this is a simple and solid solution, solving the batchno >> computation issues quite nicely. Attached is v10 patch (bare and >> combined with the dense allocation), that does this: >> >> 1) when we know we'll need batching, buckets are sized for full work_mem >> (using the estimated tuple width, etc.) >> >> 2) without the batching, we estimate the 'right number of buckets' for >> the estimated number of tuples, and keep track of the optimal number >> as tuples are added to the hash table >> >> - if we discover we need to start batching, we keep the current >> optimal value (which should be the same as the max number of >> buckets) and don't mess with it anymore (making it possible to >> compute batch IDs just like before) >> >> - also, on the fist rebatch (nbatch=1 => nbatch=2) the hash table >> is resized as part of the rebatch >> >> - if the hash build completes without batching, we do the resize >> >> I believe the patch is pretty much perfect. I plan to do more thorough >> testing on a wide range of queries in the next few days. >> >> I also removed the 'enable_hash_resize' GUC, because it would be more >> complex to implement this properly after doing the resize as part of >> rebatch etc.. So either it would make the patch more complex, or it >> wouldn't do what the name promises. > > A variety of trivial comments on this: > > PostgreSQL style is un-cuddled curly braces. Also, multi-line > comments need to start with a line containing only /* and end with a > line containing only */. In one place you've added curly braces > around a single-line block that is otherwise unmodified; please don't > do that. In one place, you have "becase" instead of "because". In > another place, you write "add if after it" but it should say "add it > after it" or maybe better "add the new one after it". Avoid using > punctuation like "=>" in comments to illustrate the connection between > sentences; instead, use a connecting word like "then" or "therefore" > or whatever is appropriate; in this instance, a period followed by the > start of a new sentence seems sufficient. Revert the removal of a > single line of whitespace near the top of nodeHash.c. > > There are too many things marked XXX in this patch. 
They should > either be fixed, if they are real problems, or they should be > commented in a way that doesn't give rise to the idea that they're > problems if they aren't. OK, thanks for pointing this out. Attached is v11 of the patch (both separate and combined with the dense allocation, as before). I fixed as many of those issues as possible. All the XXX items were obsolete, except for one in the chunk_alloc function. I have also removed one constant > > OK, now on to some more substantive stuff: > > 1. It's not clear to me what the overall effect of this patch on > memory utilization is. Reducing NTUP_PER_BUCKET from 10 to 1 is going > to use, on the average, 10x as much bucket-header memory per tuple. > Specifically, I think it means we'll use about 8 bytes of > bucket-header memory per tuple instead of 0.8 bytes per tuple. If the > tuples are narrow, that could be significant; concerns have been > expressed about that here in the past. Increasing the number of > buckets could also increase memory usage. On the other hand, the > dense allocation stuff probably saves a ton of memory, so maybe we end > up overall, but I'm not sure. Your thoughts, and maybe some test > results with narrow and wide tuples, would be appreciated. The effect of the dense allocation was briefly discussed in this thread, along with some quick measurements: http://www.postgresql.org/message-id/53BEEA9E.2080009@fuzzy.cz The dense allocation removes pretty much all the palloc overhead. For a 40B tuple, I did get this before the dense allocation HashBatchContext: 1451221040 total in 182 blocks; 2826592 free (11 chunks); 1448394448 used and this with the dense allocation patch applied HashBatchContext: 729651520 total in 21980 blocks; 0 free (0 chunks); 729651520 used So it's pretty much 50% reduction in memory consumption. Now, the palloc header is >=16B, and removing this more than compensates the increased number of buckets (in the worst case, there may be ~2x buckets per tuple, which is 2x8B pointers). I did a number of tests, and the results are quite consistent with this. > 2. But, on the positive side, modulo the memory utilization questions > mentioned above, I would expect the impact on hash join performance to > be positive. Going from 10 tuples per bucket to just 1 should help, > and on cases where the actual load factor would have ended up much > higher because of poor estimation, increasing the number of buckets on > the fly should help even more. I haven't tested this, though. Yes, that's the results I was getting. I've done a number of tests, and not a single one showed a slowdown so far. Most well-estimated queries were 2-3x faster, and the poorly estimated ones were orders of magnitude faster. We're working on getting this fixed on a range of workloads, I'll post some results once I have that. > I haven't had a chance to completely go through this yet, so these are > just some initial thoughts. Thanks! Tomas
Attachment
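To make the "pretty much 50% reduction" above concrete, here is a standalone sketch of the per-tuple overhead being removed; the 16-byte chunk header and power-of-two rounding are simplifying assumptions about the allocator, not exact aset.c numbers, and the exact ratio depends on how close the tuple size is to a power of two:

    #include <stddef.h>
    #include <stdio.h>

    #define PALLOC_HEADER 16            /* assumed per-allocation bookkeeping */

    static size_t
    next_pow2(size_t n)
    {
        size_t  p = 8;

        while (p < n)
            p <<= 1;
        return p;
    }

    int
    main(void)
    {
        size_t  tuple = 40 + 32;        /* 40B of data plus assumed tuple headers */
        size_t  palloced = PALLOC_HEADER + next_pow2(tuple);
        size_t  dense = tuple;          /* packed back to back into a shared chunk */

        printf("per tuple, palloc'd separately: %zu bytes\n", palloced);
        printf("per tuple, dense-packed:        %zu bytes\n", dense);
        return 0;
    }

With these assumptions the separately palloc'd tuple costs twice as much as the dense-packed one, which is in the same ballpark as the HashBatchContext numbers quoted above.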
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
Hi everyone, as Heikki mentioned in his "commitfest status" message, this patch still has no reviewers :-( Is there anyone willing to pick that up, together with the 'dense allocation' patch (as those two are closely related)? I'm looking in Robert's direction, as he's the one who came up with the dense allocation idea, and also commented on the hashjoin bucket resizing patch ... I'd ask Pavel Stehule, but as he's sitting next to me in the office he's not really independent ;-) I'll do some reviews on the other patches over the weekend, to balance this of course. regards Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > as Heikki mentioned in his "commitfest status" message, this patch still > has no reviewers :-( Is there anyone willing to pick up that, together > with the 'dense allocation' patch (as those two are closely related)? > > I'm looking in Robert's direction, as he's the one who came up with the > dense allocation idea, and also commented on the hasjoin bucket resizing > patch ... > > I'd ask Pavel Stehule, but as he's sitting next to me in the office he's > not really independent ;-) I'll do some reviews on the other patches > over the weekend, to balance this of course. Will any of those patches happen to be, heh heh, mine? I am a bit confused by the relationship between the two patches you posted. The "combined" patch applies, but the other one does not. If this is a sequence of two patches, it seems like it would be more helpful to post A and B rather than B and A+B. Perhaps the other patch is on a different thread but there's not an obvious reference to said thread here. In ExecChooseHashTableSize(), if buckets_bytes is independent of nbatch, could we just compute it before working out dbatch, and just deduct it from hash_table_bytes? That seems like it'd do the same thing for less work. Either way, what happens if buckets_bytes is already bigger than hash_table_bytes? A few more stylistic issues that I see: + if ((hashtable->nbatch == 1) && + if (hashtable->nbuckets_optimal <= (INT_MAX/2)) + if (size > (HASH_CHUNK_SIZE/8)) While I'm all in favor of using parentheses generously where it's needed to add clarity, these and similar usages seem over the top to me. On a related note, the first of these reads like this if (stuff) { if (more stuff) { do things } } which makes one wonder why we need two if statements. + + /* used for dense allocation of tuples (into linked chunks) */ + HashChunk chunks_batch; /* one list for the whole batch */ +} HashJoinTableData; If the original coding didn't have a blank line between the last structure member and the closing brace, it's probably not right to make it that way now. There are similar issues at the end of some functions you modified, and a few other places (like the new code in ExecChooseHashTableSize and chunk_alloc) where there are extra blank lines at the starts and ends of blocks. +char * chunk_alloc(HashJoinTable hashtable, int size) { Eh... no. + /* XXX maybe we should use MAXALIGN(size) here ... */ +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 8.9.2014 22:44, Robert Haas wrote: > On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> as Heikki mentioned in his "commitfest status" message, this patch >> still has no reviewers :-( Is there anyone willing to pick up that, >> together with the 'dense allocation' patch (as those two are >> closely related)? >> >> I'm looking in Robert's direction, as he's the one who came up with >> the dense allocation idea, and also commented on the hasjoin bucket >> resizing patch ... >> >> I'd ask Pavel Stehule, but as he's sitting next to me in the office >> he's not really independent ;-) I'll do some reviews on the other >> patches over the weekend, to balance this of course. > > Will any of those patches to be, heh heh, mine? I'll exchange some of the credit for +1. Deal? ;-) > I am a bit confused by the relationship between the two patches you > posted. The "combined" patch applies, but the other one does not. If > this is a sequence of two patches, it seems like it would be more > helpful to post A and B rather than B and A+B. Perhaps the other > patch is on a different thread but there's not an obvious reference > to said thread here. Yeah, it's probably a bit confusing. The thing is the "dense allocation" idea was discussed in a different thread, so I posted the patch there: http://www.postgresql.org/message-id/53CBEB8A.207@fuzzy.cz The patch tweaking hash join buckets builds on the dense allocation, because it (a) makes up for the memory used for more buckets and (b) it's actually easier to rebuild the buckets this way. So I only posted the separate patch for those who want to do a review, and then a "complete patch" with both parts combined. But it sure may be a bit confusing. > In ExecChooseHashTableSize(), if buckets_bytes is independent of > nbatch, could we just compute it before working out dbatch, and just > deduct it from hash_table_bytes? That seems like it'd do the same > thing for less work. Well, we can do that. But I really don't think that's going to make measurable difference. And I think the code is more clear this way. > Either way, what happens if buckets_bytes is already bigger than > hash_table_bytes? Hm, I don't see how that could happen. FWIW, I think the first buckets_bytes formula is actually wrong: buckets_bytes = sizeof(HashJoinTuple) * my_log2(ntuples / NTUP_PER_BUCKET); and should be buckets_bytes = sizeof(HashJoinTuple) * (1 << my_log2(ntuples / NTUP_PER_BUCKET)); instead. Also, we might consider that we never use less than 1024 buckets (but that's minor issue, I guess). But once we get into the "need batching" branch, we do this: lbuckets = 1 << my_log2(hash_table_bytes / (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple))); which includes both the bucket (pointer) and tuple size, and IMHO guarantees that bucket_bytes will never be over work_mem (which is what hash_table_bytes is). The only case where this might happen is (tupsize < 8), thanks to the my_log2 (getting (50% work_mem + epsilon), doubled to >100% work_mem). But tupsize is defined as this: tupsize = HJTUPLE_OVERHEAD + MAXALIGN(sizeof(MinimalTupleData)) + MAXALIGN(tupwidth); and HJTUPLE_OVERHEAD alone is 12B, MinimalTupleData is >11B (ignoring alignment) etc. So I believe this is safe ... 
> A few more stylistic issues that I see: > > + if ((hashtable->nbatch == 1) && > + if (hashtable->nbuckets_optimal <= (INT_MAX/2)) > + if (size > (HASH_CHUNK_SIZE/8)) > > While I'm all in favor of using parentheses generously where it's > needed to add clarity, these and similar usages seem over the top to > me. It seemed more readable for me at the time of coding it, and it still seems better this way, but ... https://www.youtube.com/watch?v=CxK_nA2iVXw > On a related note, the first of these reads like this if (stuff) { if > (more stuff) { do things } } which makes one wonder why we need two if > statements. We probably don't ... > > + > + /* used for dense allocation of tuples (into linked chunks) */ > + HashChunk chunks_batch; /* one list for the whole batch */ > + > } HashJoinTableData; > > If the original coding didn't have a blank line between the last > structure member and the closing brace, it's probably not right to > make it that way now. There are similar issues at the end of some > functions you modified, and a few other places (like the new code in > ExecChooseHashTableSize and chunk_alloc) where there are extra blank > lines at the starts and ends of blocks. I'll fix that. FWIW, those issues seem to be in the 'dense allocation' patch. > > +char * chunk_alloc(HashJoinTable hashtable, int size) { > > Eh... no. > > + /* XXX maybe we should use MAXALIGN(size) here ... */ > > +1. I'm not sure what the 'no' is pointing at? Maybe that the parenthesis should be on the next line? And is the +1 about doing MAXALIGN? regards Tomas
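A standalone sketch contrasting the two buckets_bytes formulas discussed above; my_log2() below mirrors the ceiling-log2 helper used for bucket counts, and the 8-byte pointer size is an assumption for 64-bit builds:

    #include <stdio.h>

    static int
    my_log2(long num)
    {
        int     i;
        long    limit;

        for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
            ;
        return i;
    }

    int
    main(void)
    {
        long    ntuples = 1000000;
        long    nwanted = ntuples / 1;      /* NTUP_PER_BUCKET = 1 */
        size_t  ptr = 8;                    /* assumed sizeof(HashJoinTuple) */

        /* first formula: multiplies by the exponent, not the bucket count */
        printf("exponent only: %zu bytes\n", ptr * my_log2(nwanted));
        /* corrected formula: multiplies by the rounded-up bucket count */
        printf("bucket count:  %zu bytes\n", ptr * ((size_t) 1 << my_log2(nwanted)));
        return 0;
    }

For a million tuples the first version charges only 160 bytes for the bucket array, while the corrected one charges about 8MB, which is the kind of difference that matters when deciding whether to batch.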
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > So I only posted the separate patch for those who want to do a review, > and then a "complete patch" with both parts combined. But it sure may be > a bit confusing. Let's do this: post each new version of the patches only on this thread, as a series of patches that can be applied one after another. >> In ExecChooseHashTableSize(), if buckets_bytes is independent of >> nbatch, could we just compute it before working out dbatch, and just >> deduct it from hash_table_bytes? That seems like it'd do the same >> thing for less work. > > Well, we can do that. But I really don't think that's going to make > measurable difference. And I think the code is more clear this way. Really? It seems to me that you're doing more or less the same calculation that's already been done a second time. It took me 20 minutes of staring to figure out what it was really doing. Now admittedly I was a little low on caffeine, but... >> Either way, what happens if buckets_bytes is already bigger than >> hash_table_bytes? > > Hm, I don't see how that could happen. How about an Assert() and a comment, then? >> A few more stylistic issues that I see: >> >> + if ((hashtable->nbatch == 1) && >> + if (hashtable->nbuckets_optimal <= (INT_MAX/2)) >> + if (size > (HASH_CHUNK_SIZE/8)) >> >> While I'm all in favor of using parentheses generously where it's >> needed to add clarity, these and similar usages seem over the top to >> me. > > It seemed more readable for me at the time of coding it, and it still > seems better this way, but ... > > https://www.youtube.com/watch?v=CxK_nA2iVXw Heh. Well, at any rate, I think PostgreSQL style is to not use parentheses in that situation. >> +char * chunk_alloc(HashJoinTable hashtable, int size) { >> >> Eh... no. >> >> + /* XXX maybe we should use MAXALIGN(size) here ... */ >> >> +1. > > I'm not sure what's the 'no' pointing at? Maybe that the parenthesis > should be on the next line? And is the +1 about doing MAXALING? The "no" is about the fact that you have the return type, the function name, and the opening brace on one line instead of three lines as is project style. The +1 is for applying MAXALIGN to the size. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
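For reference, the same declaration laid out the way the project style described above asks for (return type, function name, and opening brace each on their own line), with MAXALIGN applied to the request. The body is a placeholder sketch, not the patch's actual code, and it assumes the usual nodeHash.c context where HashJoinTable and MAXALIGN are already in scope:

    char *
    chunk_alloc(HashJoinTable hashtable, int size)
    {
        /* keep stored tuples properly aligned */
        size = MAXALIGN(size);

        /* ... carve the space out of the current chunk, or start a new one ... */
        return NULL;    /* placeholder only */
    }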
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 9.9.2014 16:09, Robert Haas wrote: > On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> So I only posted the separate patch for those who want to do a review, >> and then a "complete patch" with both parts combined. But it sure may be >> a bit confusing. > > Let's do this: post each new version of the patches only on this > thread, as a series of patches that can be applied one after another. OK, attached. Apply in this order 1) dense-alloc-v5.patch 2) hashjoin-nbuckets-v12.patch >>> In ExecChooseHashTableSize(), if buckets_bytes is independent of >>> nbatch, could we just compute it before working out dbatch, and just >>> deduct it from hash_table_bytes? That seems like it'd do the same >>> thing for less work. >> >> Well, we can do that. But I really don't think that's going to make >> measurable difference. And I think the code is more clear this way. > > Really? It seems to me that you're doing more or less the same > calculation that's already been done a second time. It took me 20 > minutes of staring to figure out what it was really doing. Now > admittedly I was a little low on caffeine, but... I reworked this part a bit, to make it easier to understand. Mostly following your recommendations. > >>> Either way, what happens if buckets_bytes is already bigger than >>> hash_table_bytes? >> >> Hm, I don't see how that could happen. > > How about an Assert() and a comment, then? Done. According to the comment, we should never really get over 25%, except maybe for strange work_mem values, because we're keeping nbuckets as 2^N. But even then we should not reach 50%, so I added this assert: Assert(buckets_bytes <= hash_table_bytes/2); And then we subtract the buckets_bytes, and continue with the loop. > >>> A few more stylistic issues that I see: >>> >>> + if ((hashtable->nbatch == 1) && >>> + if (hashtable->nbuckets_optimal <= (INT_MAX/2)) >>> + if (size > (HASH_CHUNK_SIZE/8)) >>> >>> While I'm all in favor of using parentheses generously where it's >>> needed to add clarity, these and similar usages seem over the top to >>> me. >> >> It seemed more readable for me at the time of coding it, and it still >> seems better this way, but ... >> >> https://www.youtube.com/watch?v=CxK_nA2iVXw > > Heh. Well, at any rate, I think PostgreSQL style is to not use > parentheses in that situation. FWIW, I added HASH_CHUNK_THRESHOLD macro, specifying the threshold. I also simplified the conditions a bit, and removed some of the parentheses. > >>> +char * chunk_alloc(HashJoinTable hashtable, int size) { >>> >>> Eh... no. >>> >>> + /* XXX maybe we should use MAXALIGN(size) here ... */ >>> >>> +1. >> >> I'm not sure what's the 'no' pointing at? Maybe that the parenthesis >> should be on the next line? And is the +1 about doing MAXALING? > > The "no" is about the fact that you have the return type, the function > name, and the opening brace on one line instead of three lines as is > project style. The +1 is for applying MAXALIGN to the size. OK, fixed. I also did a few 'minor' changes to the dense allocation patch, most notably: * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData The original naming seemed a bit awkward. * renamed the function to 'dense_alloc' (instead of 'chunk_alloc') Seems like a better fit to what the function does. * the chunks size is 32kB (instead of 16kB), and we're using 1/4 threshold for 'oversized' items We need the threshold to be >=8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. 
I also considered Heikki's suggestion that while rebatching, we can reuse chunks with a single large tuple, instead of copying it needlessly. My attempt however made the code much uglier (I almost used a GOTO, but my hands refused to type it ...). I'll look into that. Tomas
Attachment
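For readers who have not seen the patch, a self-contained sketch of the allocation scheme described above: 32kB chunks, with items larger than a quarter of a chunk getting a dedicated, exactly-sized chunk. It uses plain malloc and a file-level list instead of the hash table's memory context, so it only illustrates the bookkeeping, not the actual dense_alloc():

    #include <stddef.h>
    #include <stdlib.h>

    #define CHUNK_SIZE       (32 * 1024)
    #define CHUNK_THRESHOLD  (CHUNK_SIZE / 4)

    typedef struct Chunk
    {
        struct Chunk *next;         /* linked list of chunks for this batch */
        size_t        used;         /* bytes already handed out from data[] */
        size_t        maxlen;       /* usable bytes in data[] */
        char          data[];       /* tuples are packed densely in here */
    } Chunk;

    static Chunk *chunks = NULL;    /* head is always the chunk being filled */

    static void *
    dense_alloc(size_t size)        /* malloc failures not handled in this sketch */
    {
        Chunk  *chunk;

        /* oversized items get a dedicated, exactly-sized chunk */
        if (size > CHUNK_THRESHOLD)
        {
            chunk = malloc(offsetof(Chunk, data) + size);
            chunk->used = chunk->maxlen = size;

            /* keep the partially filled chunk at the head, add the new one after it */
            if (chunks != NULL)
            {
                chunk->next = chunks->next;
                chunks->next = chunk;
            }
            else
            {
                chunk->next = NULL;
                chunks = chunk;
            }
            return chunk->data;
        }

        /* start a new shared chunk if the current one has no room left */
        if (chunks == NULL || chunks->used + size > chunks->maxlen)
        {
            chunk = malloc(offsetof(Chunk, data) + CHUNK_SIZE);
            chunk->used = 0;
            chunk->maxlen = CHUNK_SIZE;
            chunk->next = chunks;
            chunks = chunk;
        }

        chunks->used += size;
        return chunks->data + chunks->used - size;
    }

A caller would memcpy the (MAXALIGNed) tuple into the returned pointer; freeing the whole batch is then a walk over the chunk list rather than one pfree per tuple.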
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Heikki Linnakangas
Date:
On 09/10/2014 01:49 AM, Tomas Vondra wrote: > On 9.9.2014 16:09, Robert Haas wrote: >> On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>> So I only posted the separate patch for those who want to do a review, >>> and then a "complete patch" with both parts combined. But it sure may be >>> a bit confusing. >> >> Let's do this: post each new version of the patches only on this >> thread, as a series of patches that can be applied one after another. > > OK, attached. Apply in this order > > 1) dense-alloc-v5.patch > 2) hashjoin-nbuckets-v12.patch The dense-alloc-v5.patch looks good to me. I have committed that with minor cleanup (more comments below). I have not looked at the second patch. > I also did a few 'minor' changes to the dense allocation patch, most > notably: > > * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData > The original naming seemed a bit awkward. That's too easy to confuse with regular MemoryContext and AllocChunk stuff. I renamed it to HashMemoryChunk. > * the chunks size is 32kB (instead of 16kB), and we're using 1/4 > threshold for 'oversized' items > > We need the threshold to be >=8kB, to trigger the special case > within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. > I also considered Heikki's suggestion that while rebatching, we can > reuse chunks with a single large tuple, instead of copying it > needlessly. My attempt however made the code much uglier (I almost used > a GOTO, but my hands rejected to type it ...). I'll look into that. I tried constructing a test case where the rehashing time would be significant enough for this to matter, but I couldn't find one. Even if the original estimate on the number of batches is way too small, we ramp up quickly to a reasonable size and the rehashing time is completely insignificant compared to all the other work. So I think we can drop that idea. - Heikki
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > The dense-alloc-v5.patch looks good to me. I have committed that with minor > cleanup (more comments below). I have not looked at the second patch. Gah. I was in the middle of doing this. Sigh. >> * the chunks size is 32kB (instead of 16kB), and we're using 1/4 >> threshold for 'oversized' items >> >> We need the threshold to be >=8kB, to trigger the special case >> within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. > > Should we care about the fact that if there are only a few tuples, we will > nevertheless waste 32kB of memory for the chunk? I guess not, but I thought > I'd mention it. The smallest allowed value for work_mem is 64kB. I think we should change the threshold here to 1/8th. The worst case memory wastage as-is ~32k/5 > 6k. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Heikki Linnakangas
Date:
On 09/10/2014 09:31 PM, Robert Haas wrote: > On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> The dense-alloc-v5.patch looks good to me. I have committed that with minor >> cleanup (more comments below). I have not looked at the second patch. > > Gah. I was in the middle of doing this. Sigh. Sorry. >>> * the chunks size is 32kB (instead of 16kB), and we're using 1/4 >>> threshold for 'oversized' items >>> >>> We need the threshold to be >=8kB, to trigger the special case >>> within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. >> >> Should we care about the fact that if there are only a few tuples, we will >> nevertheless waste 32kB of memory for the chunk? I guess not, but I thought >> I'd mention it. The smallest allowed value for work_mem is 64kB. > > I think we should change the threshold here to 1/8th. The worst case > memory wastage as-is ~32k/5 > 6k. Not sure how you arrived at that number. The worst case is that each 32k chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each chunk. That's 25% wastage. That's not too bad IMHO - the worst case waste of AllocSet is 50%. But if you want to twiddle it, feel free. Note that there's some interplay with allocset code, like Tomas mentioned. If the threshold is < 8k, palloc() will round up request sizes smaller than 8k anyway. That wastes some memory, nothing more serious than that, but it seems like a good idea to avoid it. - Heikki
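A quick standalone check of the worst-case figure above, with every 32kB chunk filled to just past the 24kB point so that no further sub-8kB item fits:

    #include <stdio.h>

    int
    main(void)
    {
        double  chunk = 32 * 1024;
        double  filled = 24 * 1024 + 1;     /* just past the 3/4 mark */
        double  wasted = chunk - filled;

        printf("wasted per chunk: %.0f bytes (%.1f%%)\n",
               wasted, 100.0 * wasted / chunk);
        return 0;
    }

That prints 8191 bytes, i.e. the 25% bound mentioned above.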
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 10.9.2014 20:25, Heikki Linnakangas wrote: > On 09/10/2014 01:49 AM, Tomas Vondra wrote: >> I also did a few 'minor' changes to the dense allocation patch, most >> notably: >> >> * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData >> The original naming seemed a bit awkward. > > That's too easy to confuse with regular MemoryContext and AllocChunk > stuff. I renamed it to HashMemoryChunk. OK. > >> * the chunks size is 32kB (instead of 16kB), and we're using 1/4 >> threshold for 'oversized' items >> >> We need the threshold to be >=8kB, to trigger the special case >> within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. > > Should we care about the fact that if there are only a few tuples, we > will nevertheless waste 32kB of memory for the chunk? I guess not, but I > thought I'd mention it. The smallest allowed value for work_mem is 64kB. I don't think that's a problem. >> I also considered Heikki's suggestion that while rebatching, we can >> reuse chunks with a single large tuple, instead of copying it >> needlessly. My attempt however made the code much uglier (I almost used >> a GOTO, but my hands rejected to type it ...). I'll look into that. > > I tried constructing a test case where the rehashing time would be > significant enough for this to matter, but I couldn't find one. Even > if the original estimate on the number of batches is way too small, > we ramp up quickly to a reasonable size and the rehashing time is > completely insignificant compared to all the other work. So I think > we can drop that idea. I don't think that had anything to do with rehashing. What you pointed out is that when rebatching, we have to keep ~50% of the tuples, which means 'copy into a new chunk, then throw away the old ones'. For large tuples (you mentioned 100MB tuples, IIRC), we needlessly allocate this amount of memory only to copy the single tuple and then throw away the old tuple. So (a) that's an additional memcpy, and (b) it allocates additional 100MB of memory for a short period of time. Now, I'd guess when dealing with tuples this large, there will be more serious trouble elsewhere, but I don't want to neglect it. Maybe it's worth optimizing? Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 10.9.2014 20:31, Robert Haas wrote: > On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> The dense-alloc-v5.patch looks good to me. I have committed that with minor >> cleanup (more comments below). I have not looked at the second patch. > > Gah. I was in the middle of doing this. Sigh. > >>> * the chunks size is 32kB (instead of 16kB), and we're using 1/4 >>> threshold for 'oversized' items >>> >>> We need the threshold to be >=8kB, to trigger the special case >>> within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. >> >> Should we care about the fact that if there are only a few tuples, we will >> nevertheless waste 32kB of memory for the chunk? I guess not, but I thought >> I'd mention it. The smallest allowed value for work_mem is 64kB. > > I think we should change the threshold here to 1/8th. The worst case > memory wastage as-is ~32k/5 > 6k. So you'd lower the threshold to 4kB? That may lower the wastage in the chunks, but palloc will actually allocate 8kB anyway, wasting up to additional 4kB. So I don't see how lowering the threshold to 1/8th improves the situation ... Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Wed, Sep 10, 2014 at 3:02 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 10.9.2014 20:31, Robert Haas wrote: >> On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas >> <hlinnakangas@vmware.com> wrote: >>> The dense-alloc-v5.patch looks good to me. I have committed that with minor >>> cleanup (more comments below). I have not looked at the second patch. >> >> Gah. I was in the middle of doing this. Sigh. >> >>>> * the chunks size is 32kB (instead of 16kB), and we're using 1/4 >>>> threshold for 'oversized' items >>>> >>>> We need the threshold to be >=8kB, to trigger the special case >>>> within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. >>> >>> Should we care about the fact that if there are only a few tuples, we will >>> nevertheless waste 32kB of memory for the chunk? I guess not, but I thought >>> I'd mention it. The smallest allowed value for work_mem is 64kB. >> >> I think we should change the threshold here to 1/8th. The worst case >> memory wastage as-is ~32k/5 > 6k. > > So you'd lower the threshold to 4kB? That may lower the wastage in the > chunks, but palloc will actually allocate 8kB anyway, wasting up to > additional 4kB. So I don't see how lowering the threshold to 1/8th > improves the situation ... Ah, OK. Well, never mind then. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 10.9.2014 20:55, Heikki Linnakangas wrote: > On 09/10/2014 09:31 PM, Robert Haas wrote: >>>> * the chunks size is 32kB (instead of 16kB), and we're using 1/4 >>>> threshold for 'oversized' items >>>> >>>> We need the threshold to be >=8kB, to trigger the special case >>>> within AllocSet. The 1/4 rule is consistent with >>>> ALLOC_CHUNK_FRACTION. >>> >>> Should we care about the fact that if there are only a few tuples, we >>> will >>> nevertheless waste 32kB of memory for the chunk? I guess not, but I >>> thought >>> I'd mention it. The smallest allowed value for work_mem is 64kB. >> >> I think we should change the threshold here to 1/8th. The worst case >> memory wastage as-is ~32k/5 > 6k. > > Not sure how you arrived at that number. The worst case is that each 32k > chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each > chunk. That's 25% wastage. That's not too bad IMHO - the worst case > waste of AllocSet is 50%. > > But if you want to twiddle it, feel free. Note that there's some > interplay with allocset code, like Tomas mentioned. If the threshold is > < 8k, palloc() will round up request sizes smaller than 8k anyway. That > wastes some memory, nothing more serious than that, but it seems like a > good idea to avoid it. Yes, and pfree then keeps these blocks, which means a chance of keeping memory that won't be reused. That's why I chose the 8kB threshold. So I'd keep the 32kB/8kB, as proposed in the patch. But I don't see this as a big issue - my experience is that we either have tuples much smaller than 4kB, or much larger than 8kB. And even for the tuples between 4kB and 8kB, the waste is not that bad. regards Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 10.9.2014 20:25, Heikki Linnakangas wrote: > On 09/10/2014 01:49 AM, Tomas Vondra wrote: >> I also did a few 'minor' changes to the dense allocation patch, most >> notably: >> >> * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData >> The original naming seemed a bit awkward. > > That's too easy to confuse with regular MemoryContext and AllocChunk > stuff. I renamed it to HashMemoryChunk. BTW this breaks the second patch, which is allocating new chunks when resizing the hash table. Should I rebase the patch, or can the committer do s/MemoryChunk/HashMemoryChunk/ ? Assuming there are no more fixes needed, of course. Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 10.9.2014 20:25, Heikki Linnakangas wrote: >> On 09/10/2014 01:49 AM, Tomas Vondra wrote: >>> I also did a few 'minor' changes to the dense allocation patch, most >>> notably: >>> >>> * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData >>> The original naming seemed a bit awkward. >> >> That's too easy to confuse with regular MemoryContext and AllocChunk >> stuff. I renamed it to HashMemoryChunk. > > BTW this breaks the second patch, which is allocating new chunks when > resizing the hash table. Should I rebase the patch, or can the commiter > do s/MemoryChunk/HashMemoryChunk/ ? > > Assuming there are no more fixes needed, of course. Rebasing it will save the committer time, which will increase the chances that one will look at your patch. So it's highly recommended. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 10.9.2014 21:34, Robert Haas wrote: > On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> On 10.9.2014 20:25, Heikki Linnakangas wrote: >>> On 09/10/2014 01:49 AM, Tomas Vondra wrote: >>>> I also did a few 'minor' changes to the dense allocation patch, most >>>> notably: >>>> >>>> * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData >>>> The original naming seemed a bit awkward. >>> >>> That's too easy to confuse with regular MemoryContext and AllocChunk >>> stuff. I renamed it to HashMemoryChunk. >> >> BTW this breaks the second patch, which is allocating new chunks when >> resizing the hash table. Should I rebase the patch, or can the commiter >> do s/MemoryChunk/HashMemoryChunk/ ? >> >> Assuming there are no more fixes needed, of course. > > Rebasing it will save the committer time, which will increase the > chances that one will look at your patch. So it's highly recommended. OK. So here's v13 of the patch, reflecting this change. regards Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > OK. So here's v13 of the patch, reflecting this change. With the exception of ExecChooseHashTableSize() and a lot of stylistic issues along the lines of what I've already complained about, this patch seems pretty good to me. It does three things: (1) It changes NTUP_PER_BUCKET to 1. Although this increases memory utilization, it also improves hash table performance, and the additional memory we use is less than what we saved as a result of commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a. (2) It changes ExecChooseHashTableSize() to include the effect of the memory consumed by the bucket array. If we care about properly respecting work_mem, this is clearly right for any NTUP_PER_BUCKET setting, but more important for a small setting like 1 than for the current value of 10. I'm not sure the logic in this function is as clean and simple as it can be just yet. (3) It allows the number of batches to increase on the fly while the hash join is in process. This case arises when we initially estimate that we only need a small hash table, and then it turns out that there are more tuples than we expect. Without this code, the hash table's load factor gets too high and things start to suck. I recommend separating this patch in two patches, the first covering items (1) and (2) and the second covering item (3), which really seems like a separate improvement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > With the exception of ExecChooseHashTableSize() and a lot of stylistic > issues along the lines of what I've already complained about, this > patch seems pretty good to me. It does three things: > ... > (3) It allows the number of batches to increase on the fly while the > hash join is in process. This case arises when we initially estimate > that we only need a small hash table, and then it turns out that there > are more tuples than we expect. Without this code, the hash table's > load factor gets too high and things start to suck. Pardon me for not having read the patch yet, but what part of (3) wasn't there already? regards, tom lane
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> With the exception of ExecChooseHashTableSize() and a lot of stylistic >> issues along the lines of what I've already complained about, this >> patch seems pretty good to me. It does three things: >> ... >> (3) It allows the number of batches to increase on the fly while the >> hash join is in process. This case arises when we initially estimate >> that we only need a small hash table, and then it turns out that there >> are more tuples than we expect. Without this code, the hash table's >> load factor gets too high and things start to suck. > > Pardon me for not having read the patch yet, but what part of (3) > wasn't there already? EINSUFFICIENTCAFFEINE. It allows the number of BUCKETS to increase, not the number of batches. As you say, the number of batches could already increase. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> (3) It allows the number of batches to increase on the fly while the >>> hash join is in process. >> Pardon me for not having read the patch yet, but what part of (3) >> wasn't there already? > EINSUFFICIENTCAFFEINE. > It allows the number of BUCKETS to increase, not the number of > batches. As you say, the number of batches could already increase. Ah. Well, that would mean that we need a heuristic for deciding when to increase the number of buckets versus the number of batches ... seems like a difficult decision. regards, tom lane
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
"Tomas Vondra"
Date:
On 11 Září 2014, 15:31, Robert Haas wrote: > On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> OK. So here's v13 of the patch, reflecting this change. > > With the exception of ExecChooseHashTableSize() and a lot of stylistic > issues along the lines of what I've already complained about, this > patch seems pretty good to me. It does three things: I believe I fixed the stylistic issues you pointed out, except maybe the bracketing (which seems fine to me, though). So if you could point the issues out, that'd be helpful. > > (1) It changes NTUP_PER_BUCKET to 1. Although this increases memory > utilization, it also improves hash table performance, and the > additional memory we use is less than what we saved as a result of > commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a. > > (2) It changes ExecChooseHashTableSize() to include the effect of the > memory consumed by the bucket array. If we care about properly > respecting work_mem, this is clearly right for any NTUP_PER_BUCKET > setting, but more important for a small setting like 1 than for the > current value of 10. I'm not sure the logic in this function is as > clean and simple as it can be just yet. I made that as clear as I was able to (based on your feedback), but I'm "stuck" as I'm familiar with the logic. That is not to say it can't be improved, but this needs to be reviewed by someone else. > > (3) It allows the number of batches to increase on the fly while the > hash join is in process. This case arises when we initially estimate > that we only need a small hash table, and then it turns out that there > are more tuples than we expect. Without this code, the hash table's > load factor gets too high and things start to suck. > > I recommend separating this patch in two patches, the first covering > items (1) and (2) and the second covering item (3), which really seems > like a separate improvement. That probably makes sense. regards Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
"Tomas Vondra"
Date:
On 11 Září 2014, 16:11, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Robert Haas <robertmhaas@gmail.com> writes: >>>> (3) It allows the number of batches to increase on the fly while the >>>> hash join is in process. > >>> Pardon me for not having read the patch yet, but what part of (3) >>> wasn't there already? > >> EINSUFFICIENTCAFFEINE. > >> It allows the number of BUCKETS to increase, not the number of >> batches. As you say, the number of batches could already increase. > > Ah. Well, that would mean that we need a heuristic for deciding when to > increase the number of buckets versus the number of batches ... seems > like a difficult decision. That's true, but that's not the aim of this patch. The patch simply increases the number of buckets if the load happens to get too high, and does not try to decide between increasing nbuckets and nbatch. It's true that we can often get better performance with more batches, but the cases I was able to inspect were caused either by (a) underestimates resulting in inappropriate nbucket count, or (b) caching effects. This patch solves (a), not even trying to fix (b). regards Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Thu, Sep 11, 2014 at 10:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Robert Haas <robertmhaas@gmail.com> writes: >>>> (3) It allows the number of batches to increase on the fly while the >>>> hash join is in process. > >>> Pardon me for not having read the patch yet, but what part of (3) >>> wasn't there already? > >> EINSUFFICIENTCAFFEINE. > >> It allows the number of BUCKETS to increase, not the number of >> batches. As you say, the number of batches could already increase. > > Ah. Well, that would mean that we need a heuristic for deciding when to > increase the number of buckets versus the number of batches ... seems > like a difficult decision. Well, the patch takes the approach of increasing the number of buckets when (1) there's only one batch and (2) the load factor exceeds NTUP_PER_BUCKET. As Tomas points out, it's just increasing the number of buckets to the exact same number which we would have allocated had we known that this many tuples would show up. Now, there is a possibility that we could lose. Let's suppose that the tuples overflow work_mem, but just barely. If we've accurately estimated how many tuples there are, the patch changes nothing: either way we're going to need two batches. But let's say we've underestimated the number of tuples by 3x. Then it could be that without the patch we'd allocate fewer buckets, never increase the number of buckets, and avoid batching; whereas with the patch, we'll discover that our tuple estimate was wrong, increase the number of buckets on the fly, and then be forced by the slightly-increased memory consumption that results from increasing the number of buckets to do two batches. That would suck. But catering to that case is basically hoping that we'll fall into a septic tank and come out smelling like a rose - that is, we're hoping that our initial poor estimate will cause us to accidentally do the right thing later. I don't think that's the way to go. It's much more important to get the case where things are bigger than we expected but still fit within work_mem right; and we're currently taking a huge run-time penalty in that case, as Tomas' results show. If we wanted to cater to the other case in the future, we could consider the opposite approach: if work_mem is exhausted, throw the bucket headers away and keep reading tuples into the dense-packed chunks added by yesterday's commit. If we again run out of work_mem, then we *definitely* need to increase the batch count. If we manage to make everything fit, then we know exactly how many bucket headers we can afford, and can use some heuristic to decide between that and using more batches. I don't think that should be a requirement for this patch, though: I think the cumulative effect of Tomas's patches is going to be a win *much* more often than it's a loss. In 100% of cases, the dense-packing stuff will make a hash table containing the same tuples significantly smaller. Except when we've radically overestimated the tuple count, the change to NTUP_PER_BUCKET will mean less time spent chasing hash chains. It does use more memory, but that's more than paid for by the first change. Both the dense-packing stuff and the changes to include bucket headers in the work_mem calculation will have the effect of making the work_mem bound on memory utilization far more accurate than it's ever been previously.
And the change to increase the number of buckets at runtime will mean that we're significantly more resilient against the planner underestimating the number of tuples. That's clearly good. The fact that we can construct borderline cases where it loses shouldn't deter us from pushing forward. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tom Lane
Date:
"Tomas Vondra" <tv@fuzzy.cz> writes: > On 11 Září 2014, 16:11, Tom Lane wrote: >> Ah. Well, that would mean that we need a heuristic for deciding when to >> increase the number of buckets versus the number of batches ... seems >> like a difficult decision. > That's true, but that's not the aim of this patch. The patch simply > increases the number of buckets if the load happens to get too high, and > does not try to decide between increasing nbuckets and nbatch. On further thought, increasing nbuckets without changing the batch boundaries would not get us out of an out-of-work_mem situation, in fact it makes memory consumption worse not better (assuming you count the bucket headers towards work_mem ;-)). So in principle, the rule seems like it ought to go "if load (defined as max bucket chain length, I imagine?) gets too high, but we are still well below work_mem, increase nbuckets; else increase nbatch". And perhaps we reset nbuckets again for the next batch, not sure. If we are dealing with an out-of-work_mem situation then only increasing nbatch would be a suitable response. Because of the risk that increasing nbuckets would itself lead to a work_mem violation, I don't think it's sane to ignore the interaction entirely, even in a first patch. regards, tom lane
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
"Tomas Vondra"
Date:
On 11 Září 2014, 17:28, Tom Lane wrote: > "Tomas Vondra" <tv@fuzzy.cz> writes: >> On 11 Září 2014, 16:11, Tom Lane wrote: >>> Ah. Well, that would mean that we need a heuristic for deciding when >>> to >>> increase the number of buckets versus the number of batches ... seems >>> like a difficult decision. > >> That's true, but that's not the aim of this patch. The patch simply >> increases the number of buckets if the load happens to get too high, and >> does not try to decide between increasing nbuckets and nbatch. > > On further thought, increasing nbuckets without changing the batch > boundaries would not get us out of an out-of-work_mem situation, in fact > it makes memory consumption worse not better (assuming you count the > bucket headers towards work_mem ;-)). Yes, the patch counts the bucket headers towards work_mem, while the original code didn't. The reasoning is that due to changing NTUP_PER_BUCKET from 10 to 1, the memory occupied by buckets is not negligible and should be accounted for. > So in principle, the rule seems like it ought to go "if load (defined as > max bucket chain length, I imagine?) gets too high, but we are still > well below work_mem, increase nbuckets; else increase nbatch". And > perhaps we reset nbuckets again for the next batch, not sure. If we > are dealing with an out-of-work_mem situation then only increasing nbatch > would be a suitable response. Almost. If we expect batching at the very beginning, we size nbuckets for "full work_mem" (see how many tuples we can get into work_mem, while not breaking the NTUP_PER_BUCKET threshold). If we expect to be fine without batching, we start with the 'right' nbuckets and track the optimal nbuckets as we go (without actually resizing the hash table). Once we hit work_mem (considering the optimal nbuckets value), we keep the value. Only at the end, we check whether (nbuckets != nbuckets_optimal) and resize the hash table if needed. Also, we keep this value for all batches (it's OK because it assumes full work_mem, and it makes the batchno evaluation trivial). So the resize happens only once and only for the first batch. > Because of the risk that increasing nbuckets would itself lead to a > work_mem violation, I don't think it's sane to ignore the interaction > entirely, even in a first patch. This possibility of increasing the number of batches is the consequence of counting the buckets towards work_mem. I believe that's the right thing to do here, to make the work_mem guarantee stronger, but it's not something this patch depends on. If this is the only concern, we can leave this out. It's also worth mentioning that the dense allocation of tuples makes a huge difference. palloc easily results in 100% overhead for narrow tuples (that is, allocating 2x the amount of needed memory). For example over here [1] is an example of a hashjoin consuming 1.4GB with work_mem=800MB. And the dense allocation patch pretty much makes this go away (as demonstrated in [1]). For wider tuples, the difference is smaller, but that's when the buckets are less important. What I'm trying to say is that it's only a matter of work_mem values. Currently, because of the weak guarantee, people tend to set the value lower than necessary. The dense allocation makes this unnecessary, allowing higher work_mem values, and thus eliminating the issue. If this is not enough, we can stop counting the buckets towards work_mem.
It'll still be more memory efficient than the old approach (as a pointer is smaller than palloc overhead), and it won't cause additional batches. But I don't like this - I think we should make work_mem a stronger guarantee. [1] http://www.postgresql.org/message-id/53BEEA9E.2080009@fuzzy.cz regards Tomas
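An outline, with made-up field names, of the nbuckets handling described above; the real code operates on HashJoinTable inside nodeHash.c and handles many details (actual bucket reallocation, re-linking tuples, batching) omitted here:

    #include <stddef.h>

    typedef struct
    {
        long    nbuckets;           /* buckets currently allocated */
        long    nbuckets_optimal;   /* what we would pick knowing the real tuple count */
        double  ntuples;
        int     nbatch;
        size_t  space_used;
        size_t  space_allowed;      /* work_mem */
    } HashTableSketch;

    static void
    note_tuple_inserted(HashTableSketch *ht, size_t tupsize)
    {
        ht->ntuples += 1;
        ht->space_used += tupsize;

        /*
         * While still in a single batch and under work_mem, keep tracking the
         * optimal bucket count (NTUP_PER_BUCKET = 1).  Once work_mem is hit,
         * the value is frozen so later batch-number computations stay stable.
         */
        if (ht->nbatch == 1 && ht->space_used <= ht->space_allowed)
        {
            while (ht->ntuples > ht->nbuckets_optimal)
                ht->nbuckets_optimal *= 2;
        }
    }

    static void
    finish_first_batch(HashTableSketch *ht)
    {
        /* the resize happens at most once: here, or as part of the first rebatch */
        if (ht->nbuckets != ht->nbuckets_optimal)
            ht->nbuckets = ht->nbuckets_optimal;    /* plus re-linking the tuples */
    }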
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 11.9.2014 16:33, Tomas Vondra wrote: > On 11 Září 2014, 15:31, Robert Haas wrote: >> On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>> OK. So here's v13 of the patch, reflecting this change. >> >> [...] It does three things: >> >> (1) It changes NTUP_PER_BUCKET to 1. Although this increases memory >> utilization, it also improves hash table performance, and the >> additional memory we use is less than what we saved as a result of >> commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a. >> >> (2) It changes ExecChooseHashTableSize() to include the effect of the >> memory consumed by the bucket array. If we care about properly >> respecting work_mem, this is clearly right for any NTUP_PER_BUCKET >> setting, but more important for a small setting like 1 than for the >> current value of 10. I'm not sure the logic in this function is as >> clean and simple as it can be just yet. >> (3) It allows the number of batches to increase on the fly while the >> hash join is in process. This case arises when we initially estimate >> that we only need a small hash table, and then it turns out that there >> are more tuples than we expect. Without this code, the hash table's >> load factor gets too high and things start to suck. >> >> I recommend separating this patch in two patches, the first covering >> items (1) and (2) and the second covering item (3), which really seems >> like a separate improvement. > > That probably makes sense. Attached is the patch split as suggested: (a) hashjoin-nbuckets-v14a-size.patch * NTUP_PER_BUCKET=1 * counting buckets towards work_mem * changes in ExecChooseHashTableSize (due to the other changes) (b) hashjoin-nbuckets-v14a-resize.patch * rest of the patch, that is ... * tracking optimal number of buckets * dynamic resize of the hash table * adding info to the EXPLAIN ANALYZE output regards Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > Attached is the patch split as suggested: > > (a) hashjoin-nbuckets-v14a-size.patch > > * NTUP_PER_BUCKET=1 > * counting buckets towards work_mem > * changes in ExecChooseHashTableSize (due to the other changes) OK, I'm going to work on this one now. I have some ideas about how to simplify this a bit and will post an update in a few hours once I see how that pans out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> Attached is the patch split as suggested: >> >> (a) hashjoin-nbuckets-v14a-size.patch >> >> * NTUP_PER_BUCKET=1 >> * counting buckets towards work_mem >> * changes in ExecChooseHashTableSize (due to the other changes) > > OK, I'm going to work on this one now. I have some ideas about how to > simplify this a bit and will post an update in a few hours once I see > how that pans out. Here's what I came up with. I believe this has the same semantics as your patch for less code, but please check, because I might be wrong. The main simplifications I made were: (1) In master, we decide whether or not to batch based on the size of the data, without knowing the number of buckets. If we want to account for the bucket overhead, that doesn't work. But in your patch, we basically computed the number of buckets we'd need for the single-batch case, then decide whether to do a single batch, and if yes, computed the same values over again with the same formula phrased differently. I revised that so that the single-batch case is calculated only once. If everything fits, we're all set, and don't need to recalculate anything. (2) In your patch, once we decide we need to batch, you set nbatch as now, and then check whether the computed value is still adequate after subtracting buckets_byte from hash_table_bytes. I revised that so that we make the initial batch estimate of dbatch by dividing inner_rel_bytes by hash_table_bytes - bucket_bytes rather than hash_table_bytes. I think that avoids the need to maybe increase nbatch again afterwards. I'm comfortable with this version if you are, but (maybe as a follow-on commit) I think we could make this even a bit smarter. If inner_rel_bytes + bucket_bytes > hash_table_bytes but inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce the number of buckets by 2x or 4x to make it fit. That would provide a bit of additional safety margin against cases where this patch might unluckily lose. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
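To make the arithmetic in points (1) and (2) easier to follow, here is a stripped-down sketch of the described flow. It is not the committed code (the real logic is in ExecChooseHashTableSize() in src/backend/executor/nodeHash.c and handles skew buckets, pointer limits and other details); the helper and constants below are simplified stand-ins, and degenerate inputs are not handled.

    /*
     * Simplified sketch of the sizing flow described above, not the actual
     * ExecChooseHashTableSize().  Skew optimization, the limits on nbuckets
     * and guards against degenerate inputs are all omitted.
     */
    #include <math.h>

    #define NTUP_PER_BUCKET 1
    #define BUCKET_PTR_SIZE ((long) sizeof(void *))   /* one pointer per bucket */

    /* round up to the next power of two (illustrative helper) */
    static long
    next_pow2(double v)
    {
        long        result = 1;

        while (result < v)
            result *= 2;
        return result;
    }

    static void
    choose_hash_table_size(double ntuples, int tupsize, long hash_table_bytes,
                           int *nbuckets, int *nbatch)
    {
        double      inner_rel_bytes = ntuples * tupsize;
        long        bucket_bytes;

        /* Work out the single-batch case exactly once. */
        *nbuckets = (int) next_pow2(ntuples / NTUP_PER_BUCKET);
        bucket_bytes = *nbuckets * BUCKET_PTR_SIZE;

        if (inner_rel_bytes + bucket_bytes <= hash_table_bytes)
        {
            *nbatch = 1;                /* data plus bucket array fits: done */
            return;
        }

        /*
         * Multiple batches: size the bucket array for a work_mem-full of
         * tuples, then estimate nbatch against the memory left over after
         * subtracting the bucket array (rather than against work_mem alone),
         * so nbatch does not have to be bumped again afterwards.
         */
        *nbuckets = (int) next_pow2((hash_table_bytes / (double) tupsize) /
                                    NTUP_PER_BUCKET);
        bucket_bytes = *nbuckets * BUCKET_PTR_SIZE;
        *nbatch = (int) next_pow2(ceil(inner_rel_bytes /
                                       (hash_table_bytes - bucket_bytes)));
    }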
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 12.9.2014 18:49, Robert Haas wrote:
> On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>> Attached is the patch split as suggested:
>>>
>>> (a) hashjoin-nbuckets-v14a-size.patch
>>>
>>> * NTUP_PER_BUCKET=1
>>> * counting buckets towards work_mem
>>> * changes in ExecChooseHashTableSize (due to the other changes)
>>
>> OK, I'm going to work on this one now. I have some ideas about how to simplify this a bit and will post an update in a few hours once I see how that pans out.
>
> Here's what I came up with. I believe this has the same semantics as your patch for less code, but please check, because I might be wrong. The main simplifications I made were:
>
> (1) In master, we decide whether or not to batch based on the size of the data, without knowing the number of buckets. If we want to account for the bucket overhead, that doesn't work. But in your patch, we basically computed the number of buckets we'd need for the single-batch case, then decide whether to do a single batch, and if yes, computed the same values over again with the same formula phrased differently. I revised that so that the single-batch case is calculated only once. If everything fits, we're all set, and don't need to recalculate anything.
>
> (2) In your patch, once we decide we need to batch, you set nbatch as now, and then check whether the computed value is still adequate after subtracting bucket_bytes from hash_table_bytes. I revised that so that we make the initial batch estimate of dbatch by dividing inner_rel_bytes by hash_table_bytes - bucket_bytes rather than hash_table_bytes. I think that avoids the need to maybe increase nbatch again afterwards.

Yes, I like those changes and I think your reasoning is correct in both cases. It certainly makes the method shorter and more readable - I was too "stuck" in the original logic, so thanks for improving this.

So +1 from me to those changes.

> I'm comfortable with this version if you are, but (maybe as a follow-on commit) I think we could make this even a bit smarter. If inner_rel_bytes + bucket_bytes > hash_table_bytes but inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce the number of buckets by 2x or 4x to make it fit. That would provide a bit of additional safety margin against cases where this patch might unluckily lose.

I don't think that's a good idea. That essentially says 'we're shooting for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where batching (because of smaller work_mem) actually significantly improves performance. If we want to make this reasoning properly, deciding whether to add batches or reduce buckets, we need a better heuristic - that's quite complex and I'd expect it to result in a quite complex patch.

So -1 from me to this at this moment (it certainly needs much more thought than a mere follow-on commit).

regards
Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 12.9.2014 18:49, Robert Haas wrote: >> On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>>> Attached is the patch split as suggested: >>>> >>>> (a) hashjoin-nbuckets-v14a-size.patch >>>> >>>> * NTUP_PER_BUCKET=1 >>>> * counting buckets towards work_mem >>>> * changes in ExecChooseHashTableSize (due to the other changes) >>> >>> OK, I'm going to work on this one now. I have some ideas about how to >>> simplify this a bit and will post an update in a few hours once I see >>> how that pans out. >> >> Here's what I came up with. I believe this has the same semantics as >> your patch for less code, but please check, because I might be >> wrong. The main simplifications I made were: >> >> (1) In master, we decide whether or not to batch based on the size >> of the data, without knowing the number of buckets. If we want to >> account for the bucket overhead, that doesn't work. But in your >> patch, we basically computed the number of buckets we'd need for the >> single-batch case, then decide whether to do a single batch, and if >> yes, computed the same values over again with the same formula >> phrased differently. I revised that so that the single-batch case is >> calculated only once. If everything fits, we're all set, and don't >> need to recalculate anything. >> >> (2) In your patch, once we decide we need to batch, you set nbatch >> as now, and then check whether the computed value is still adequate >> after subtracting buckets_byte from hash_table_bytes. I revised that >> so that we make the initial batch estimate of dbatch by dividing >> inner_rel_bytes by hash_table_bytes - bucket_bytes rather than >> hash_table_bytes. I think that avoids the need to maybe increase >> nbatch again afterwards. > > Yes, I like those changes and I think your reasoning is correct in both > cases. It certainly makes the method shorter and more readable - I was > too "stuck" in the original logic, so thanks for improving this. > > So +1 from me to those changes. OK, committed. Please rebase the rest. >> I'm comfortable with this version if you are, but (maybe as a >> follow-on commit) I think we could make this even a bit smarter. If >> inner_rel_bytes + bucket_bytes > hash_table_bytes but >> inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce >> the number of buckets by 2x or 4x to make it fit. That would provide >> a bit of additional safety margin against cases where this patch >> might unluckily lose. > > I don't think that's a good idea. That essentially says 'we're shooting > for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where > batching (because of smaller work_mem) actually significantly improves > performance. If we want to make this reasoning properly, deciding > whether to add batches or reduce buckets, we need a better heuristics - > that's quite complex and I'd expect it to result ain a quite complex patch. > > So -1 from me to this at this moment (it certainly needs much more > thought than a mere follow-on commit). OK, but let's discuss further. You make it sound like treating NTUP_PER_BUCKET as a soft limit is a bad thing, but I'm not convinced. I think it's perfectly reasonable to treat NTUP_PER_BUCKET as a range: if we can get NTUP_PER_BUCKET=1, great, but if not, NTUP_PER_BUCKET=2 or NTUP_PER_BUCKET=4 may be perfectly reasonable. 
I'm actually quite surprised that you find batching to be a better strategy than skimping on buckets, because I would have expected the opposite, almost categorically. Batching means having to write out the tuples we can't process right away and read them back in. If that involves disk I/O, I think the cost of that I/O is going to be FAR more than the overhead we would have incurred by searching slightly longer bucket chains. If it didn't, then you could've set work_mem higher and avoided batching in the first place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
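For concreteness, the "reduce the number of buckets instead of batching" idea being debated above amounts to roughly the following. Nothing like this was committed at this point in the thread; the function name and signature are invented for illustration.

    #include <stdbool.h>

    /*
     * Sketch of the proposed follow-on tweak (not committed code): if the
     * build side would fit in work_mem with a bucket array up to 4x smaller,
     * halve nbuckets (NTUP_PER_BUCKET effectively becomes 2, then 4) instead
     * of switching to multiple batches.
     */
    static bool
    try_fewer_buckets(double inner_rel_bytes, long hash_table_bytes,
                      int *nbuckets, long *bucket_bytes)
    {
        /* Hopeless even with a 4x smaller bucket array?  Then batch. */
        if (inner_rel_bytes + *bucket_bytes / 4 > hash_table_bytes)
            return false;

        /* Otherwise halve the bucket count until everything fits. */
        while (*nbuckets > 1 &&
               inner_rel_bytes + *bucket_bytes > hash_table_bytes)
        {
            *nbuckets /= 2;
            *bucket_bytes /= 2;
        }
        return true;                    /* stay at one batch, with longer chains */
    }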
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 12.9.2014 22:24, Robert Haas wrote: > On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> On 12.9.2014 18:49, Robert Haas wrote: >>> I'm comfortable with this version if you are, but (maybe as a >>> follow-on commit) I think we could make this even a bit smarter. If >>> inner_rel_bytes + bucket_bytes > hash_table_bytes but >>> inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce >>> the number of buckets by 2x or 4x to make it fit. That would provide >>> a bit of additional safety margin against cases where this patch >>> might unluckily lose. >> >> I don't think that's a good idea. That essentially says 'we're shooting >> for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where >> batching (because of smaller work_mem) actually significantly improves >> performance. If we want to make this reasoning properly, deciding >> whether to add batches or reduce buckets, we need a better heuristics - >> that's quite complex and I'd expect it to result ain a quite complex patch. >> >> So -1 from me to this at this moment (it certainly needs much more >> thought than a mere follow-on commit). > > OK, but let's discuss further. You make it sound like treating > NTUP_PER_BUCKET as a soft limit is a bad thing, but I'm not convinced. > I think it's perfectly reasonable to treat NTUP_PER_BUCKET as a range: > if we can get NTUP_PER_BUCKET=1, great, but if not, NTUP_PER_BUCKET=2 > or NTUP_PER_BUCKET=4 may be perfectly reasonable. I think this really depends on various factors. For example what may work for small work_mem (fitting into L2 cache) may not work for large values, etc. > I'm actually quite surprised that you find batching to be a better > strategy than skimping on buckets, because I would have expect the > opposite, almost categorically. Batching means having to write out > the tuples we can't process right away and read them back in. If > that involves disk I/O, I think the cost of that I/O is going to be > FAR more than the overhead we would have incurred by searching > slightly longer bucket chains. If it didn't, then you could've set > work_mem higher and avoided batching in the first place. No, I don't find batching to be a better strategy. I just think this really needs more discussion than a simple "let's use NTUP_PER_BUCKET=4 to avoid batching" follow-up patch. For example, let's say we switch to NTUP_PER_BUCKET=4 to avoid batching, and then discover we need to start batching anyway. Should we keep the settings, or should we revert NTUP_PER_BUCKET=1? Or maybe not doing that for nbatch=2, but for nbatch=16? I'm not saying it's wrong, but I think it's more complicated than it seems from your description. regards Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 12.9.2014 22:24, Robert Haas wrote: > On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> >> Yes, I like those changes and I think your reasoning is correct in both >> cases. It certainly makes the method shorter and more readable - I was >> too "stuck" in the original logic, so thanks for improving this. >> >> So +1 from me to those changes. > > OK, committed. Please rebase the rest. Attached is v15 of the patch, rebased to current HEAD. Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Fri, Sep 12, 2014 at 4:55 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> I'm actually quite surprised that you find batching to be a better >> strategy than skimping on buckets, because I would have expect the >> opposite, almost categorically. Batching means having to write out >> the tuples we can't process right away and read them back in. If >> that involves disk I/O, I think the cost of that I/O is going to be >> FAR more than the overhead we would have incurred by searching >> slightly longer bucket chains. If it didn't, then you could've set >> work_mem higher and avoided batching in the first place. > > No, I don't find batching to be a better strategy. I just think this > really needs more discussion than a simple "let's use NTUP_PER_BUCKET=4 > to avoid batching" follow-up patch. > > For example, let's say we switch to NTUP_PER_BUCKET=4 to avoid batching, > and then discover we need to start batching anyway. Should we keep the > settings, or should we revert NTUP_PER_BUCKET=1? Or maybe not doing that > for nbatch=2, but for nbatch=16? My first thought is to revert to NTUP_PER_BUCKET=1, but it's certainly arguable. Either method, though, figures to be better than doing nothing, so let's do something. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 12.9.2014 23:22, Robert Haas wrote: > On Fri, Sep 12, 2014 at 4:55 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>> I'm actually quite surprised that you find batching to be a >>> better strategy than skimping on buckets, because I would have >>> expect the opposite, almost categorically. Batching means having >>> to write out the tuples we can't process right away and read them >>> back in. If that involves disk I/O, I think the cost of that I/O >>> is going to be FAR more than the overhead we would have incurred >>> by searching slightly longer bucket chains. If it didn't, then >>> you could've set work_mem higher and avoided batching in the >>> first place. >> >> No, I don't find batching to be a better strategy. I just think >> this really needs more discussion than a simple "let's use >> NTUP_PER_BUCKET=4 to avoid batching" follow-up patch. >> >> For example, let's say we switch to NTUP_PER_BUCKET=4 to avoid >> batching, and then discover we need to start batching anyway. >> Should we keep the settings, or should we revert NTUP_PER_BUCKET=1? >> Or maybe not doing that for nbatch=2, but for nbatch=16? > > My first thought is to revert to NTUP_PER_BUCKET=1, but it's > certainly arguable. Either method, though, figures to be better than > doing nothing, so let's do something. OK, but can we commit the remaining part first? Because it'll certainly interact with this proposed part, and it's easier to tweak when the code is already there. Instead of rebasing over and over. Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Kevin Grittner
Date:
Tomas Vondra <tv@fuzzy.cz> wrote:
> On 12.9.2014 23:22, Robert Haas wrote:
>> My first thought is to revert to NTUP_PER_BUCKET=1, but it's certainly arguable. Either method, though, figures to be better than doing nothing, so let's do something.
>
> OK, but can we commit the remaining part first? Because it'll certainly interact with this proposed part, and it's easier to tweak when the code is already there. Instead of rebasing over and over.

The patch applied and built without problem, and passes `make check-world`. The only thing that caught my eye was the addition of debug code using printf() instead of logging at a DEBUG level. Is there any reason for that?

I still need to work through the (rather voluminous) email threads and the code to be totally comfortable with all the issues, but if Robert and Heikki are comfortable with it, I'm not gonna argue.

Preliminary benchmarks look outstanding! I tried to control carefully to ensure consistent results (by dropping, recreating, vacuuming, analyzing, and checkpointing before each test), but there were surprising outliers in the unpatched version. It turned out that it was because slight differences in the random samples caused different numbers of buckets for both unpatched and patched, but the patched version dealt with the difference gracefully while the unpatched version's performance fluctuated randomly. My thinking is that we should get this much committed and then discuss further optimizations.

I tried to throw something at it that would be something of a "worst case" because with the new code it would start with one batch and go to two. Five runs per test, alternating between unpatched and patched. First I tried with the default work_mem size of 4MB:

Planning time / Execution time (ms)

unpatched, work_mem = 4MB
0.694 / 10392.930
0.327 / 10388.787
0.412 / 10533.036
0.650 / 17186.719
0.338 / 10707.423
Fast: Buckets: 2048 Batches: 1 Memory Usage: 3516kB
Slow: Buckets: 1024 Batches: 1 Memory Usage: 3516kB

patched, work_mem = 4MB
0.764 / 5021.792
0.655 / 4991.887
0.436 / 5068.777
0.410 / 5057.902
0.444 / 5021.543
1st & 5th: Buckets: 131072 (originally 1024) Batches: 2 (originally 1) Memory Usage: 3073kB
all others: Buckets: 131072 (originally 2048) Batches: 2 (originally 1) Memory Usage: 3073kB

Then, just to see how both dealt with extra memory I did it again with 1GB:

unpatched, work_mem = 1GB
0.328 / 10407.836
0.319 / 10465.348
0.324 / 16606.948
0.569 / 10408.671
0.542 / 16996.209
Memory usage same as before. Guess which runs started with 1024 buckets. ;-)

patched, work_mem = 1GB
0.556 / 3974.741
0.325 / 4012.613
0.334 / 3941.734
0.834 / 3894.391
0.603 / 3959.440
2nd & 4th: Buckets: 131072 (originally 2048) Batches: 1 (originally 1) Memory Usage: 4540kB
all others: Buckets: 131072 (originally 1024) Batches: 1 (originally 1) Memory Usage: 4540kB

My only concern from the benchmarks is that it seemed like there was a statistically significant increase in planning time:

unpatched plan time average: 0.450 ms
patched plan time average: 0.536 ms

That *might* just be noise, but it seems likely to be real. For the improvement in run time, I'd put up with an extra 86us in planning time per hash join; but if there's any way to shave some of that off, all the better.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Heikki Linnakangas
Date:
On 10/02/2014 03:20 AM, Kevin Grittner wrote: > My only concern from the benchmarks is that it seemed like there > was a statistically significant increase in planning time: > > unpatched plan time average: 0.450 ms > patched plan time average: 0.536 ms > > That *might* just be noise, but it seems likely to be real. For > the improvement in run time, I'd put up with an extra 86us in > planning time per hash join; but if there's any way to shave some > of that off, all the better. The patch doesn't modify the planner at all, so it would be rather surprising if it increased planning time. I'm willing to just write that off as noise. - Heikki
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
"Tomas Vondra"
Date:
On 2 October 2014, 2:20, Kevin Grittner wrote:
> Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 12.9.2014 23:22, Robert Haas wrote:
>
>>> My first thought is to revert to NTUP_PER_BUCKET=1, but it's certainly arguable. Either method, though, figures to be better than doing nothing, so let's do something.
>>
>> OK, but can we commit the remaining part first? Because it'll certainly interact with this proposed part, and it's easier to tweak when the code is already there. Instead of rebasing over and over.
>
> The patch applied and built without problem, and passes `make check-world`. The only thing that caught my eye was the addition of debug code using printf() instead of logging at a DEBUG level. Is there any reason for that?

Not really. IIRC the main reason is that the other code in nodeHash.c uses the same approach.

> I still need to work through the (rather voluminous) email threads and the code to be totally comfortable with all the issues, but if Robert and Heikki are comfortable with it, I'm not gonna argue.

;-)

> Preliminary benchmarks look outstanding! I tried to control carefully to ensure consistent results (by dropping, recreating, vacuuming, analyzing, and checkpointing before each test), but there were surprising outliers in the unpatched version. It turned out that it was because slight differences in the random samples caused different numbers of buckets for both unpatched and patched, but the patched version dealt with the difference gracefully while the unpatched version's performance fluctuated randomly.

Can we get a bit more detail on the test cases? I did a number of tests while working on the patch, and I generally tested two cases:

(a) well-estimated queries (i.e. with nbucket matching NTUP_PER_BUCKET)
(b) mis-estimated queries, with various levels of accuracy (say, 10x - 1000x misestimates)

From the description, it seems you only tested (b) - is that correct?

The thing is that while the resize is very fast and happens only once, it's not perfectly free. In my tests, this was always more than compensated by the speedups (even in the weird cases reported by Stephen Frost), so I think we're safe here.

Also, I definitely recommend testing with different hash table sizes (say, work_mem=256MB and the hash table just enough to fit in without batching). The thing is the effect of CPU caches is very different for small and large hash tables. (This is not about work_mem alone, but about how much memory is used by the hash table - according to the results you posted it never gets over ~4.5MB)

You tested against current HEAD, right? This patch was split into three parts, two of which were already committed (45f6240a and 8cce08f1). The logic of the patch was "this might take a bit of time/memory, but it's compensated by these other changes". Maybe running the tests on the original code would be interesting? Although, if this last part of the patch actually improves the performance on its own, we're fine - it'll improve it even more compared to the old code (especially before lowering NTUP_PER_BUCKET from 10 to 1).

> My only concern from the benchmarks is that it seemed like there was a statistically significant increase in planning time:
>
> unpatched plan time average: 0.450 ms
> patched plan time average: 0.536 ms
>
> That *might* just be noise, but it seems likely to be real. For the improvement in run time, I'd put up with an extra 86us in planning time per hash join; but if there's any way to shave some of that off, all the better.
I agree with Heikki that this is probably noise, because the patch does not mess with the planner at all. The only thing I can think of is adding a few fields to HashJoinTableData. Maybe this makes the structure too large to fit into a cacheline, or something?

Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Kevin Grittner
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 10/02/2014 03:20 AM, Kevin Grittner wrote:
>> My only concern from the benchmarks is that it seemed like there was a statistically significant increase in planning time:
>>
>> unpatched plan time average: 0.450 ms
>> patched plan time average: 0.536 ms
>>
>> That *might* just be noise, but it seems likely to be real. For the improvement in run time, I'd put up with an extra 86us in planning time per hash join; but if there's any way to shave some of that off, all the better.
>
> The patch doesn't modify the planner at all, so it would be rather surprising if it increased planning time. I'm willing to just write that off as noise.

Fair enough. I have seen much larger variations caused by how the executable code happened to line up relative to cacheline boundaries, and we've never made any effort to manage that.

I've tried various other tests using \timing rather than EXPLAIN, and the patched version looks even better in those cases. I have seen up to 4x the performance for a query using the patched version, higher variability in run time without the patch, and have yet to devise a benchmark where the patched version came out slower (although I admit to not being as good at creating such cases as some here).

When given a generous work_mem setting the patched version often uses more of what it is allowed than the unpatched version (which is probably one of the reasons it tends to do better). If someone has set a high work_mem and has gotten by only because the configured amount is not all being used when it would benefit performance, they may need to adjust work_mem down to avoid memory problems. That doesn't seem to me to be a reason to reject the patch.

This is in "Waiting on Author" status only because I never got an answer about why the debug code used printf() rather than elog() at a DEBUG level. Other than that, I would say this patch is Ready for Committer. Tomas? You there?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 9.10.2014 16:55, Kevin Grittner wrote: > > I've tried various other tests using \timing rather than EXPLAIN, and > the patched version looks even better in those cases. I have seen up > to 4x the performance for a query using the patched version, higher > variability in run time without the patch, and have yet to devise a > benchmark where the patched version came out slower (although I admit > to not being as good at creating such cases as some here). Nice. Thanks for the testing! The only case I've been able to come up with is when the hash table fits into work_mem only thanks to not counting the buckets. The new code will start batching in this case. That is mostly luck, however, because it depends on the cardinality estimates, and the best way to fix it is increasing work_mem (which is safer thanks to reducing the overhead). Also, Robert proposed a way to mitigate this, if we realize we'd have to do batching during the initial sizing, we can peek whether reducing the number of buckets (to 1/2 or maybe 1/4) would help. I believe this is a good approach, and will look into that after pgconf.eu (i.e. early November), unless someone else is interested. > When given a generous work_mem setting the patched version often uses > more of what it is allowed than the unpatched version (which is > probably one of the reasons it tends to do better). If someone has > set a high work_mem and has gotten by only because the configured > amount is not all being used when it would benefit performance, they > may need to adjust work_mem down to avoid memory problems. That > doesn't seem to me to be a reason to reject the patch. I'm not entirely sure I understand this paragraph. What do you mean by "configured amount is not all being used when it would benefit performance"? Can you give an example? The only thing I can think of is the batching behavior described above. > This is in "Waiting on Author" status only because I never got an > answer about why the debug code used printf() rather the elog() at > a DEBUG level. Other than that, I would say this patch is Ready > for Committer. Tomas? You there? I think I responded to that on October 2, quoting: =================================================================== On 2.10.2014 09:50, Tomas Vondra wrote: > On 2.10.2014, 2:20, Kevin Grittner wrote: >> >> The patch applied and built without problem, and pass `make >> check-world`. The only thing that caught my eye was the addition of >> debug code using printf() instead of logging at a DEBUG level. Is >> there any reason for that? > > Not really. IIRC the main reason it that the other code in > nodeHash.c uses the same approach. =================================================================== I believe it's safe to switch the logging to elog(). IMHO the printf logging is there from some very early version of the code, before elog was introduced. Or something like that. Tomas
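For illustration, the printf-to-elog change being discussed would have roughly this shape; the message text and variable name are hypothetical, not taken from the patch:

    /* before: debug output written straight to the server's stdout
     * (message text and nbuckets_new are hypothetical) */
    printf("hashjoin: increasing nbuckets to %d\n", nbuckets_new);

    /* after: the same message routed through the regular logging machinery,
     * emitted only when log_min_messages is DEBUG1 or lower; elog() adds
     * the trailing newline itself */
    elog(DEBUG1, "hashjoin: increasing nbuckets to %d", nbuckets_new);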
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Kevin Grittner
Date:
Tomas Vondra <tv@fuzzy.cz> wrote: > On 9.10.2014 16:55, Kevin Grittner wrote: >> I've tried various other tests using \timing rather than EXPLAIN, and >> the patched version looks even better in those cases. I have seen up >> to 4x the performance for a query using the patched version, higher >> variability in run time without the patch, and have yet to devise a >> benchmark where the patched version came out slower (although I admit >> to not being as good at creating such cases as some here). > > Nice. Thanks for the testing! > > The only case I've been able to come up with is when the hash table fits > into work_mem only thanks to not counting the buckets. The new code will > start batching in this case. Hmm. If you look at the timings in my posting from 2014-10-01 I had cases where the patched version started with one batch and went to two, while the unpatched used just one batch, and the patched version was still more than twice as fast. I'm sure the "on disk" batch was fully cached, however; it might work out differently if disk speed actually came into the picture. > That is mostly luck, however, because it depends on the cardinality > estimates, and the best way to fix it is increasing work_mem (which is > safer thanks to reducing the overhead). > > Also, Robert proposed a way to mitigate this, if we realize we'd have to > do batching during the initial sizing, we can peek whether reducing the > number of buckets (to 1/2 or maybe 1/4) would help. I believe this is a > good approach, and will look into that after pgconf.eu (i.e. early > November), unless someone else is interested. Sure, but it would be good to confirm that it's actually needed first. >> When given a generous work_mem setting the patched version often uses >> more of what it is allowed than the unpatched version (which is >> probably one of the reasons it tends to do better). If someone has >> set a high work_mem and has gotten by only because the configured >> amount is not all being used when it would benefit performance, they >> may need to adjust work_mem down to avoid memory problems. That >> doesn't seem to me to be a reason to reject the patch. > > I'm not entirely sure I understand this paragraph. What do you mean by > "configured amount is not all being used when it would benefit > performance"? Can you give an example? Again, the earlier post showed that while the unpatched used 3516kB whether it had work_mem set to 4MB or 1GB, the patched version used 3073kB when work_mem was set to 4MB and 4540kB when work_mem was set to 1GB. The extra memory allowed the patched version to stay at 1 batch, improving performance over the other setting. > The only thing I can think of is the batching behavior described above. > >> This is in "Waiting on Author" status only because I never got an >> answer about why the debug code used printf() rather the elog() at >> a DEBUG level. Other than that, I would say this patch is Ready >> for Committer. Tomas? You there? > > I think I responded to that on October 2, quoting: > > =================================================================== > On 2.10.2014 09:50, Tomas Vondra wrote: >> On 2.10.2014, 2:20, Kevin Grittner wrote: >>> >>> The patch applied and built without problem, and pass `make >>> check-world`. The only thing that caught my eye was the addition of >>> debug code using printf() instead of logging at a DEBUG level. Is >>> there any reason for that? >> >> Not really. IIRC the main reason it that the other code in >> nodeHash.c uses the same approach. 
>=================================================================== Ah, I never received that email. That tends to happen every now and then. :-( > I believe it's safe to switch the logging to elog(). IMHO the printf > logging is there from some very early version of the code, before elog > was introduced. Or something like that. I guess it might be best to remain consistent in this patch and change that in a separate patch. I just wanted to make sure you didn't see any reason not to do so. With that addressed I will move this to Ready for Committer. Since both Heikki and Robert spent time on this patch earlier, I'll give either of them a shot at committing it if they want; otherwise I'll do it. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
Hi! On 9.10.2014 22:28, Kevin Grittner wrote: > Tomas Vondra <tv@fuzzy.cz> wrote: >> >> The only case I've been able to come up with is when the hash table >> fits into work_mem only thanks to not counting the buckets. The new >> code will start batching in this case. > > Hmm. If you look at the timings in my posting from 2014-10-01 I had > cases where the patched version started with one batch and went to > two, while the unpatched used just one batch, and the patched version > was still more than twice as fast. I'm sure the "on disk" batch was > fully cached, however; it might work out differently if disk speed > actually came into the picture. I think the case with large outer table and memory pressure, forcing the batches to be actually written to disk (instead of just being in the page cache) is the main concern here. >> That is mostly luck, however, because it depends on the cardinality >> estimates, and the best way to fix it is increasing work_mem (which is >> safer thanks to reducing the overhead). >> >> Also, Robert proposed a way to mitigate this, if we realize we'd >> have to do batching during the initial sizing, we can peek whether >> reducing the number of buckets (to 1/2 or maybe 1/4) would help. I >> believe this is a good approach, and will look into that after >> pgconf.eu (i.e. early November), unless someone else is interested. > > Sure, but it would be good to confirm that it's actually needed first. Yeah. But with cleverly crafted test case it's possible to confirm almost everything ;-) >>> When given a generous work_mem setting the patched version often uses >>> more of what it is allowed than the unpatched version (which is >>> probably one of the reasons it tends to do better). If someone has >>> set a high work_mem and has gotten by only because the configured >>> amount is not all being used when it would benefit performance, they >>> may need to adjust work_mem down to avoid memory problems. That >>> doesn't seem to me to be a reason to reject the patch. >> >> I'm not entirely sure I understand this paragraph. What do you mean by >> "configured amount is not all being used when it would benefit >> performance"? Can you give an example? > > Again, the earlier post showed that while the unpatched used 3516kB > whether it had work_mem set to 4MB or 1GB, the patched version used > 3073kB when work_mem was set to 4MB and 4540kB when work_mem was > set to 1GB. The extra memory allowed the patched version to stay > at 1 batch, improving performance over the other setting. OK, so with 4MB the new version was batching, while the original code keeps a single batch (likely due to not counting buckets into work_mem). I believe that's expected behavior. Also, in the e-mail from Oct 2 that got lost [1], I pointed out that by testing only with small hash tables the results are likely influenced by high CPU cache hit ratio. I think benchmarking with larger inner relation (thus increasing the memory occupied by the hash table) might be interesting. [1] http://www.postgresql.org/message-id/c84680e818f6a0f4a26223cd750ff252.squirrel@2.emaily.eu > >> The only thing I can think of is the batching behavior described above. >> >>> This is in "Waiting on Author" status only because I never got an >>> answer about why the debug code used printf() rather the elog() at >>> a DEBUG level. Other than that, I would say this patch is Ready >>> for Committer. Tomas? You there? >> >> I think I responded to that on October 2, quoting: ... > Ah, I never received that email. 
That tends to happen every now > and then. :-( > >> I believe it's safe to switch the logging to elog(). IMHO the printf >> logging is there from some very early version of the code, before elog >> was introduced. Or something like that. > > I guess it might be best to remain consistent in this patch and > change that in a separate patch. I just wanted to make sure you > didn't see any reason not to do so. OK, I'll put that on my TODO list for the next CF. > With that addressed I will move this to Ready for Committer. Since > both Heikki and Robert spent time on this patch earlier, I'll give > either of them a shot at committing it if they want; otherwise I'll > do it. Great, thanks! Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Kevin Grittner
Date:
Kevin Grittner <kgrittn@ymail.com> wrote: > Since both Heikki and Robert spent time on this patch earlier, > I'll give either of them a shot at committing it if they want; > otherwise I'll do it. Done. Thanks, Tomas! -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company