Thread: bad estimation together with large work_mem generates terrible slow hash joins
bad estimation together with large work_mem generates terrible slow hash joins
From
Pavel Stehule
Date:
Hello all,
today I had to work with one slow query - when I checked different scenarios I found a dependency on work_mem size - but it is atypical: a bigger work_mem makes the query slower - 31 minutes of execution with 600MB work_mem versus 1 minute with 1MB.

db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '600MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT "f_users_aax8qg0e23asx5h"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN "f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS "c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS "c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS "def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( ( "f_emails_aaekn159p5aw3t8" INNER JOIN "f_users_aax8qg0e23asx5h" ON ( "f_emails_aaekn159p5aw3t8"."userid_id" = "f_users_aax8qg0e23asx5h"."id" ) ) INNER JOIN "dwh_dm_abnblk9j5oagorw" ON ( "f_users_aax8qg0e23asx5h"."dt_registrationdatetime_id" = "dwh_dm_abnblk9j5oagorw"."id" ) ) WHERE ( 2014 = "dwh_dm_abnblk9j5oagorw"."id_year" ) GROUP BY 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2145957.37..2145957.96 rows=59 width=12) (actual time=1864088.145..1864088.155 rows=31 loops=1)
-> Hash Join (cost=882573.27..2142753.08 rows=213619 width=12) (actual time=9083.326..1859069.171 rows=7070141 loops=1)
Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = f_users_aax8qg0e23asx5h.id)
-> Seq Scan on f_emails_aaekn159p5aw3t8 (cost=0.00..854559.95 rows=32278695 width=12) (actual time=0.006..14271.239 rows=32211565 loops=1)
-> Hash (cost=881801.58..881801.58 rows=61735 width=8) (actual time=9076.153..9076.153 rows=3310768 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 129327kB
-> Hash Join (cost=782.15..881801.58 rows=61735 width=8) (actual time=1.423..8086.228 rows=3310768 loops=1)
Hash Cond: (f_users_aax8qg0e23asx5h.dt_registrationdatetime_id = dwh_dm_abnblk9j5oagorw.id)
-> Seq Scan on f_users_aax8qg0e23asx5h (cost=0.00..845420.42 rows=9328442 width=12) (actual time=0.015..4098.915 rows=9322096 loops=1)
-> Hash (cost=777.59..777.59 rows=365 width=4) (actual time=1.400..1.400 rows=365 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Bitmap Heap Scan on dwh_dm_abnblk9j5oagorw (cost=11.08..777.59 rows=365 width=4) (actual time=0.111..1.218 rows=365 loops=1)
Recheck Cond: (2014 = id_year)
-> Bitmap Index Scan on dwh_dm_abnblk9j5oagorw_id_year_idx (cost=0.00..10.99 rows=365 width=0) (actual time=0.068..0.068 rows=365 loops=1)
Index Cond: (2014 = id_year)
Total runtime: 1864107.373 ms
(16 rows)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '1MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT "f_users_aax8qg0e23asx5h"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN "f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS "c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS "c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS "def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( ( "f_emails_aaekn159p5aw3t8" INNER JOIN "f_users_aax8qg0e23asx5h" ON ( "f_emails_aaekn159p5aw3t8"."userid_id" = "f_users_aax8qg0e23asx5h"."id" ) ) INNER JOIN "dwh_dm_abnblk9j5oagorw" ON ( "f_users_aax8qg0e23asx5h"."dt_registrationdatetime_id" = "dwh_dm_abnblk9j5oagorw"."id" ) ) WHERE ( 2014 = "dwh_dm_abnblk9j5oagorw"."id_year" ) GROUP BY 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2275675.45..2275675.95 rows=50 width=12) (actual time=48438.407..48438.416 rows=31 loops=1)
-> Hash Join (cost=882772.88..2272526.68 rows=209918 width=12) (actual time=9384.610..45211.963 rows=6988804 loops=1)
Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = f_users_aax8qg0e23asx5h.id)
-> Seq Scan on f_emails_aaekn159p5aw3t8 (cost=0.00..839754.64 rows=31719464 width=12) (actual time=0.034..14299.338 rows=31653392 loops=1)
-> Hash (cost=881759.44..881759.44 rows=61715 width=8) (actual time=9368.371..9368.371 rows=3310768 loops=1)
Buckets: 4096 Batches: 256 (originally 4) Memory Usage: 1025kB
-> Hash Join (cost=782.15..881759.44 rows=61715 width=8) (actual time=3.839..8223.097 rows=3310768 loops=1)
Hash Cond: (f_users_aax8qg0e23asx5h.dt_registrationdatetime_id = dwh_dm_abnblk9j5oagorw.id)
-> Seq Scan on f_users_aax8qg0e23asx5h (cost=0.00..845389.92 rows=9325392 width=12) (actual time=0.017..4118.598 rows=9322096 loops=1)
-> Hash (cost=777.59..777.59 rows=365 width=4) (actual time=3.804..3.804 rows=365 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Bitmap Heap Scan on dwh_dm_abnblk9j5oagorw (cost=11.08..777.59 rows=365 width=4) (actual time=0.188..3.641 rows=365 loops=1)
Recheck Cond: (2014 = id_year)
-> Bitmap Index Scan on dwh_dm_abnblk9j5oagorw_id_year_idx (cost=0.00..10.99 rows=365 width=0) (actual time=0.115..0.115 rows=365 loops=1)
Index Cond: (2014 = id_year)
Total runtime: 48439.958 ms
(16 rows)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '10MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT "f_users_aax8qg0e23asx5h"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN "f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS "c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS "c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS "def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( ( "f_emails_aaekn159p5aw3t8" INNER JOIN "f_users_aax8qg0e23asx5h" ON ( "f_emails_aaekn159p5aw3t8"."userid_id" = "f_users_aax8qg0e23asx5h"."id" ) ) INNER JOIN "dwh_dm_abnblk9j5oagorw" ON ( "f_users_aax8qg0e23asx5h"."dt_registrationdatetime_id" = "dwh_dm_abnblk9j5oagorw"."id" ) ) WHERE ( 2014 = "dwh_dm_abnblk9j5oagorw"."id_year" ) GROUP BY 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2124026.77..2124027.27 rows=50 width=12) (actual time=100739.984..100739.998 rows=31 loops=1)
-> Hash Join (cost=882530.88..2120878.00 rows=209918 width=12) (actual time=9467.702..97238.959 rows=6988804 loops=1)
Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = f_users_aax8qg0e23asx5h.id)
-> Seq Scan on f_emails_aaekn159p5aw3t8 (cost=0.00..839754.64 rows=31719464 width=12) (actual time=0.015..9185.978 rows=31653392 loops=1)
-> Hash (cost=881759.44..881759.44 rows=61715 width=8) (actual time=9466.440..9466.440 rows=3310768 loops=1)
Buckets: 8192 Batches: 16 (originally 1) Memory Usage: 10241kB
-> Hash Join (cost=782.15..881759.44 rows=61715 width=8) (actual time=3.681..8183.423 rows=3310768 loops=1)
Hash Cond: (f_users_aax8qg0e23asx5h.dt_registrationdatetime_id = dwh_dm_abnblk9j5oagorw.id)
-> Seq Scan on f_users_aax8qg0e23asx5h (cost=0.00..845389.92 rows=9325392 width=12) (actual time=0.012..4064.044 rows=9322096 loops=1)
-> Hash (cost=777.59..777.59 rows=365 width=4) (actual time=3.654..3.654 rows=365 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Bitmap Heap Scan on dwh_dm_abnblk9j5oagorw (cost=11.08..777.59 rows=365 width=4) (actual time=0.129..3.449 rows=365 loops=1)
Recheck Cond: (2014 = id_year)
-> Bitmap Index Scan on dwh_dm_abnblk9j5oagorw_id_year_idx (cost=0.00..10.99 rows=365 width=0) (actual time=0.077..0.077 rows=365 loops=1)
Index Cond: (2014 = id_year)
Total runtime: 100740.552 ms
(16 rows)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT "stehule"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN "f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS "c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS "c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS "def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( ( "f_emails_aaekn159p5aw3t8" INNER JOIN stehule on "f_emails_aaekn159p5aw3t8"."userid_id" = stehule."id" ) ) GROUP BY 1 ;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2073456.55..2073457.23 rows=68 width=12) (actual time=58724.094..58724.106 rows=31 loops=1)
-> Hash Join (cost=89142.28..1589276.13 rows=32278695 width=12) (actual time=2191.019..55499.331 rows=7070141 loops=1)
Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = stehule.id)
-> Seq Scan on f_emails_aaekn159p5aw3t8 (cost=0.00..854559.95 rows=32278695 width=12) (actual time=0.018..8364.554 rows=32211565 loops=1)
-> Hash (cost=47757.68..47757.68 rows=3310768 width=8) (actual time=2188.858..2188.858 rows=3310768 loops=1)
Buckets: 524288 Batches: 1 Memory Usage: 129327kB
-> Seq Scan on stehule (cost=0.00..47757.68 rows=3310768 width=8) (actual time=0.026..927.235 rows=3310768 loops=1)
Total runtime: 58741.224 ms
(8 rows)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# set work_mem to '1MB';
SET
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT "stehule"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN "f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS "c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS "c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS "def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( ( "f_emails_aaekn159p5aw3t8" INNER JOIN stehule on "f_emails_aaekn159p5aw3t8"."userid_id" = stehule."id" ) ) GROUP BY 1 ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2293499.45..2293500.13 rows=68 width=12) (actual time=37357.967..37357.976 rows=31 loops=1)
-> Hash Join (cost=102075.28..1809319.02 rows=32278695 width=12) (actual time=1814.232..34290.818 rows=7070141 loops=1)
Hash Cond: (f_emails_aaekn159p5aw3t8.userid_id = stehule.id)
-> Seq Scan on f_emails_aaekn159p5aw3t8 (cost=0.00..854559.95 rows=32278695 width=12) (actual time=0.007..14066.754 rows=32211565 loops=1)
-> Hash (cost=47757.68..47757.68 rows=3310768 width=8) (actual time=1806.453..1806.453 rows=3310768 loops=1)
Buckets: 4096 Batches: 256 Memory Usage: 518kB
-> Seq Scan on stehule (cost=0.00..47757.68 rows=3310768 width=8) (actual time=0.007..781.672 rows=3310768 loops=1)
Total runtime: 37359.264 ms
(8 rows)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# select version();
version
---------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.2.8 on x86_64-gdc-linux-gnu, with GoodData patches, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit
(1 row)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# \dt+ stehule
List of relations
Schema | Name | Type | Owner | Size | Description
--------+---------+-------+----------+--------+-------------
public | stehule | table | postgres | 114 MB |
(1 row)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# \dt+ f_users_aax8qg0e23asx5h
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-------------------------+-------+-------+---------+-------------
public | f_users_aax8qg0e23asx5h | table | beard | 5878 MB |
(1 row)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# \dt+ f_emails_aaekn159p5aw3t8
List of relations
Schema | Name | Type | Owner | Size | Description
--------+--------------------------+-------+-------+---------+-------------
public | f_emails_aaekn159p5aw3t8 | table | beard | 4156 MB |
(1 row)
db_kost07e2d9cdmg20b1takpqntobo6ghj=# explain analyze SELECT "f_users_aax8qg0e23asx5h"."firtsclickutmsource_id" AS "a_5078", COUNT( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN "f_emails_aaekn159p5aw3t8"."id" ELSE CAST( NULL AS INT ) END ) ) AS "c_7a2a545bf904a48a", COUNT( "f_emails_aaekn159p5aw3t8"."id" ) AS "c_9e301e13350cc0dc", ( MAX( ( CASE WHEN ( 89388 = "f_emails_aaekn159p5aw3t8"."read_id" ) THEN 0 END ) ) IS NOT NULL ) AS "def_7a2a545bf904a48a", TRUE AS "def_9e301e13350cc0dc" FROM ( ( "f_emails_aaekn159p5aw3t8" INNER JOIN "f_users_aax8qg0e23asx5h" ON ( "f_emails_aaekn159p5aw3t8"."userid_id" = "f_users_aax8qg0e23asx5h"."id" ) ) INNER JOIN "dwh_dm_abnblk9j5oagorw" ON ( "f_users_aax8qg0e23asx5h"."dt_registrationdatetime_id" = "dwh_dm_abnblk9j5oagorw"."id" ) ) WHERE ( 2014 = "dwh_dm_abnblk9j5oagorw"."id_year" ) GROUP BY 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=40076389.70..40076448.70 rows=59 width=12) (actual time=31124.643..31124.658 rows=31 loops=1)
-> Nested Loop (cost=2773.92..40073185.42 rows=213619 width=12) (actual time=37.945..27670.473 rows=7070141 loops=1)
-> Hash Join (cost=2773.92..10180068.58 rows=61735 width=8) (actual time=0.951..8934.170 rows=3310768 loops=1)
Hash Cond: (f_users_aax8qg0e23asx5h.dt_registrationdatetime_id = dwh_dm_abnblk9j5oagorw.id)
-> Seq Scan on f_users_aax8qg0e23asx5h (cost=0.00..10080578.00 rows=9328442 width=12) (actual time=0.004..4297.579 rows=9322096 loops=1)
-> Hash (cost=2408.01..2408.01 rows=365 width=4) (actual time=0.919..0.919 rows=365 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Bitmap Heap Scan on dwh_dm_abnblk9j5oagorw (cost=386.27..2408.01 rows=365 width=4) (actual time=0.104..0.733 rows=365 loops=1)
Recheck Cond: (2014 = id_year)
-> Bitmap Index Scan on dwh_dm_abnblk9j5oagorw_id_year_idx (cost=0.00..386.18 rows=365 width=0) (actual time=0.059..0.059 rows=365 loops=1)
Index Cond: (2014 = id_year)
-> Index Scan using f_emails_aaekn159p5aw3t8_userid_id_idx on f_emails_aaekn159p5aw3t8 (cost=0.00..396.22 rows=88 width=12) (actual time=0.002..0.004 rows=2 loops=3310768)
Index Cond: (userid_id = f_users_aax8qg0e23asx5h.id)
Total runtime: 31125.975 ms
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
Hi,

On 2014-06-26 14:10, Pavel Stehule wrote:
> Hello all,
>
> today I had to work with one slow query - when I checked different scenarios I found a dependency on work_mem size - but it is atypical - bigger work_mem increased query execution to 31 minutes (600MB work_mem) versus 1 minute (1MB).

The problem described in Pavel's emails (illustrated by the awful explain plans) is in this part:

  -> Hash (cost=881801.58..881801.58 rows=61735 width=8) (actual time=9076.153..9076.153 rows=3310768 loops=1)

That is, the estimated number of rows is ~60k, but in reality it's ~3.3M. This then leads to a hash table with a small number of buckets (8192) containing a large number of tuples (~400 per bucket in this case) in a linked list, which significantly slows down the lookups during the hash join.

This issue is actually quite common - all it takes is a significant underestimation of the hashed relation, either because it's a complex join (thus inherently difficult to estimate) or because the WHERE conditions are not independent (see the example below).

The attached patch (early WIP, after ~30 minutes of hacking) addresses this by increasing the number of buckets whenever the average number of tuples per bucket exceeds 1.5x NTUP_PER_BUCKET.

======= Example ========

create table table_a (id int, fk_id int);
create table table_b (fk_id int, val_a int, val_b int);

insert into table_a select i, i from generate_series(1,10000000) s(i);
insert into table_b select i, mod(i,1000), mod(i,1000) from generate_series(1,10000000) s(i);

analyze table_a;
analyze table_b;

explain analyze select count(*) from table_a join table_b on (table_a.id = table_b.fk_id) where val_a < 10 and val_b < 10;

without the patch:

                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=385834.56..385834.57 rows=1 width=0) (actual time=49543.263..49543.264 rows=1 loops=1)
   ->  Hash Join  (cost=204069.89..385831.16 rows=1359 width=0) (actual time=923.751..49531.554 rows=100000 loops=1)
         Hash Cond: (table_a.id = table_b.fk_id)
         ->  Seq Scan on table_a  (cost=0.00..144247.77 rows=9999977 width=4) (actual time=0.104..967.090 rows=10000000 loops=1)
         ->  Hash  (cost=204052.90..204052.90 rows=1359 width=4) (actual time=923.615..923.615 rows=100000 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 3516kB
               ->  Seq Scan on table_b  (cost=0.00..204052.90 rows=1359 width=4) (actual time=0.086..910.656 rows=100000 loops=1)
                     Filter: ((val_a < 10) AND (val_b < 10))
                     Rows Removed by Filter: 9900000
 Planning time: 0.271 ms
 Execution time: 49545.053 ms
(11 rows)

with the patch:

                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=385834.56..385834.57 rows=1 width=0) (actual time=9780.346..9780.347 rows=1 loops=1)
   ->  Hash Join  (cost=204069.89..385831.16 rows=1359 width=0) (actual time=939.297..9772.256 rows=100000 loops=1)
         Hash Cond: (table_a.id = table_b.fk_id)
         ->  Seq Scan on table_a  (cost=0.00..144247.77 rows=9999977 width=4) (actual time=0.103..962.446 rows=10000000 loops=1)
         ->  Hash  (cost=204052.90..204052.90 rows=1359 width=4) (actual time=939.172..939.172 rows=100000 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 3516kB
               ->  Seq Scan on table_b  (cost=0.00..204052.90 rows=1359 width=4) (actual time=0.064..909.191 rows=100000 loops=1)
                     Filter: ((val_a < 10) AND (val_b < 10))
                     Rows Removed by Filter: 9900000
 Planning time: 0.276 ms
 Execution time: 9782.295 ms
(11 rows)

Time: 9784.392 ms

So the duration improved significantly - from ~52 seconds to ~10 seconds. The patch is not perfect, it needs a bit more love, as illustrated by the FIXME/TODO items. Feel free to comment.

regards
Tomas
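[A minimal standalone sketch of the growth trigger described above - this is not the actual nodeHash.c patch code; NTUP_PER_BUCKET and the 1.5x threshold come from the message, while the struct and function names are invented for illustration.]

/* sketch_bucket_growth.c - illustration only, not the patch */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NTUP_PER_BUCKET 10              /* historical default in nodeHash.c */

typedef struct SketchHashTable
{
    uint64_t    nbuckets;               /* current number of buckets (power of 2) */
    uint64_t    ntuples;                /* tuples inserted so far */
} SketchHashTable;

/* grow once the average chain length exceeds 1.5 * NTUP_PER_BUCKET */
static bool
needs_bucket_growth(const SketchHashTable *ht)
{
    return ht->ntuples > (uint64_t) (1.5 * NTUP_PER_BUCKET) * ht->nbuckets;
}

int
main(void)
{
    SketchHashTable ht = {8192, 0};     /* 8192 buckets, as in the bad plan above */

    /* simulate inserting the ~3.3M tuples from the underestimated plan */
    for (uint64_t i = 0; i < 3310768; i++)
    {
        ht.ntuples++;
        if (needs_bucket_growth(&ht))
            ht.nbuckets *= 2;           /* the real patch also relinks the tuples */
    }
    printf("final nbuckets: %llu\n", (unsigned long long) ht.nbuckets);
    return 0;
}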
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
Attached is v2 of the patch, with some cleanups / minor improvements:

* improved comments, whitespace fixes, TODOs etc.

* tracking the initial # of buckets (similar to the initial # of batches)

* adding info about buckets to EXPLAIN ANALYZE, similar to batches - I didn't want to make it overly complex, so the info about the initial bucket/batch count is added only if at least one of them is modified

* modified threshold triggering the growth, to get NTUP_PER_BUCKET on average (see the NTUP_GROW_THRESHOLD comment in nodeHash.c)

* there's a single FIXME, related to counting tuples in the batch

One thing that's important to note is the difference between the # of batches and the # of buckets. While the # of batches is "global", the # of buckets is "within a batch". So theoretically each batch can use a different number of buckets. However the value is reused between batches, so it only grows. That means this is possible:

initial: 1024 buckets (before 1st batch)
batch 1: 1024 buckets
batch 2: 1024 buckets
batch 3: 4096 buckets
batch 4: 8192 buckets

while this is not:

initial: 1024 buckets (before 1st batch)
batch 1: 1024 buckets
batch 2: 4096 buckets
batch 3: 1024 buckets
batch 4: 8192 buckets

However, in practice I expect the first batch to do all the work, and the following batches will just reuse the same number of buckets. This of course assumes the batches have similar tuple sizes etc. So the first batch will do all the reshuffling of the table, and the following batches will reuse the 'right' number of buckets from the start.

regards
Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 26.6.2014 20:43, Tomas Vondra wrote:
> Attached is v2 of the patch, with some cleanups / minor improvements:
>
> * there's a single FIXME, related to counting tuples in the batch

Meh, I couldn't resist resolving this FIXME, so attached is v3 of the patch. This just adds a proper 'batch tuples' counter to the hash table.

All comments, measurements on different queries etc. welcome. We'll certainly do a lot of testing, because this was a big issue for us.

regards
Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 26.6.2014 23:48, Tomas Vondra wrote:
> On 26.6.2014 20:43, Tomas Vondra wrote:
>> Attached is v2 of the patch, with some cleanups / minor improvements:
>>
>> * there's a single FIXME, related to counting tuples in the batch
>
> Meh, I couldn't resist resolving this FIXME, so attached is v3 of the patch. This just adds a proper 'batch tuples' counter to the hash table.
>
> All comments, measurements on different queries etc. welcome. We'll certainly do a lot of testing, because this was a big issue for us.

Attached is v4 of the patch, with a few minor improvements. The only thing worth mentioning is overflow protection, similar to what's done in the ExecChooseHashTableSize() function. Otherwise it's mostly about improving comments.

Also attached is a v4 with GUC, making it easier to compare the effect of the patch, by simply setting "enable_hashjoin_bucket" to "off" (original behaviour) or "on" (new behaviour).

And finally there's an SQL script demonstrating the effect of the patch with various work_mem settings. For example what I see on my desktop is this (averages from 3 runs):

===== SMALL WORK MEM (2MB) =====

            no dynamic buckets    dynamic buckets
 query A         5945 ms              5969 ms
 query B         6080 ms              5943 ms
 query C         6531 ms              6822 ms
 query D         6962 ms              6618 ms

===== MEDIUM WORK MEM (16MB) =====

            no dynamic buckets    dynamic buckets
 query A         7955 ms              7944 ms
 query B         9970 ms              7569 ms
 query C         8643 ms              8560 ms
 query D        33908 ms              7700 ms

===== LARGE WORK MEM (64MB) =====

            no dynamic buckets    dynamic buckets
 query A        10235 ms             10233 ms
 query B        32229 ms              9318 ms
 query C        14500 ms             10554 ms
 query D       213344 ms              9145 ms

Where "A" is "exactly estimated" and the other queries suffer from various underestimates.

My observations from this are:

(1) For small work_mem values it does not really matter, thanks to caching effects (the whole hash table fits into the L2 CPU cache).

(2) For medium work_mem values (not really huge, but exceeding CPU caches), the differences are negligible, except for the last query with the most severe underestimate. In that case the new behaviour is much faster.

(3) For large work_mem values, the speedup is pretty obvious and dependent on the underestimate.

The question is why to choose large work_mem values when the smaller values actually perform better. Well, the example tables are not perfectly representative. In case the outer table is much larger and does not fit into RAM that easily (which is the case of large fact tables or joins), the rescans (because of more batches) are more expensive and outweigh the caching benefits. Also, work_mem is shared with other nodes, e.g. aggregates, and decreasing it because of hash joins would hurt them.

regards
Tomas
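[As a rough usage sketch: the queries A-D live in the attached SQL script, which is not reproduced here, so the session below reuses the table_a/table_b query from the first message; the GUC name is the one mentioned above and exists only in the GUC build of the WIP patch.]

SET work_mem = '64MB';

SET enable_hashjoin_bucket = off;   -- original behaviour
EXPLAIN ANALYZE SELECT count(*)
  FROM table_a JOIN table_b ON (table_a.id = table_b.fk_id)
 WHERE val_a < 10 AND val_b < 10;

SET enable_hashjoin_bucket = on;    -- new behaviour (dynamic bucket count)
EXPLAIN ANALYZE SELECT count(*)
  FROM table_a JOIN table_b ON (table_a.id = table_b.fk_id)
 WHERE val_a < 10 AND val_b < 10;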
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
Hi,

attached is v5 of the patch. The main change is that scaling the number of buckets is done only once, after the initial hash table is built. The main advantage of this is a lower cost. This also allowed some cleanup of unnecessary code.

However, this new patch causes warnings like this:

  WARNING: temporary file leak: File 231 still referenced

I've been eyeballing the code for a while now, but I still fail to see where this comes from :-( Any ideas?

regards
Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 30.6.2014 23:12, Tomas Vondra wrote:
> Hi,
>
> attached is v5 of the patch. The main change is that scaling the number of buckets is done only once, after the initial hash table is built. The main advantage of this is a lower cost. This also allowed some cleanup of unnecessary code.
>
> However, this new patch causes warnings like this:
>
> WARNING: temporary file leak: File 231 still referenced
>
> I've been eyeballing the code for a while now, but I still fail to see where this comes from :-( Any ideas?

Meh, the patches are wrong - I hadn't realized the tight coupling between buckets/batches in ExecHashGetBucketAndBatch:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

The previous patches worked mostly by pure luck (the nbuckets was set only in the first batch), but once I moved the code to the end, it started failing. And by "worked" I mean "didn't throw an error, but probably returned bogus results with (nbatch>1)".

However, ISTM this nbuckets-nbatch coupling is not really necessary; it's only constructed this way to assign independent batch/bucket. So I went and changed the code like this:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

I.e. using the top bits for batchno, low bits for bucketno (as before).

Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far).

regards
Tomas
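[To make the change concrete, here is a small self-contained program comparing the two variants. The two pairs of assignments are the ones quoted above; the function names, the nbatch==1 guard and the hard-coded nbuckets/nbatch values are only scaffolding for the illustration.]

/* bucket_batch_split.c - illustration of the old vs. new bit split */
#include <stdint.h>
#include <stdio.h>

static void
split_old(uint32_t hashvalue, uint32_t nbuckets, int log2_nbuckets,
          uint32_t nbatch, uint32_t *bucketno, uint32_t *batchno)
{
    /* HEAD: low bits select the bucket, the bits right above them the batch */
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
}

static void
split_new(uint32_t hashvalue, uint32_t nbuckets, int log2_nbatch,
          uint32_t *bucketno, uint32_t *batchno)
{
    /* patched: low bits select the bucket, the topmost bits the batch
     * (guard added here so the single-batch case avoids a 32-bit shift) */
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (log2_nbatch == 0) ? 0 : (hashvalue >> (32 - log2_nbatch));
}

int
main(void)
{
    uint32_t    bucketno, batchno;
    uint32_t    hash = 0xDEADBEEF;

    split_old(hash, 1024, 10, 16, &bucketno, &batchno);
    printf("old: bucket=%u batch=%u\n", bucketno, batchno);

    split_new(hash, 1024, 4, &bucketno, &batchno);
    printf("new: bucket=%u batch=%u\n", bucketno, batchno);
    return 0;
}

[With the old formula the batch bits sit directly above the bucket bits, so letting nbuckets grow would change which bits feed the batch number; with the new split the bucket bits (bottom of the hash) and batch bits (top of the hash) stay disjoint as long as log2_nbuckets + log2_nbatch <= 32.]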
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Atri Sharma
Date:
On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 30.6.2014 23:12, Tomas Vondra wrote:
> > Hi,
> >
> > attached is v5 of the patch. The main change is that scaling the number of buckets is done only once, after the initial hash table is built. The main advantage of this is a lower cost. This also allowed some cleanup of unnecessary code.
> >
> > However, this new patch causes warnings like this:
> >
> > WARNING: temporary file leak: File 231 still referenced
> >
> > I've been eyeballing the code for a while now, but I still fail to see where this comes from :-( Any ideas?
>
> Meh, the patches are wrong - I hadn't realized the tight coupling between buckets/batches in ExecHashGetBucketAndBatch:
>
>     *bucketno = hashvalue & (nbuckets - 1);
>     *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
>
> The previous patches worked mostly by pure luck (the nbuckets was set only in the first batch), but once I moved the code to the end, it started failing. And by "worked" I mean "didn't throw an error, but probably returned bogus results with (nbatch>1)".
>
> However, ISTM this nbuckets-nbatch coupling is not really necessary; it's only constructed this way to assign independent batch/bucket. So I went and changed the code like this:
>
>     *bucketno = hashvalue & (nbuckets - 1);
>     *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
>
> I.e. using the top bits for batchno, low bits for bucketno (as before).
>
> Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far).
Hi,
I had a look at the patch and I was wondering if there is a way we can set a cap on the resized size, and increase the number of batches instead? Since this patch evaluates the growth of tuples vs the increase of space, we could look at increasing the number of batches itself instead of the number of buckets, if the resized number of buckets exceeds a threshold. It shouldn't be too hard; AIUI it would involve a call in MultiExecHash right after the new cost evaluation the patch adds.
It isn't a target of the current patch, but I think it is a logical extension of it.
Thoughts/Comments?
Regards,
Atri
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 1.7.2014 01:24, Tomas Vondra wrote:
> On 30.6.2014 23:12, Tomas Vondra wrote:
>> Hi,
>
> Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far).

Attached v7, fixing nbatch/ntuples in an assert.

regards
Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 3.7.2014 15:42, Atri Sharma wrote:
> On Tue, Jul 1, 2014 at 4:54 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 30.6.2014 23:12, Tomas Vondra wrote:
>>> Hi,
>>>
>>> attached is v5 of the patch. The main change is that scaling the number of buckets is done only once, after the initial hash table is built. The main advantage of this is a lower cost. This also allowed some cleanup of unnecessary code.
>>>
>>> However, this new patch causes warnings like this:
>>>
>>> WARNING: temporary file leak: File 231 still referenced
>>>
>>> I've been eyeballing the code for a while now, but I still fail to see where this comes from :-( Any ideas?
>>
>> Meh, the patches are wrong - I hadn't realized the tight coupling between buckets/batches in ExecHashGetBucketAndBatch:
>>
>>     *bucketno = hashvalue & (nbuckets - 1);
>>     *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
>>
>> The previous patches worked mostly by pure luck (the nbuckets was set only in the first batch), but once I moved the code to the end, it started failing. And by "worked" I mean "didn't throw an error, but probably returned bogus results with (nbatch>1)".
>>
>> However, ISTM this nbuckets-nbatch coupling is not really necessary; it's only constructed this way to assign independent batch/bucket. So I went and changed the code like this:
>>
>>     *bucketno = hashvalue & (nbuckets - 1);
>>     *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
>>
>> I.e. using the top bits for batchno, low bits for bucketno (as before).
>>
>> Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far).
>
> Hi,
>
> I had a look at the patch and I was wondering if there is a way we can set a cap on the resized size, and increase the number of batches instead? Since this patch evaluates the growth of tuples vs the increase of space, we could look at increasing the number of batches itself instead of the number of buckets, if the resized number of buckets exceeds a threshold. It shouldn't be too hard; AIUI it would involve a call in MultiExecHash right after the new cost evaluation the patch adds.

So you propose to fill the hash table, and when hitting NTUP_PER_BUCKET (i.e. after adding 'nbuckets * NTUP_PER_BUCKET' tuples) increase the number of batches?

Hmm, that'd be easy to implement. I don't think it's possible to do that in MultiExecHash (that's too late). But it's a matter of changing this condition in ExecHashTableInsert:

    if (hashtable->spaceUsed > hashtable->spaceAllowed)
        ExecHashIncreaseNumBatches(hashtable);

to something like this:

    if ((hashtable->spaceUsed > hashtable->spaceAllowed) ||
        (hashtable->totalTuples / hashtable->nbatch
             == NTUP_PER_BUCKET * hashtable->nbuckets))
        ExecHashIncreaseNumBatches(hashtable);

But I don't think increasing the number of batches is a good approach, as it means more rescans. I think using memory is more efficient (after all, that's what work_mem is for, right?).

regards
Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 3.7.2014 19:36, Tomas Vondra wrote:
> On 1.7.2014 01:24, Tomas Vondra wrote:
>> On 30.6.2014 23:12, Tomas Vondra wrote:
>>> Hi,
>>
>> Hopefully I got it right this time. At least it seems to be working for cases that failed before (no file leaks, proper rowcounts so far).
>
> Attached v7, fixing nbatch/ntuples in an assert.

Attached is v8 of this patch, significantly reworked based on the related discussion (mostly in 'tweaking NTUP_PER_BUCKET').

With the patch applied, the hash table works like this:

initial sizing (ExecChooseHashTableSize)
----------------------------------------
- size nbuckets for NTUP_PER_BUCKET=1
- account for memory allocated for buckets

building the hash table
-----------------------
- while adding tuples, keep track of the optimal number of buckets (for the current number of tuples per batch)
- don't resize until all the tuples are added (by increasing nbatch the number of buckets may decrease)
- after adding all tuples, consider resizing (optimal nbuckets != initial nbuckets)
- do the resize

This means that for well estimated plans (reasonably exact tuple count for the inner relation), the hash table is properly sized and no resize is needed. For badly estimated plans, this works about the same as the previous patches (we're accounting for memory needed for buckets etc.).

More batches or fewer buckets?
------------------------------
In the related threads, I repeatedly mentioned that I don't know a good answer to "Should I add more batches or decrease the number of buckets?" while sizing the hash table. The current code (as in HEAD) does not face this dilemma because (a) it does not count buckets into work_mem and (b) does not allow changing nbuckets.

I still don't have a good answer to this question, but I think the patch can stand even without it. If you want more/fewer batches, use smaller/larger work_mem - same as with the current code. Also, with the 'dense allocation' patch [1], it's possible to use larger work_mem values (and thus compensate for counting buckets into work_mem).

Changes since v7
----------------
(a) remove NTUP_GROW_THRESHOLD (just use NTUP_PER_BUCKET directly)
(b) set NTUP_PER_BUCKET=1
(c) include buckets (optimal) when considering nbatch increase
(d) track optimal number of buckets (for NTUP_PER_BUCKET)
(e) after loading all the tuples, resize the hash table to get nbuckets_optimal (if needed)
(f) renamed GUC to enable_hash_resize (makes a bit more sense)

Comments
--------
(a) NTUP_GROW_THRESHOLD was overcomplicated and mostly a result of misunderstanding how NTUP_PER_BUCKET works (it's an upper threshold, not an average) and is not really needed.

(b) The value "1" gives the best performance, but also requires more memory for the buckets. This should be more than compensated by densely allocating the tuples (see hashjoin-alloc-v3.patch in the "tweaking NTUP_PER_BUCKET" thread [1]).

(c,d) Originally, the memory needed for buckets was not included in 'spaceUsed', but because of the NTUP_PER_BUCKET=1 change this is not really acceptable (it needs much more memory than before). So now it's included both in the initial sizing of the hash table, and when adding more tuples (nbuckets_optimal). Still, spaceUsed does not include palloc overhead (which may be a significant amount of memory). This is tackled by [1].

(e) No change here.

(f) A bit better GUC name. Anyway, this is for easier development only, and won't be included in the final patch.

regards
Tomas

[1] http://www.postgresql.org/message-id/53C2DECC.2010408@fuzzy.cz
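[A rough sketch of the accounting and tracking described in (c,d), with invented helper names and an assumed tuple size - the real logic lives in ExecChooseHashTableSize() and ExecHashTableInsert(). With NTUP_PER_BUCKET=1 the bucket-pointer array becomes a noticeable fraction of the budget, so it is charged against work_mem together with the tuples, and the optimal bucket count is tracked as tuples arrive while the actual resize is deferred.]

/* sketch of v8-style accounting; names and structure are illustrative only */
#include <stdint.h>
#include <stdio.h>

#define NTUP_PER_BUCKET 1               /* v8 lowers this from 10 to 1 */

/* smallest power of two >= n (stand-in for the my_log2-based code) */
static uint64_t
next_pow2(uint64_t n)
{
    uint64_t    p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* memory charged against work_mem: tuples plus the bucket-pointer array */
static uint64_t
space_used(uint64_t ntuples, uint64_t tupsize, uint64_t nbuckets)
{
    return ntuples * tupsize + nbuckets * sizeof(void *);
}

/* track the optimal bucket count per inserted tuple; the resize itself is
 * deferred until the build finishes (or is folded into the first rebatch) */
static void
track_optimal_buckets(uint64_t ntuples_in_batch, uint64_t *nbuckets_optimal)
{
    uint64_t    opt = next_pow2(ntuples_in_batch / NTUP_PER_BUCKET + 1);

    if (opt > *nbuckets_optimal)
        *nbuckets_optimal = opt;
}

int
main(void)
{
    uint64_t    nbuckets_optimal = 1024;    /* initial, estimate-based sizing */

    /* feed in the ~3.3M tuples from the plans above, one batch assumed */
    for (uint64_t n = 1; n <= 3310768; n++)
        track_optimal_buckets(n, &nbuckets_optimal);

    printf("optimal buckets: %llu, charged memory: %llu bytes\n",
           (unsigned long long) nbuckets_optimal,
           (unsigned long long) space_used(3310768, 48, nbuckets_optimal));
    return 0;
}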
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
Attached v9 of the patch. Aside from a few minor fixes, the main change is that this is assumed to be combined with the "dense allocation" patch.

It also rewrites ExecHashIncreaseNumBuckets to follow the same pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly, instead of buckets). It's cleaner and more consistent.

hashjoin-nbuckets-growth-v9.patch contains only this patch, so you need to grab hashjoin-alloc-v4.patch from a different thread and apply it first.

hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined.

regards
Tomas Vondra
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 20.7.2014 18:29, Tomas Vondra wrote:
> Attached v9 of the patch. Aside from a few minor fixes, the main change is that this is assumed to be combined with the "dense allocation" patch.
>
> It also rewrites ExecHashIncreaseNumBuckets to follow the same pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly, instead of buckets). It's cleaner and more consistent.
>
> hashjoin-nbuckets-growth-v9.patch contains only this patch, so you need to grab hashjoin-alloc-v4.patch from a different thread and apply it first.
>
> hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined.

I just noticed that the changes made to ExecHashGetBucketAndBatch may not be perfectly fine - it does not "break" the hash join (i.e. the results are still correct), but I believe it may cause more batches. But I'm not entirely sure about it, or how to fix that.

First, an intro into ExecHashGetBucketAndBatch internals, and how the patch changes that. If not interested, skip to the "problem" section below.

The "old" ExecHashGetBucketAndBatch did this:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

i.e. given the 32-bit hash value, the lowest log2_nbuckets bits are used to determine the bucket number. The rest of the hash (after removing the bits used for buckets) is used for the batch. I.e. something like this:

    31               23               15               7                0
    |----------------|----------------|----------------|----------------|
    |              <- batches         |             buckets             |

This worked fine, because the number of bits for buckets was fixed, and only the number of batches was growing. This was done by adding most-significant bits (as illustrated by the tiny arrow). So when there were 4 bits for the batch number, after adding another bit (doubling nbatches) batch '0000' would be split either into '00000' or '10000'.

With a dynamic bucket count this does not work, because the batches and buckets need to be independent (i.e. derived from non-overlapping parts of the hash). The simplest approach of 'moving batches around' does not work, because that would break this assert:

    Assert(batchno > curbatch);

In other words "don't move tuples to already-processed batches". So the batch number for a tuple needs to be sufficiently stable, and only ever increase (never decrease). So what I did is this:

    31               23               15               7                0
    |----------------|----------------|----------------|----------------|
    | batches ->     |                                  |  <- buckets   |

and this is what happens in ExecHashGetBucketAndBatch:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

So the "bucketno" expression is exactly the same (but now the nbuckets may change dynamically), but "batchno" is now computed from the other end of the hash (highest bits), and grows by adding a least-significant bit.

Problem
-------

The original implementation had the nice feature that each increase of the number of batches (nbatches *= 2) split each batch in half. Half the tuples stayed in the batch, half the tuples moved to one of the newly created batches, thanks to adding a most-significant bit. The tuples getting a 0 bit stayed in the same batch, tuples getting a 1 moved to the new one (and it was guaranteed to be one of the new ones). It's slightly more complicated because of the lazy behavior when repeatedly incrementing the number of batches, but this is the principle. Keep 1/2 the tuples, move 1/2 to another batch.

The new implementation changes this, because the batch number grows in the opposite direction - it adds a least-significant bit, so for example batch '1111' gets split into '11111' and '11110'. It's roughly equal to (batchno << 1) + K where K is either 0 or 1, which is always >= the old batchno. So it does not violate the Assert (which is why I haven't noticed this earlier). But it breaks the desirable behavior that 1/2 the tuples remain in the current batch, and instead moves a lot of them to batches further down the road, and needlessly increases the number of batches. At least that's how I understand it ...

Possible solutions
------------------

Adding a least-significant bit does not work, we need to get back to adding the most-significant one. Not sure what's the least complex way to do that, though.

I'm thinking about computing the nbuckets limit (how many buckets may fit into work_mem):

    tupsize = HJTUPLE_OVERHEAD +
              MAXALIGN(sizeof(MinimalTupleData)) +
              MAXALIGN(tupwidth);

    nbuckets_bits_max = my_log2(work_mem / tupsize)

and then start with nbatches from this bit, like this:

    31               23      max      15               7                0
    |----------------|----------------|----------------|----------------|
    |  <- batches    |      free      |           <- buckets            |

That might work, I guess. But maybe there's something better (using some secret bitwise-operator-fu ...).

Further concerns
----------------

I'm wondering whether we should deploy some safe-guards against exceeding the 32 bits in the hash. It probably was not a big issue before, but with large work_mem values and NTUP_PER_BUCKET=1 this might become a problem.

With work_mem=1GB, and 50B tuples (including overhead):

    buckets = 21474836

which means ~25 bits reserved for the bucket number. That leaves only 7 bits for the batch number. Yeah, that's ~128 batches (so a ~128GB hash table), but maybe it's not that far off.

Also, what if the tupwidth estimate is too low? In that case the nbuckets_bits_max would be artificially high (and we'd have to do more batches).

regards
Tomas
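[The arithmetic from the 'possible solutions' and 'further concerns' sections can be reproduced with a few lines of C; the tupsize/nbuckets_bits_max formulas are the ones above, while the hard-coded 1GB work_mem and 50-byte tuple size are just the example values used there.]

/* nbuckets_bits_max.c - reproduces the arithmetic from the message above */
#include <stdint.h>
#include <stdio.h>

/* smallest k such that 2^k >= n, mirroring my_log2() (which computes a ceiling) */
static int
log2_ceil(uint64_t n)
{
    int         bits = 0;
    uint64_t    p = 1;

    while (p < n)
    {
        p <<= 1;
        bits++;
    }
    return bits;
}

int
main(void)
{
    uint64_t    work_mem = 1024ULL * 1024 * 1024;   /* 1GB, as in the example */
    uint64_t    tupsize = 50;                       /* ~50B per tuple incl. overhead */
    uint64_t    max_buckets = work_mem / tupsize;   /* = 21474836 */
    int         nbuckets_bits_max = log2_ceil(max_buckets);    /* = 25 */
    int         batch_bits = 32 - nbuckets_bits_max;           /* = 7  */

    printf("max buckets: %llu (%d bits), %d bits left for batches (~%u batches)\n",
           (unsigned long long) max_buckets, nbuckets_bits_max,
           batch_bits, 1u << batch_bits);
    return 0;
}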
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Sat, Aug 9, 2014 at 9:13 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Adding a least-significant bit does not work, we need to get back to adding the most-significant one. Not sure what's the least complex way to do that, though.
>
> I'm thinking about computing the nbuckets limit (how many buckets may fit into work_mem):
>
>     tupsize = HJTUPLE_OVERHEAD +
>               MAXALIGN(sizeof(MinimalTupleData)) +
>               MAXALIGN(tupwidth);
>
>     nbuckets_bits_max = my_log2(work_mem / tupsize)
>
> and then start with nbatches from this bit, like this:
>
>     31               23      max      15               7                0
>     |----------------|----------------|----------------|----------------|
>     |  <- batches    |      free      |           <- buckets            |

That doesn't seem unreasonable provided there's enough bit space, but, as you point out, the day might not be far off when there isn't.

It also strikes me that when there's only 1 batch, the set of bits that map onto the batch number is zero-width, and one zero-width bit range is as good as another. In other words, if you're only planning to do one batch, you can easily grow the number of buckets on the fly. Growing the number of buckets only becomes difficult once you have more than one batch.

And I mention that because, in the scenario mentioned in your original post ("a hash table with small number of buckets (8192) containing large number of tuples [~3.3M]"), presumably what happens is you start out thinking you are going to have one batch with 8K buckets. Then, as more tuples continue to pour in, you see the load factor rising and respond by bumping up the size of the hash table. Now eventually you realize, gosh, this isn't even going to fit into work_mem, so you need to start batching it. But by that point you've presumably done as much as you're going to do in terms of increasing the number of buckets; after that, you're just going to add more batches as necessary. So not being able to further increase the number of buckets once the number of batches is no longer 1 wouldn't be a problem in that case.

Now if you start out planning for multiple batches, and then you discover you'd like to keep the same number of batches but have more buckets per batch, that's gonna be hard. It's not impossible, because you could steal some of the unused high-order bits above the number of batches, and then concatenate them with the low-order bits that you already had, but that seems like kind of an ugly kludge, and AFAICS it only helps if the new tuples are narrower by enough to justify the cost of splitting all the buckets. I won't say that couldn't happen, but I think it would be better to keep that complexity out of the patch unless we're absolutely certain it's justified.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 11.8.2014 20:25, Robert Haas wrote:
> On Sat, Aug 9, 2014 at 9:13 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Adding a least-significant bit does not work, we need to get back to adding the most-significant one. Not sure what's the least complex way to do that, though.
>>
>> I'm thinking about computing the nbuckets limit (how many buckets may fit into work_mem):
>>
>>     tupsize = HJTUPLE_OVERHEAD +
>>               MAXALIGN(sizeof(MinimalTupleData)) +
>>               MAXALIGN(tupwidth);
>>
>>     nbuckets_bits_max = my_log2(work_mem / tupsize)
>>
>> and then start with nbatches from this bit, like this:
>>
>>     31               23      max      15               7                0
>>     |----------------|----------------|----------------|----------------|
>>     |  <- batches    |      free      |           <- buckets            |
>
> That doesn't seem unreasonable provided there's enough bit space, but, as you point out, the day might not be far off when there isn't.

Thinking about this a bit more, I believe the danger is not that imminent. 32 bits mean the Hash node should handle 2^32 tuples in total (possibly split into multiple batches). It used to be "2^32 buckets" thanks to NTUP_PER_BUCKET=10, but the patch lowers this to 1, so it's "tuples" now.

But while we're a bit closer to exhausting the 32 bits, I think it's not really that big an issue. Not only is hashing a table with ~4e9 rows going to be a hellish experience, but if we really want to support it, we should probably switch to 64-bit hashes. Adding some checks is not going to help - it may detect an issue, but it won't fix it. And actually, even if the two bit ranges overlap, it does not mean the hashjoin will stop working. There'll be a dependency between batches and buckets, causing non-uniform usage of the buckets, but it won't blow up.

So I don't think we need to worry about exhausting the 32 bits for now.

> It also strikes me that when there's only 1 batch, the set of bits that map onto the batch number is zero-width, and one zero-width bit range is as good as another. In other words, if you're only planning to do one batch, you can easily grow the number of buckets on the fly. Growing the number of buckets only becomes difficult once you have more than one batch.

Yes, that's true. It's also roughly what the patch does IMHO. If you know you'll need more batches, you can start with the maximum number of buckets right away (because you expect there's more data than work_mem), so shoot for nbuckets = my_log2(work_mem/tuple_size) or something like that.

It's also true that if you're starting with a single batch, you can do whatever you want with nbuckets until you need to do (nbatches *= 2). It's also true that once you're done with the first batch, all the other batches will be just fine with the number of buckets, because the batches will be about equally sized.

But I think this is pretty much what the patch does, in the end.

> And I mention that because, in the scenario mentioned in your original post ("a hash table with small number of buckets (8192) containing large number of tuples [~3.3M]"), presumably what happens is you start out thinking you are going to have one batch with 8K buckets. Then, as more tuples continue to pour in, you see the load factor rising and respond by bumping up the size of the hash table. Now eventually you realize, gosh, this isn't even going to fit into work_mem, so you need to start batching it. But by that point you've presumably done as much as you're going to do in terms of increasing the number of buckets; after that, you're just going to add more batches as necessary. So not being able to further increase the number of buckets once the number of batches is no longer 1 wouldn't be a problem in that case.
>
> Now if you start out planning for multiple batches, and then you discover you'd like to keep the same number of batches but have more buckets per batch, that's gonna be hard. It's not impossible, because you could steal some of the unused high-order bits above the number of batches, and then concatenate them with the low-order bits that you already had, but that seems like kind of an ugly kludge, and AFAICS it only helps if the new tuples are narrower by enough to justify the cost of splitting all the buckets. I won't say that couldn't happen, but I think it would be better to keep that complexity out of the patch unless we're absolutely certain it's justified.

I'm definitely in favor of keeping the patch as simple as possible.

I was considering reversing the bits of the hash, because that's pretty much the simplest solution. But I think you're right it might actually work like this:

* Are more batches needed?

  (yes) => just use nbuckets = my_log2(work_mem / tuple_size)

  (no) => go ahead, until processing all tuples or hitting work_mem

         (work_mem) => meh, use the same nbuckets above

         (all tuples) => compute optimal nbuckets / resize

But I need to think about this a bit. So far it seems to me there's no way additional batches might benefit from increasing nbuckets further. Each batch should contain about the same number of tuples, or more correctly, the same number of distinct hash values. But if there are multiple tuples with the same hash value, increasing the number of buckets is not going to help (they're still going to end up in the same bucket). And the number of tuples in a batch may only get lower while adding batches (e.g. if a batch somewhere in the middle gets a lot of tuples with the same hash value, exceeding work_mem).

I need to think about this a bit more, but so far it seems like a good clean solution to me.

regards
Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 12.8.2014 00:30, Tomas Vondra wrote:
> On 11.8.2014 20:25, Robert Haas wrote:
>> It also strikes me that when there's only 1 batch, the set of bits that map onto the batch number is zero-width, and one zero-width bit range is as good as another. In other words, if you're only planning to do one batch, you can easily grow the number of buckets on the fly. Growing the number of buckets only becomes difficult once you have more than one batch.
>
> ...
>
> I was considering reversing the bits of the hash, because that's pretty much the simplest solution. But I think you're right it might actually work like this:
>
> * Are more batches needed?
>
>   (yes) => just use nbuckets = my_log2(work_mem / tuple_size)
>
>   (no) => go ahead, until processing all tuples or hitting work_mem
>
>          (work_mem) => meh, use the same nbuckets above
>
>          (all tuples) => compute optimal nbuckets / resize
>
> But I need to think about this a bit. So far it seems to me there's no way additional batches might benefit from increasing nbuckets further.

I think this is a simple and solid solution, solving the batchno computation issues quite nicely. Attached is the v10 patch (bare and combined with the dense allocation), which does this:

1) when we know we'll need batching, buckets are sized for full work_mem (using the estimated tuple width, etc.)

2) without the batching, we estimate the 'right number of buckets' for the estimated number of tuples, and keep track of the optimal number as tuples are added to the hash table

   - if we discover we need to start batching, we keep the current optimal value (which should be the same as the max number of buckets) and don't mess with it anymore (making it possible to compute batch IDs just like before)

   - also, on the first rebatch (nbatch=1 => nbatch=2) the hash table is resized as part of the rebatch

   - if the hash build completes without batching, we do the resize

I believe the patch is pretty much perfect. I plan to do more thorough testing on a wide range of queries in the next few days.

I also removed the 'enable_hash_resize' GUC, because it would be more complex to implement this properly after doing the resize as part of the rebatch etc. So either it would make the patch more complex, or it wouldn't do what the name promises.

regards
Tomas
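[The two cases in 1) and 2) boil down to a short decision sketch; the helper names, the boolean flag and the example numbers are invented for illustration, only the decision itself follows the description above (61735 is the estimated row count from the plans earlier in the thread).]

/* sketch of the v10 sizing decision; illustrative names, not the patch code */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* smallest power of two >= n */
static uint64_t
next_pow2(uint64_t n)
{
    uint64_t    p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

static uint64_t
initial_nbuckets(uint64_t est_tuples, uint64_t tupsize,
                 uint64_t work_mem, bool batching_expected)
{
    if (batching_expected)
    {
        /* case 1): size the buckets for a full work_mem worth of tuples */
        return next_pow2(work_mem / tupsize);
    }

    /* case 2): size for the estimate; the optimum is tracked during the
     * build and the table is resized at the end (or on the first rebatch) */
    return next_pow2(est_tuples);
}

int
main(void)
{
    uint64_t    work_mem = 16ULL * 1024 * 1024;     /* assumed 16MB budget */
    uint64_t    tupsize = 48;                       /* assumed tuple size */

    printf("batching expected: %llu buckets\n",
           (unsigned long long) initial_nbuckets(61735, tupsize, work_mem, true));
    printf("single-batch est.: %llu buckets\n",
           (unsigned long long) initial_nbuckets(61735, tupsize, work_mem, false));
    return 0;
}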
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 12.8.2014 00:30, Tomas Vondra wrote:
>> On 11.8.2014 20:25, Robert Haas wrote:
>>> It also strikes me that when there's only 1 batch, the set of bits that map onto the batch number is zero-width, and one zero-width bit range is as good as another. In other words, if you're only planning to do one batch, you can easily grow the number of buckets on the fly. Growing the number of buckets only becomes difficult once you have more than one batch.
>>
>> ...
>>
>> I was considering reversing the bits of the hash, because that's pretty much the simplest solution. But I think you're right it might actually work like this:
>>
>> * Are more batches needed?
>>
>>   (yes) => just use nbuckets = my_log2(work_mem / tuple_size)
>>
>>   (no) => go ahead, until processing all tuples or hitting work_mem
>>
>>          (work_mem) => meh, use the same nbuckets above
>>
>>          (all tuples) => compute optimal nbuckets / resize
>>
>> But I need to think about this a bit. So far it seems to me there's no way additional batches might benefit from increasing nbuckets further.
>
> I think this is a simple and solid solution, solving the batchno computation issues quite nicely. Attached is the v10 patch (bare and combined with the dense allocation), which does this:
>
> 1) when we know we'll need batching, buckets are sized for full work_mem (using the estimated tuple width, etc.)
>
> 2) without the batching, we estimate the 'right number of buckets' for the estimated number of tuples, and keep track of the optimal number as tuples are added to the hash table
>
>    - if we discover we need to start batching, we keep the current optimal value (which should be the same as the max number of buckets) and don't mess with it anymore (making it possible to compute batch IDs just like before)
>
>    - also, on the first rebatch (nbatch=1 => nbatch=2) the hash table is resized as part of the rebatch
>
>    - if the hash build completes without batching, we do the resize
>
> I believe the patch is pretty much perfect. I plan to do more thorough testing on a wide range of queries in the next few days.
>
> I also removed the 'enable_hash_resize' GUC, because it would be more complex to implement this properly after doing the resize as part of the rebatch etc. So either it would make the patch more complex, or it wouldn't do what the name promises.

A variety of trivial comments on this:

PostgreSQL style is un-cuddled curly braces. Also, multi-line comments need to start with a line containing only /* and end with a line containing only */. In one place you've added curly braces around a single-line block that is otherwise unmodified; please don't do that. In one place, you have "becase" instead of "because". In another place, you write "add if after it" but it should say "add it after it" or maybe better "add the new one after it". Avoid using punctuation like "=>" in comments to illustrate the connection between sentences; instead, use a connecting word like "then" or "therefore" or whatever is appropriate; in this instance, a period followed by the start of a new sentence seems sufficient. Revert the removal of a single line of whitespace near the top of nodeHash.c.

There are too many things marked XXX in this patch. They should either be fixed, if they are real problems, or they should be commented in a way that doesn't give rise to the idea that they're problems if they aren't.

OK, now on to some more substantive stuff:

1. It's not clear to me what the overall effect of this patch on memory utilization is. Reducing NTUP_PER_BUCKET from 10 to 1 is going to use, on the average, 10x as much bucket-header memory per tuple. Specifically, I think it means we'll use about 8 bytes of bucket-header memory per tuple instead of 0.8 bytes per tuple. If the tuples are narrow, that could be significant; concerns have been expressed about that here in the past. Increasing the number of buckets could also increase memory usage. On the other hand, the dense allocation stuff probably saves a ton of memory, so maybe we end up ahead overall, but I'm not sure. Your thoughts, and maybe some test results with narrow and wide tuples, would be appreciated.

2. But, on the positive side, modulo the memory utilization questions mentioned above, I would expect the impact on hash join performance to be positive. Going from 10 tuples per bucket to just 1 should help, and on cases where the actual load factor would have ended up much higher because of poor estimation, increasing the number of buckets on the fly should help even more. I haven't tested this, though.

I haven't had a chance to completely go through this yet, so these are just some initial thoughts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 19.8.2014 19:05, Robert Haas wrote: > On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra <tv@fuzzy.cz> wrote: >> On 12.8.2014 00:30, Tomas Vondra wrote: >>> On 11.8.2014 20:25, Robert Haas wrote: >>>> It also strikes me that when there's only 1 batch, the set of bits >>>> that map onto the batch number is zero-width, and one zero-width bit >>>> range is as good as another. In other words, if you're only planning >>>> to do one batch, you can easily grow the number of buckets on the fly. >>>> Growing the number of buckets only becomes difficult once you have >>>> more than one batch. >> >> ... >> >>> I was considering using reversing the bits of the hash, because that's >>> pretty much the simplest solution. But I think you're right it might >>> actually work like this: >>> >>> * Are more batches needed? >>> >>> (yes) => just use nbuckets = my_log2(work_mem / tuple_size) >>> >>> (no) => go ahead, until processing all tuples or hitting work_mem >>> >>> (work_mem) => meh, use the same nbuckets above >>> >>> (all tuples) => compute optimal nbuckets / resize >>> >>> >>> But I need to think about this a bit. So far it seems to me there's no >>> way additional batches might benefit from increasing nbuckets further. >> >> I think this is a simple and solid solution, solving the batchno >> computation issues quite nicely. Attached is v10 patch (bare and >> combined with the dense allocation), that does this: >> >> 1) when we know we'll need batching, buckets are sized for full work_mem >> (using the estimated tuple width, etc.) >> >> 2) without the batching, we estimate the 'right number of buckets' for >> the estimated number of tuples, and keep track of the optimal number >> as tuples are added to the hash table >> >> - if we discover we need to start batching, we keep the current >> optimal value (which should be the same as the max number of >> buckets) and don't mess with it anymore (making it possible to >> compute batch IDs just like before) >> >> - also, on the fist rebatch (nbatch=1 => nbatch=2) the hash table >> is resized as part of the rebatch >> >> - if the hash build completes without batching, we do the resize >> >> I believe the patch is pretty much perfect. I plan to do more thorough >> testing on a wide range of queries in the next few days. >> >> I also removed the 'enable_hash_resize' GUC, because it would be more >> complex to implement this properly after doing the resize as part of >> rebatch etc.. So either it would make the patch more complex, or it >> wouldn't do what the name promises. > > A variety of trivial comments on this: > > PostgreSQL style is un-cuddled curly braces. Also, multi-line > comments need to start with a line containing only /* and end with a > line containing only */. In one place you've added curly braces > around a single-line block that is otherwise unmodified; please don't > do that. In one place, you have "becase" instead of "because". In > another place, you write "add if after it" but it should say "add it > after it" or maybe better "add the new one after it". Avoid using > punctuation like "=>" in comments to illustrate the connection between > sentences; instead, use a connecting word like "then" or "therefore" > or whatever is appropriate; in this instance, a period followed by the > start of a new sentence seems sufficient. Revert the removal of a > single line of whitespace near the top of nodeHash.c. > > There are too many things marked XXX in this patch. 
They should > either be fixed, if they are real problems, or they should be > commented in a way that doesn't give rise to the idea that they're > problems if they aren't. OK, thanks for pointing this out. Attached is v11 of the patch (both separate and combined with the dense allocation, as before). I fixed as many of those issues as possible. All the XXX items were obsolete, except for one in the chunk_alloc function. I have also removed one constant > > OK, now on to some more substantive stuff: > > 1. It's not clear to me what the overall effect of this patch on > memory utilization is. Reducing NTUP_PER_BUCKET from 10 to 1 is going > to use, on the average, 10x as much bucket-header memory per tuple. > Specifically, I think it means we'll use about 8 bytes of > bucket-header memory per tuple instead of 0.8 bytes per tuple. If the > tuples are narrow, that could be significant; concerns have been > expressed about that here in the past. Increasing the number of > buckets could also increase memory usage. On the other hand, the > dense allocation stuff probably saves a ton of memory, so maybe we end > up overall, but I'm not sure. Your thoughts, and maybe some test > results with narrow and wide tuples, would be appreciated. The effect of the dense allocation was briefly discussed in this thread, along with some quick measurements: http://www.postgresql.org/message-id/53BEEA9E.2080009@fuzzy.cz The dense allocation removes pretty much all the palloc overhead. For a 40B tuple, I did get this before the dense allocation HashBatchContext: 1451221040 total in 182 blocks; 2826592 free (11 chunks); 1448394448 used and this with the dense allocation patch applied HashBatchContext: 729651520 total in 21980 blocks; 0 free (0 chunks); 729651520 used So it's pretty much 50% reduction in memory consumption. Now, the palloc header is >=16B, and removing this more than compensates the increased number of buckets (in the worst case, there may be ~2x buckets per tuple, which is 2x8B pointers). I did a number of tests, and the results are quite consistent with this. > 2. But, on the positive side, modulo the memory utilization questions > mentioned above, I would expect the impact on hash join performance to > be positive. Going from 10 tuples per bucket to just 1 should help, > and on cases where the actual load factor would have ended up much > higher because of poor estimation, increasing the number of buckets on > the fly should help even more. I haven't tested this, though. Yes, that's the results I was getting. I've done a number of tests, and not a single one showed a slowdown so far. Most well-estimated queries were 2-3x faster, and the poorly estimated ones were orders of magnitude faster. We're working on getting this fixed on a range of workloads, I'll post some results once I have that. > I haven't had a chance to completely go through this yet, so these are > just some initial thoughts. Thanks! Tomas
Attachment
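To make the "pretty much 50% reduction" above concrete, here is a standalone sketch of the per-tuple overhead being removed; the 16-byte chunk header and power-of-two rounding are simplifying assumptions about the allocator, not exact aset.c numbers, and the exact ratio depends on how close the tuple size is to a power of two:

    #include <stddef.h>
    #include <stdio.h>

    #define PALLOC_HEADER 16            /* assumed per-allocation bookkeeping */

    static size_t
    next_pow2(size_t n)
    {
        size_t  p = 8;

        while (p < n)
            p <<= 1;
        return p;
    }

    int
    main(void)
    {
        size_t  tuple = 40 + 32;        /* 40B of data plus assumed tuple headers */
        size_t  palloced = PALLOC_HEADER + next_pow2(tuple);
        size_t  dense = tuple;          /* packed back to back into a shared chunk */

        printf("per tuple, palloc'd separately: %zu bytes\n", palloced);
        printf("per tuple, dense-packed:        %zu bytes\n", dense);
        return 0;
    }

With these assumptions the separately palloc'd tuple costs twice as much as the dense-packed one, which is in the same ballpark as the HashBatchContext numbers quoted above.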
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
Hi everyone, as Heikki mentioned in his "commitfest status" message, this patch still has no reviewers :-( Is there anyone willing to pick that up, together with the 'dense allocation' patch (as those two are closely related)? I'm looking in Robert's direction, as he's the one who came up with the dense allocation idea, and also commented on the hashjoin bucket resizing patch ... I'd ask Pavel Stehule, but as he's sitting next to me in the office he's not really independent ;-) I'll do some reviews on the other patches over the weekend, to balance this of course. regards Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > as Heikki mentioned in his "commitfest status" message, this patch still > has no reviewers :-( Is there anyone willing to pick up that, together > with the 'dense allocation' patch (as those two are closely related)? > > I'm looking in Robert's direction, as he's the one who came up with the > dense allocation idea, and also commented on the hasjoin bucket resizing > patch ... > > I'd ask Pavel Stehule, but as he's sitting next to me in the office he's > not really independent ;-) I'll do some reviews on the other patches > over the weekend, to balance this of course. Will any of those patches happen to be, heh heh, mine? I am a bit confused by the relationship between the two patches you posted. The "combined" patch applies, but the other one does not. If this is a sequence of two patches, it seems like it would be more helpful to post A and B rather than B and A+B. Perhaps the other patch is on a different thread but there's not an obvious reference to said thread here. In ExecChooseHashTableSize(), if buckets_bytes is independent of nbatch, could we just compute it before working out dbatch, and just deduct it from hash_table_bytes? That seems like it'd do the same thing for less work. Either way, what happens if buckets_bytes is already bigger than hash_table_bytes? A few more stylistic issues that I see: + if ((hashtable->nbatch == 1) && + if (hashtable->nbuckets_optimal <= (INT_MAX/2)) + if (size > (HASH_CHUNK_SIZE/8)) While I'm all in favor of using parentheses generously where it's needed to add clarity, these and similar usages seem over the top to me. On a related note, the first of these reads like this if (stuff) { if (more stuff) { do things } } which makes one wonder why we need two if statements. + + /* used for dense allocation of tuples (into linked chunks) */ + HashChunk chunks_batch; /* one list for the whole batch */ +} HashJoinTableData; If the original coding didn't have a blank line between the last structure member and the closing brace, it's probably not right to make it that way now. There are similar issues at the end of some functions you modified, and a few other places (like the new code in ExecChooseHashTableSize and chunk_alloc) where there are extra blank lines at the starts and ends of blocks. +char * chunk_alloc(HashJoinTable hashtable, int size) { Eh... no. + /* XXX maybe we should use MAXALIGN(size) here ... */ +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 8.9.2014 22:44, Robert Haas wrote: > On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> as Heikki mentioned in his "commitfest status" message, this patch >> still has no reviewers :-( Is there anyone willing to pick up that, >> together with the 'dense allocation' patch (as those two are >> closely related)? >> >> I'm looking in Robert's direction, as he's the one who came up with >> the dense allocation idea, and also commented on the hasjoin bucket >> resizing patch ... >> >> I'd ask Pavel Stehule, but as he's sitting next to me in the office >> he's not really independent ;-) I'll do some reviews on the other >> patches over the weekend, to balance this of course. > > Will any of those patches to be, heh heh, mine? I'll exchange some of the credit for +1. Deal? ;-) > I am a bit confused by the relationship between the two patches you > posted. The "combined" patch applies, but the other one does not. If > this is a sequence of two patches, it seems like it would be more > helpful to post A and B rather than B and A+B. Perhaps the other > patch is on a different thread but there's not an obvious reference > to said thread here. Yeah, it's probably a bit confusing. The thing is the "dense allocation" idea was discussed in a different thread, so I posted the patch there: http://www.postgresql.org/message-id/53CBEB8A.207@fuzzy.cz The patch tweaking hash join buckets builds on the dense allocation, because it (a) makes up for the memory used for more buckets and (b) it's actually easier to rebuild the buckets this way. So I only posted the separate patch for those who want to do a review, and then a "complete patch" with both parts combined. But it sure may be a bit confusing. > In ExecChooseHashTableSize(), if buckets_bytes is independent of > nbatch, could we just compute it before working out dbatch, and just > deduct it from hash_table_bytes? That seems like it'd do the same > thing for less work. Well, we can do that. But I really don't think that's going to make measurable difference. And I think the code is more clear this way. > Either way, what happens if buckets_bytes is already bigger than > hash_table_bytes? Hm, I don't see how that could happen. FWIW, I think the first buckets_bytes formula is actually wrong: buckets_bytes = sizeof(HashJoinTuple) * my_log2(ntuples / NTUP_PER_BUCKET); and should be buckets_bytes = sizeof(HashJoinTuple) * (1 << my_log2(ntuples / NTUP_PER_BUCKET)); instead. Also, we might consider that we never use less than 1024 buckets (but that's minor issue, I guess). But once we get into the "need batching" branch, we do this: lbuckets = 1 << my_log2(hash_table_bytes / (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple))); which includes both the bucket (pointer) and tuple size, and IMHO guarantees that bucket_bytes will never be over work_mem (which is what hash_table_bytes is). The only case where this might happen is (tupsize < 8), thanks to the my_log2 (getting (50% work_mem + epsilon), doubled to >100% work_mem). But tupsize is defined as this: tupsize = HJTUPLE_OVERHEAD + MAXALIGN(sizeof(MinimalTupleData)) + MAXALIGN(tupwidth); and HJTUPLE_OVERHEAD alone is 12B, MinimalTupleData is >11B (ignoring alignment) etc. So I believe this is safe ... 
> A few more stylistic issues that I see: > > + if ((hashtable->nbatch == 1) && > + if (hashtable->nbuckets_optimal <= (INT_MAX/2)) > + if (size > (HASH_CHUNK_SIZE/8)) > > While I'm all in favor of using parentheses generously where it's > needed to add clarity, these and similar usages seem over the top to > me. It seemed more readable for me at the time of coding it, and it still seems better this way, but ... https://www.youtube.com/watch?v=CxK_nA2iVXw > On a related note, the first of these reads like this if (stuff) { if > (more stuff) { do things } } which makes one wonder why we need two if > statements. We probably don't ... > > + > + /* used for dense allocation of tuples (into linked chunks) */ > + HashChunk chunks_batch; /* one list for the whole batch */ > + > } HashJoinTableData; > > If the original coding didn't have a blank line between the last > structure member and the closing brace, it's probably not right to > make it that way now. There are similar issues at the end of some > functions you modified, and a few other places (like the new code in > ExecChooseHashTableSize and chunk_alloc) where there are extra blank > lines at the starts and ends of blocks. I'll fix that. FWIW, those issues seem to be in the 'dense allocation' patch. > > +char * chunk_alloc(HashJoinTable hashtable, int size) { > > Eh... no. > > + /* XXX maybe we should use MAXALIGN(size) here ... */ > > +1. I'm not sure what the 'no' is pointing at? Maybe that the parenthesis should be on the next line? And is the +1 about doing MAXALIGN? regards Tomas
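A standalone sketch contrasting the two buckets_bytes formulas discussed above; my_log2() below mirrors the ceiling-log2 helper used for bucket counts, and the 8-byte pointer size is an assumption for 64-bit builds:

    #include <stdio.h>

    static int
    my_log2(long num)
    {
        int     i;
        long    limit;

        for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
            ;
        return i;
    }

    int
    main(void)
    {
        long    ntuples = 1000000;
        long    nwanted = ntuples / 1;      /* NTUP_PER_BUCKET = 1 */
        size_t  ptr = 8;                    /* assumed sizeof(HashJoinTuple) */

        /* first formula: multiplies by the exponent, not the bucket count */
        printf("exponent only: %zu bytes\n", ptr * my_log2(nwanted));
        /* corrected formula: multiplies by the rounded-up bucket count */
        printf("bucket count:  %zu bytes\n", ptr * ((size_t) 1 << my_log2(nwanted)));
        return 0;
    }

For a million tuples the first version charges only 160 bytes for the bucket array, while the corrected one charges about 8MB, which is the kind of difference that matters when deciding whether to batch.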
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > So I only posted the separate patch for those who want to do a review, > and then a "complete patch" with both parts combined. But it sure may be > a bit confusing. Let's do this: post each new version of the patches only on this thread, as a series of patches that can be applied one after another. >> In ExecChooseHashTableSize(), if buckets_bytes is independent of >> nbatch, could we just compute it before working out dbatch, and just >> deduct it from hash_table_bytes? That seems like it'd do the same >> thing for less work. > > Well, we can do that. But I really don't think that's going to make > measurable difference. And I think the code is more clear this way. Really? It seems to me that you're doing more or less the same calculation that's already been done a second time. It took me 20 minutes of staring to figure out what it was really doing. Now admittedly I was a little low on caffeine, but... >> Either way, what happens if buckets_bytes is already bigger than >> hash_table_bytes? > > Hm, I don't see how that could happen. How about an Assert() and a comment, then? >> A few more stylistic issues that I see: >> >> + if ((hashtable->nbatch == 1) && >> + if (hashtable->nbuckets_optimal <= (INT_MAX/2)) >> + if (size > (HASH_CHUNK_SIZE/8)) >> >> While I'm all in favor of using parentheses generously where it's >> needed to add clarity, these and similar usages seem over the top to >> me. > > It seemed more readable for me at the time of coding it, and it still > seems better this way, but ... > > https://www.youtube.com/watch?v=CxK_nA2iVXw Heh. Well, at any rate, I think PostgreSQL style is to not use parentheses in that situation. >> +char * chunk_alloc(HashJoinTable hashtable, int size) { >> >> Eh... no. >> >> + /* XXX maybe we should use MAXALIGN(size) here ... */ >> >> +1. > > I'm not sure what's the 'no' pointing at? Maybe that the parenthesis > should be on the next line? And is the +1 about doing MAXALING? The "no" is about the fact that you have the return type, the function name, and the opening brace on one line instead of three lines as is project style. The +1 is for applying MAXALIGN to the size. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
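For reference, the same declaration laid out the way the project style described above asks for (return type, function name, and opening brace each on their own line), with MAXALIGN applied to the request. The body is a placeholder sketch, not the patch's actual code, and it assumes the usual nodeHash.c context where HashJoinTable and MAXALIGN are already in scope:

    char *
    chunk_alloc(HashJoinTable hashtable, int size)
    {
        /* keep stored tuples properly aligned */
        size = MAXALIGN(size);

        /* ... carve the space out of the current chunk, or start a new one ... */
        return NULL;    /* placeholder only */
    }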
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 9.9.2014 16:09, Robert Haas wrote: > On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> So I only posted the separate patch for those who want to do a review, >> and then a "complete patch" with both parts combined. But it sure may be >> a bit confusing. > > Let's do this: post each new version of the patches only on this > thread, as a series of patches that can be applied one after another. OK, attached. Apply in this order 1) dense-alloc-v5.patch 2) hashjoin-nbuckets-v12.patch >>> In ExecChooseHashTableSize(), if buckets_bytes is independent of >>> nbatch, could we just compute it before working out dbatch, and just >>> deduct it from hash_table_bytes? That seems like it'd do the same >>> thing for less work. >> >> Well, we can do that. But I really don't think that's going to make >> measurable difference. And I think the code is more clear this way. > > Really? It seems to me that you're doing more or less the same > calculation that's already been done a second time. It took me 20 > minutes of staring to figure out what it was really doing. Now > admittedly I was a little low on caffeine, but... I reworked this part a bit, to make it easier to understand. Mostly following your recommendations. > >>> Either way, what happens if buckets_bytes is already bigger than >>> hash_table_bytes? >> >> Hm, I don't see how that could happen. > > How about an Assert() and a comment, then? Done. According to the comment, we should never really get over 25%, except maybe for strange work_mem values, because we're keeping nbuckets as 2^N. But even then we should not reach 50%, so I added this assert: Assert(buckets_bytes <= hash_table_bytes/2); And then we subtract the buckets_bytes, and continue with the loop. > >>> A few more stylistic issues that I see: >>> >>> + if ((hashtable->nbatch == 1) && >>> + if (hashtable->nbuckets_optimal <= (INT_MAX/2)) >>> + if (size > (HASH_CHUNK_SIZE/8)) >>> >>> While I'm all in favor of using parentheses generously where it's >>> needed to add clarity, these and similar usages seem over the top to >>> me. >> >> It seemed more readable for me at the time of coding it, and it still >> seems better this way, but ... >> >> https://www.youtube.com/watch?v=CxK_nA2iVXw > > Heh. Well, at any rate, I think PostgreSQL style is to not use > parentheses in that situation. FWIW, I added HASH_CHUNK_THRESHOLD macro, specifying the threshold. I also simplified the conditions a bit, and removed some of the parentheses. > >>> +char * chunk_alloc(HashJoinTable hashtable, int size) { >>> >>> Eh... no. >>> >>> + /* XXX maybe we should use MAXALIGN(size) here ... */ >>> >>> +1. >> >> I'm not sure what's the 'no' pointing at? Maybe that the parenthesis >> should be on the next line? And is the +1 about doing MAXALING? > > The "no" is about the fact that you have the return type, the function > name, and the opening brace on one line instead of three lines as is > project style. The +1 is for applying MAXALIGN to the size. OK, fixed. I also did a few 'minor' changes to the dense allocation patch, most notably: * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData The original naming seemed a bit awkward. * renamed the function to 'dense_alloc' (instead of 'chunk_alloc') Seems like a better fit to what the function does. * the chunks size is 32kB (instead of 16kB), and we're using 1/4 threshold for 'oversized' items We need the threshold to be >=8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. 
I also considered Heikki's suggestion that while rebatching, we can reuse chunks with a single large tuple, instead of copying it needlessly. My attempt however made the code much uglier (I almost used a GOTO, but my hands refused to type it ...). I'll look into that. Tomas
Attachment
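For readers who have not seen the patch, a self-contained sketch of the allocation scheme described above: 32kB chunks, with items larger than a quarter of a chunk getting a dedicated, exactly-sized chunk. It uses plain malloc and a file-level list instead of the hash table's memory context, so it only illustrates the bookkeeping, not the actual dense_alloc():

    #include <stddef.h>
    #include <stdlib.h>

    #define CHUNK_SIZE       (32 * 1024)
    #define CHUNK_THRESHOLD  (CHUNK_SIZE / 4)

    typedef struct Chunk
    {
        struct Chunk *next;         /* linked list of chunks for this batch */
        size_t        used;         /* bytes already handed out from data[] */
        size_t        maxlen;       /* usable bytes in data[] */
        char          data[];       /* tuples are packed densely in here */
    } Chunk;

    static Chunk *chunks = NULL;    /* head is always the chunk being filled */

    static void *
    dense_alloc(size_t size)        /* malloc failures not handled in this sketch */
    {
        Chunk  *chunk;

        /* oversized items get a dedicated, exactly-sized chunk */
        if (size > CHUNK_THRESHOLD)
        {
            chunk = malloc(offsetof(Chunk, data) + size);
            chunk->used = chunk->maxlen = size;

            /* keep the partially filled chunk at the head, add the new one after it */
            if (chunks != NULL)
            {
                chunk->next = chunks->next;
                chunks->next = chunk;
            }
            else
            {
                chunk->next = NULL;
                chunks = chunk;
            }
            return chunk->data;
        }

        /* start a new shared chunk if the current one has no room left */
        if (chunks == NULL || chunks->used + size > chunks->maxlen)
        {
            chunk = malloc(offsetof(Chunk, data) + CHUNK_SIZE);
            chunk->used = 0;
            chunk->maxlen = CHUNK_SIZE;
            chunk->next = chunks;
            chunks = chunk;
        }

        chunks->used += size;
        return chunks->data + chunks->used - size;
    }

A caller would memcpy the (MAXALIGNed) tuple into the returned pointer; freeing the whole batch is then a walk over the chunk list rather than one pfree per tuple.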
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Heikki Linnakangas
Date:
On 09/10/2014 01:49 AM, Tomas Vondra wrote: > On 9.9.2014 16:09, Robert Haas wrote: >> On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>> So I only posted the separate patch for those who want to do a review, >>> and then a "complete patch" with both parts combined. But it sure may be >>> a bit confusing. >> >> Let's do this: post each new version of the patches only on this >> thread, as a series of patches that can be applied one after another. > > OK, attached. Apply in this order > > 1) dense-alloc-v5.patch > 2) hashjoin-nbuckets-v12.patch The dense-alloc-v5.patch looks good to me. I have committed that with minor cleanup (more comments below). I have not looked at the second patch. > I also did a few 'minor' changes to the dense allocation patch, most > notably: > > * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData > The original naming seemed a bit awkward. That's too easy to confuse with regular MemoryContext and AllocChunk stuff. I renamed it to HashMemoryChunk. > * the chunks size is 32kB (instead of 16kB), and we're using 1/4 > threshold for 'oversized' items > > We need the threshold to be >=8kB, to trigger the special case > within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. > I also considered Heikki's suggestion that while rebatching, we can > reuse chunks with a single large tuple, instead of copying it > needlessly. My attempt however made the code much uglier (I almost used > a GOTO, but my hands rejected to type it ...). I'll look into that. I tried constructing a test case where the rehashing time would be significant enough for this to matter, but I couldn't find one. Even if the original estimate on the number of batches is way too small, we ramp up quickly to a reasonable size and the rehashing time is completely insignificant compared to all the other work. So I think we can drop that idea. - Heikki
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > The dense-alloc-v5.patch looks good to me. I have committed that with minor > cleanup (more comments below). I have not looked at the second patch. Gah. I was in the middle of doing this. Sigh. >> * the chunks size is 32kB (instead of 16kB), and we're using 1/4 >> threshold for 'oversized' items >> >> We need the threshold to be >=8kB, to trigger the special case >> within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. > > Should we care about the fact that if there are only a few tuples, we will > nevertheless waste 32kB of memory for the chunk? I guess not, but I thought > I'd mention it. The smallest allowed value for work_mem is 64kB. I think we should change the threshold here to 1/8th. The worst case memory wastage as-is ~32k/5 > 6k. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Heikki Linnakangas
Date:
On 09/10/2014 09:31 PM, Robert Haas wrote: > On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> The dense-alloc-v5.patch looks good to me. I have committed that with minor >> cleanup (more comments below). I have not looked at the second patch. > > Gah. I was in the middle of doing this. Sigh. Sorry. >>> * the chunks size is 32kB (instead of 16kB), and we're using 1/4 >>> threshold for 'oversized' items >>> >>> We need the threshold to be >=8kB, to trigger the special case >>> within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. >> >> Should we care about the fact that if there are only a few tuples, we will >> nevertheless waste 32kB of memory for the chunk? I guess not, but I thought >> I'd mention it. The smallest allowed value for work_mem is 64kB. > > I think we should change the threshold here to 1/8th. The worst case > memory wastage as-is ~32k/5 > 6k. Not sure how you arrived at that number. The worst case is that each 32k chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each chunk. That's 25% wastage. That's not too bad IMHO - the worst case waste of AllocSet is 50%. But if you want to twiddle it, feel free. Note that there's some interplay with allocset code, like Tomas mentioned. If the threshold is < 8k, palloc() will round up request sizes smaller than 8k anyway. That wastes some memory, nothing more serious than that, but it seems like a good idea to avoid it. - Heikki
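A quick standalone check of the worst-case figure above, with every 32kB chunk filled to just past the 24kB point so that no further sub-8kB item fits:

    #include <stdio.h>

    int
    main(void)
    {
        double  chunk = 32 * 1024;
        double  filled = 24 * 1024 + 1;     /* just past the 3/4 mark */
        double  wasted = chunk - filled;

        printf("wasted per chunk: %.0f bytes (%.1f%%)\n",
               wasted, 100.0 * wasted / chunk);
        return 0;
    }

That prints 8191 bytes, i.e. the 25% bound mentioned above.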
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 10.9.2014 20:25, Heikki Linnakangas wrote: > On 09/10/2014 01:49 AM, Tomas Vondra wrote: >> I also did a few 'minor' changes to the dense allocation patch, most >> notably: >> >> * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData >> The original naming seemed a bit awkward. > > That's too easy to confuse with regular MemoryContext and AllocChunk > stuff. I renamed it to HashMemoryChunk. OK. > >> * the chunks size is 32kB (instead of 16kB), and we're using 1/4 >> threshold for 'oversized' items >> >> We need the threshold to be >=8kB, to trigger the special case >> within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. > > Should we care about the fact that if there are only a few tuples, we > will nevertheless waste 32kB of memory for the chunk? I guess not, but I > thought I'd mention it. The smallest allowed value for work_mem is 64kB. I don't think that's a problem. >> I also considered Heikki's suggestion that while rebatching, we can >> reuse chunks with a single large tuple, instead of copying it >> needlessly. My attempt however made the code much uglier (I almost used >> a GOTO, but my hands rejected to type it ...). I'll look into that. > > I tried constructing a test case where the rehashing time would be > significant enough for this to matter, but I couldn't find one. Even > if the original estimate on the number of batches is way too small, > we ramp up quickly to a reasonable size and the rehashing time is > completely insignificant compared to all the other work. So I think > we can drop that idea. I don't think that had anything to do with rehashing. What you pointed out is that when rebatching, we have to keep ~50% of the tuples, which means 'copy into a new chunk, then throw away the old ones'. For large tuples (you mentioned 100MB tuples, IIRC), we needlessly allocate this amount of memory only to copy the single tuple and then throw away the old tuple. So (a) that's an additional memcpy, and (b) it allocates additional 100MB of memory for a short period of time. Now, I'd guess when dealing with tuples this large, there will be more serious trouble elsewhere, but I don't want to neglect it. Maybe it's worth optimizing? Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 10.9.2014 20:31, Robert Haas wrote: > On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> The dense-alloc-v5.patch looks good to me. I have committed that with minor >> cleanup (more comments below). I have not looked at the second patch. > > Gah. I was in the middle of doing this. Sigh. > >>> * the chunks size is 32kB (instead of 16kB), and we're using 1/4 >>> threshold for 'oversized' items >>> >>> We need the threshold to be >=8kB, to trigger the special case >>> within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. >> >> Should we care about the fact that if there are only a few tuples, we will >> nevertheless waste 32kB of memory for the chunk? I guess not, but I thought >> I'd mention it. The smallest allowed value for work_mem is 64kB. > > I think we should change the threshold here to 1/8th. The worst case > memory wastage as-is ~32k/5 > 6k. So you'd lower the threshold to 4kB? That may lower the wastage in the chunks, but palloc will actually allocate 8kB anyway, wasting up to additional 4kB. So I don't see how lowering the threshold to 1/8th improves the situation ... Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Wed, Sep 10, 2014 at 3:02 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 10.9.2014 20:31, Robert Haas wrote: >> On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas >> <hlinnakangas@vmware.com> wrote: >>> The dense-alloc-v5.patch looks good to me. I have committed that with minor >>> cleanup (more comments below). I have not looked at the second patch. >> >> Gah. I was in the middle of doing this. Sigh. >> >>>> * the chunks size is 32kB (instead of 16kB), and we're using 1/4 >>>> threshold for 'oversized' items >>>> >>>> We need the threshold to be >=8kB, to trigger the special case >>>> within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. >>> >>> Should we care about the fact that if there are only a few tuples, we will >>> nevertheless waste 32kB of memory for the chunk? I guess not, but I thought >>> I'd mention it. The smallest allowed value for work_mem is 64kB. >> >> I think we should change the threshold here to 1/8th. The worst case >> memory wastage as-is ~32k/5 > 6k. > > So you'd lower the threshold to 4kB? That may lower the wastage in the > chunks, but palloc will actually allocate 8kB anyway, wasting up to > additional 4kB. So I don't see how lowering the threshold to 1/8th > improves the situation ... Ah, OK. Well, never mind then. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 10.9.2014 20:55, Heikki Linnakangas wrote: > On 09/10/2014 09:31 PM, Robert Haas wrote: >>>> * the chunks size is 32kB (instead of 16kB), and we're using 1/4 >>>> threshold for 'oversized' items >>>> >>>> We need the threshold to be >=8kB, to trigger the special case >>>> within AllocSet. The 1/4 rule is consistent with >>>> ALLOC_CHUNK_FRACTION. >>> >>> Should we care about the fact that if there are only a few tuples, we >>> will >>> nevertheless waste 32kB of memory for the chunk? I guess not, but I >>> thought >>> I'd mention it. The smallest allowed value for work_mem is 64kB. >> >> I think we should change the threshold here to 1/8th. The worst case >> memory wastage as-is ~32k/5 > 6k. > > Not sure how you arrived at that number. The worst case is that each 32k > chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each > chunk. That's 25% wastage. That's not too bad IMHO - the worst case > waste of AllocSet is 50%. > > But if you want to twiddle it, feel free. Note that there's some > interplay with allocset code, like Tomas mentioned. If the threshold is > < 8k, palloc() will round up request sizes smaller than 8k anyway. That > wastes some memory, nothing more serious than that, but it seems like a > good idea to avoid it. Yes, and pfree then keeps these blocks, which means a chance of keeping memory that won't be reused. That's why I chose the 8kB threshold. So I'd keep the 32kB/8kB, as proposed in the patch. But I don't see this as a big issue - my experience is that we either have tuples much smaller than 4kB, or much larger than 8kB. And even for the tuples between 4kB and 8kB, the waste is not that bad. regards Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 10.9.2014 20:25, Heikki Linnakangas wrote: > On 09/10/2014 01:49 AM, Tomas Vondra wrote: >> I also did a few 'minor' changes to the dense allocation patch, most >> notably: >> >> * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData >> The original naming seemed a bit awkward. > > That's too easy to confuse with regular MemoryContext and AllocChunk > stuff. I renamed it to HashMemoryChunk. BTW this breaks the second patch, which is allocating new chunks when resizing the hash table. Should I rebase the patch, or can the committer do s/MemoryChunk/HashMemoryChunk/ ? Assuming there are no more fixes needed, of course. Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 10.9.2014 20:25, Heikki Linnakangas wrote: >> On 09/10/2014 01:49 AM, Tomas Vondra wrote: >>> I also did a few 'minor' changes to the dense allocation patch, most >>> notably: >>> >>> * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData >>> The original naming seemed a bit awkward. >> >> That's too easy to confuse with regular MemoryContext and AllocChunk >> stuff. I renamed it to HashMemoryChunk. > > BTW this breaks the second patch, which is allocating new chunks when > resizing the hash table. Should I rebase the patch, or can the commiter > do s/MemoryChunk/HashMemoryChunk/ ? > > Assuming there are no more fixes needed, of course. Rebasing it will save the committer time, which will increase the chances that one will look at your patch. So it's highly recommended. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 10.9.2014 21:34, Robert Haas wrote: > On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> On 10.9.2014 20:25, Heikki Linnakangas wrote: >>> On 09/10/2014 01:49 AM, Tomas Vondra wrote: >>>> I also did a few 'minor' changes to the dense allocation patch, most >>>> notably: >>>> >>>> * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData >>>> The original naming seemed a bit awkward. >>> >>> That's too easy to confuse with regular MemoryContext and AllocChunk >>> stuff. I renamed it to HashMemoryChunk. >> >> BTW this breaks the second patch, which is allocating new chunks when >> resizing the hash table. Should I rebase the patch, or can the commiter >> do s/MemoryChunk/HashMemoryChunk/ ? >> >> Assuming there are no more fixes needed, of course. > > Rebasing it will save the committer time, which will increase the > chances that one will look at your patch. So it's highly recommended. OK. So here's v13 of the patch, reflecting this change. regards Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > OK. So here's v13 of the patch, reflecting this change. With the exception of ExecChooseHashTableSize() and a lot of stylistic issues along the lines of what I've already complained about, this patch seems pretty good to me. It does three things: (1) It changes NTUP_PER_BUCKET to 1. Although this increases memory utilization, it also improves hash table performance, and the additional memory we use is less than what we saved as a result of commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a. (2) It changes ExecChooseHashTableSize() to include the effect of the memory consumed by the bucket array. If we care about properly respecting work_mem, this is clearly right for any NTUP_PER_BUCKET setting, but more important for a small setting like 1 than for the current value of 10. I'm not sure the logic in this function is as clean and simple as it can be just yet. (3) It allows the number of batches to increase on the fly while the hash join is in process. This case arises when we initially estimate that we only need a small hash table, and then it turns out that there are more tuples than we expect. Without this code, the hash table's load factor gets too high and things start to suck. I recommend separating this patch in two patches, the first covering items (1) and (2) and the second covering item (3), which really seems like a separate improvement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > With the exception of ExecChooseHashTableSize() and a lot of stylistic > issues along the lines of what I've already complained about, this > patch seems pretty good to me. It does three things: > ... > (3) It allows the number of batches to increase on the fly while the > hash join is in process. This case arises when we initially estimate > that we only need a small hash table, and then it turns out that there > are more tuples than we expect. Without this code, the hash table's > load factor gets too high and things start to suck. Pardon me for not having read the patch yet, but what part of (3) wasn't there already? regards, tom lane
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> With the exception of ExecChooseHashTableSize() and a lot of stylistic >> issues along the lines of what I've already complained about, this >> patch seems pretty good to me. It does three things: >> ... >> (3) It allows the number of batches to increase on the fly while the >> hash join is in process. This case arises when we initially estimate >> that we only need a small hash table, and then it turns out that there >> are more tuples than we expect. Without this code, the hash table's >> load factor gets too high and things start to suck. > > Pardon me for not having read the patch yet, but what part of (3) > wasn't there already? EINSUFFICIENTCAFFEINE. It allows the number of BUCKETS to increase, not the number of batches. As you say, the number of batches could already increase. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> (3) It allows the number of batches to increase on the fly while the >>> hash join is in process. >> Pardon me for not having read the patch yet, but what part of (3) >> wasn't there already? > EINSUFFICIENTCAFFEINE. > It allows the number of BUCKETS to increase, not the number of > batches. As you say, the number of batches could already increase. Ah. Well, that would mean that we need a heuristic for deciding when to increase the number of buckets versus the number of batches ... seems like a difficult decision. regards, tom lane
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
"Tomas Vondra"
Date:
On 11 Září 2014, 15:31, Robert Haas wrote: > On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> OK. So here's v13 of the patch, reflecting this change. > > With the exception of ExecChooseHashTableSize() and a lot of stylistic > issues along the lines of what I've already complained about, this > patch seems pretty good to me. It does three things: I believe I fixed the stylistic issues you pointed out, except maybe the bracketing (which seems fine to me, though). So if you could point the issues out, that'd be helpful. > > (1) It changes NTUP_PER_BUCKET to 1. Although this increases memory > utilization, it also improves hash table performance, and the > additional memory we use is less than what we saved as a result of > commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a. > > (2) It changes ExecChooseHashTableSize() to include the effect of the > memory consumed by the bucket array. If we care about properly > respecting work_mem, this is clearly right for any NTUP_PER_BUCKET > setting, but more important for a small setting like 1 than for the > current value of 10. I'm not sure the logic in this function is as > clean and simple as it can be just yet. I made that as clear as I was able to (based on your feedback), but I'm "stuck" as I'm familiar with the logic. That is not to say it can't be improved, but this needs to be reviewed by someone else. > > (3) It allows the number of batches to increase on the fly while the > hash join is in process. This case arises when we initially estimate > that we only need a small hash table, and then it turns out that there > are more tuples than we expect. Without this code, the hash table's > load factor gets too high and things start to suck. > > I recommend separating this patch in two patches, the first covering > items (1) and (2) and the second covering item (3), which really seems > like a separate improvement. That probably makes sense. regards Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
"Tomas Vondra"
Date:
On 11 Září 2014, 16:11, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Robert Haas <robertmhaas@gmail.com> writes: >>>> (3) It allows the number of batches to increase on the fly while the >>>> hash join is in process. > >>> Pardon me for not having read the patch yet, but what part of (3) >>> wasn't there already? > >> EINSUFFICIENTCAFFEINE. > >> It allows the number of BUCKETS to increase, not the number of >> batches. As you say, the number of batches could already increase. > > Ah. Well, that would mean that we need a heuristic for deciding when to > increase the number of buckets versus the number of batches ... seems > like a difficult decision. That's true, but that's not the aim of this patch. The patch simply increases the number of buckets if the load happens to get too high, and does not try to decide between increasing nbuckets and nbatch. It's true that we can often get better performance with more batches, but the cases I was able to inspect were caused either by (a) underestimates resulting in inappropriate nbucket count, or (b) caching effects. This patch solves (a), not even trying to fix (b). regards Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Thu, Sep 11, 2014 at 10:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Robert Haas <robertmhaas@gmail.com> writes: >>>> (3) It allows the number of batches to increase on the fly while the >>>> hash join is in process. > >>> Pardon me for not having read the patch yet, but what part of (3) >>> wasn't there already? > >> EINSUFFICIENTCAFFEINE. > >> It allows the number of BUCKETS to increase, not the number of >> batches. As you say, the number of batches could already increase. > > Ah. Well, that would mean that we need a heuristic for deciding when to > increase the number of buckets versus the number of batches ... seems > like a difficult decision. Well, the patch takes the approach of increasing the number of buckets when (1) there's only one batch and (2) the load factor exceeds NTUP_PER_BUCKET. As Tomas points out, it's just increasing the number of buckets to the exact same number which we would have allocated had we known that this many tuples would show up. Now, there is a possibility that we could lose. Let's suppose that the tuples overflow work_mem, but just barely. If we've accurately estimated how many tuples there are, the patch changes nothing: either way we're going to need two batches. But let's say we've underestimated the number of tuples by 3x. Then it could be that without the patch we'd allocate fewer buckets, never increase the number of buckets, and avoid batching; whereas with the patch, we'll discover that our tuple estimate was wrong, increase the number of buckets on the fly, and then be forced by the slightly-increased memory consumption that results from increasing the number of buckets to do two batches. That would suck. But catering to that case is basically hoping that we'll fall into a septic tank and come out smelling like a rose - that is, we're hoping that our initial poor estimate will cause us to accidentally do the right thing later. I don't think that's the way to go. It's much more important to get the case where things are bigger than we expected but still fit within work_mem right; and we're currently taking a huge run-time penalty in that case, as Tomas' results show. If we wanted to cater to the other case in the future, we could consider the opposite approach: if work_mem is exhausted, throw the bucket headers away and keep reading tuples into the dense-packed chunks added by yesterday's commit. If we again run out of work_mem, then we *definitely* need to increase the batch count. If we manage to make everything fit, then we know exactly how many bucket headers we can afford, and can use some heuristic to decide between that and using more batches. I don't think that should be a requirement for this patch, though: I think the cumulative effect of Tomas's patches is going to be a win *much* more often than it's a loss. In 100% of cases, the dense-packing stuff will make a hash table containing the same tuples significantly smaller. Except when we've radically overestimated the tuple count, the change to NTUP_PER_BUCKET will mean less time spent chasing hash chains. It does use more memory, but that's more than paid for by the first change. Both the dense-packing stuff and the changes to include bucket headers in the work_mem calculation will have the effect of making the work_mem bound on memory utilization far more accurate than it's ever been previously.
And the change to increase the number of buckets at runtime will mean that we're significantly more resilient against the planner underestimating the number of tuples. That's clearly good. The fact that we can construct borderline cases where it loses shouldn't deter us from pushing forward. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tom Lane
Date:
"Tomas Vondra" <tv@fuzzy.cz> writes: > On 11 Září 2014, 16:11, Tom Lane wrote: >> Ah. Well, that would mean that we need a heuristic for deciding when to >> increase the number of buckets versus the number of batches ... seems >> like a difficult decision. > That's true, but that's not the aim of this patch. The patch simply > increases the number of buckets if the load happens to get too high, and > does not try to decide between increasing nbuckets and nbatch. On further thought, increasing nbuckets without changing the batch boundaries would not get us out of an out-of-work_mem situation, in fact it makes memory consumption worse not better (assuming you count the bucket headers towards work_mem ;-)). So in principle, the rule seems like it ought to go "if load (defined as max bucket chain length, I imagine?) gets too high, but we are still well below work_mem, increase nbuckets; else increase nbatch". And perhaps we reset nbuckets again for the next batch, not sure. If we are dealing with an out-of-work_mem situation then only increasing nbatch would be a suitable response. Because of the risk that increasing nbuckets would itself lead to a work_mem violation, I don't think it's sane to ignore the interaction entirely, even in a first patch. regards, tom lane
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
"Tomas Vondra"
Date:
On 11 Září 2014, 17:28, Tom Lane wrote: > "Tomas Vondra" <tv@fuzzy.cz> writes: >> On 11 Září 2014, 16:11, Tom Lane wrote: >>> Ah. Well, that would mean that we need a heuristic for deciding when >>> to >>> increase the number of buckets versus the number of batches ... seems >>> like a difficult decision. > >> That's true, but that's not the aim of this patch. The patch simply >> increases the number of buckets if the load happens to get too high, and >> does not try to decide between increasing nbuckets and nbatch. > > On further thought, increasing nbuckets without changing the batch > boundaries would not get us out of an out-of-work_mem situation, in fact > it makes memory consumption worse not better (assuming you count the > bucket headers towards work_mem ;-)). Yes, the patch counts the bucket headers towards work_mem, while the original code didn't. The reasoning is that due to changing NTUP_PER_BUCKET from 10 to 1, the memory occupied by buckets is not negligible and should be accounted for. > So in principle, the rule seems like it ought to go "if load (defined as > max bucket chain length, I imagine?) gets too high, but we are still > well below work_mem, increase nbuckets; else increase nbatch". And > perhaps we reset nbuckets again for the next batch, not sure. If we > are dealing with an out-of-work_mem situation then only increasing nbatch > would be a suitable response. Almost. If we expect batching at the very beginning, we size nbuckets for "full work_mem" (see how many tuples we can get into work_mem, while not breaking the NTUP_PER_BUCKET threshold). If we expect to be fine without batching, we start with the 'right' nbuckets and track the optimal nbuckets as we go (without actually resizing the hash table). Once we hit work_mem (considering the optimal nbuckets value), we keep the value. Only at the end, we check whether (nbuckets != nbuckets_optimal) and resize the hash table if needed. Also, we keep this value for all batches (it's OK because it assumes full work_mem, and it makes the batchno evaluation trivial). So the resize happens only once and only for the first batch. > Because of the risk that increasing nbuckets would itself lead to a > work_mem violation, I don't think it's sane to ignore the interaction > entirely, even in a first patch. This possibility of increasing the number of batches is the consequence of counting the buckets towards work_mem. I believe that's the right thing to do here, to make the work_mem guarantee stronger, but it's not something this patch depends on. If this is the only concern, we can leave this out. It's also worth mentioning that the dense allocation of tuples makes a huge difference. palloc easily results in 100% overhead for narrow tuples (that is, allocating 2x the amount of needed memory). For example over here [1] is an example of a hashjoin consuming 1.4GB with work_mem=800MB. And the dense allocation patch pretty much makes this go away (as demonstrated in [1]). For wider tuples, the difference is smaller, but that's when the buckets are less important. What I'm trying to say is that it's only a matter of work_mem values. Currently, because of the weak guarantee, people tend to set the value lower than necessary. The dense allocation makes this unnecessary, allowing higher work_mem values, and thus eliminating the issue. If this is not enough, we can stop counting the buckets towards work_mem.
It'll still be more memory efficient than the old approach (as a pointer is smaller than palloc overhead), and it won't cause additional batches. But I don't like this - I think we should make work_mem a stronger guarantee. [1] http://www.postgresql.org/message-id/53BEEA9E.2080009@fuzzy.cz regards Tomas
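An outline, with made-up field names, of the nbuckets handling described above; the real code operates on HashJoinTable inside nodeHash.c and handles many details (actual bucket reallocation, re-linking tuples, batching) omitted here:

    #include <stddef.h>

    typedef struct
    {
        long    nbuckets;           /* buckets currently allocated */
        long    nbuckets_optimal;   /* what we would pick knowing the real tuple count */
        double  ntuples;
        int     nbatch;
        size_t  space_used;
        size_t  space_allowed;      /* work_mem */
    } HashTableSketch;

    static void
    note_tuple_inserted(HashTableSketch *ht, size_t tupsize)
    {
        ht->ntuples += 1;
        ht->space_used += tupsize;

        /*
         * While still in a single batch and under work_mem, keep tracking the
         * optimal bucket count (NTUP_PER_BUCKET = 1).  Once work_mem is hit,
         * the value is frozen so later batch-number computations stay stable.
         */
        if (ht->nbatch == 1 && ht->space_used <= ht->space_allowed)
        {
            while (ht->ntuples > ht->nbuckets_optimal)
                ht->nbuckets_optimal *= 2;
        }
    }

    static void
    finish_first_batch(HashTableSketch *ht)
    {
        /* the resize happens at most once: here, or as part of the first rebatch */
        if (ht->nbuckets != ht->nbuckets_optimal)
            ht->nbuckets = ht->nbuckets_optimal;    /* plus re-linking the tuples */
    }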
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 11.9.2014 16:33, Tomas Vondra wrote: > On 11 Září 2014, 15:31, Robert Haas wrote: >> On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>> OK. So here's v13 of the patch, reflecting this change. >> >> [...] It does three things: >> >> (1) It changes NTUP_PER_BUCKET to 1. Although this increases memory >> utilization, it also improves hash table performance, and the >> additional memory we use is less than what we saved as a result of >> commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a. >> >> (2) It changes ExecChooseHashTableSize() to include the effect of the >> memory consumed by the bucket array. If we care about properly >> respecting work_mem, this is clearly right for any NTUP_PER_BUCKET >> setting, but more important for a small setting like 1 than for the >> current value of 10. I'm not sure the logic in this function is as >> clean and simple as it can be just yet. >> (3) It allows the number of batches to increase on the fly while the >> hash join is in process. This case arises when we initially estimate >> that we only need a small hash table, and then it turns out that there >> are more tuples than we expect. Without this code, the hash table's >> load factor gets too high and things start to suck. >> >> I recommend separating this patch in two patches, the first covering >> items (1) and (2) and the second covering item (3), which really seems >> like a separate improvement. > > That probably makes sense. Attached is the patch split as suggested: (a) hashjoin-nbuckets-v14a-size.patch * NTUP_PER_BUCKET=1 * counting buckets towards work_mem * changes in ExecChooseHashTableSize (due to the other changes) (b) hashjoin-nbuckets-v14a-resize.patch * rest of the patch, that is ... * tracking optimal number of buckets * dynamic resize of the hash table * adding info to the EXPLAIN ANALYZE output regards Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > Attached is the patch split as suggested: > > (a) hashjoin-nbuckets-v14a-size.patch > > * NTUP_PER_BUCKET=1 > * counting buckets towards work_mem > * changes in ExecChooseHashTableSize (due to the other changes) OK, I'm going to work on this one now. I have some ideas about how to simplify this a bit and will post an update in a few hours once I see how that pans out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> Attached is the patch split as suggested: >> >> (a) hashjoin-nbuckets-v14a-size.patch >> >> * NTUP_PER_BUCKET=1 >> * counting buckets towards work_mem >> * changes in ExecChooseHashTableSize (due to the other changes) > > OK, I'm going to work on this one now. I have some ideas about how to > simplify this a bit and will post an update in a few hours once I see > how that pans out. Here's what I came up with. I believe this has the same semantics as your patch for less code, but please check, because I might be wrong. The main simplifications I made were: (1) In master, we decide whether or not to batch based on the size of the data, without knowing the number of buckets. If we want to account for the bucket overhead, that doesn't work. But in your patch, we basically computed the number of buckets we'd need for the single-batch case, then decide whether to do a single batch, and if yes, computed the same values over again with the same formula phrased differently. I revised that so that the single-batch case is calculated only once. If everything fits, we're all set, and don't need to recalculate anything. (2) In your patch, once we decide we need to batch, you set nbatch as now, and then check whether the computed value is still adequate after subtracting buckets_byte from hash_table_bytes. I revised that so that we make the initial batch estimate of dbatch by dividing inner_rel_bytes by hash_table_bytes - bucket_bytes rather than hash_table_bytes. I think that avoids the need to maybe increase nbatch again afterwards. I'm comfortable with this version if you are, but (maybe as a follow-on commit) I think we could make this even a bit smarter. If inner_rel_bytes + bucket_bytes > hash_table_bytes but inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce the number of buckets by 2x or 4x to make it fit. That would provide a bit of additional safety margin against cases where this patch might unluckily lose. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
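To make the arithmetic in points (1) and (2) easier to follow, here is a stripped-down sketch of the described flow. It is not the committed code (the real logic is in ExecChooseHashTableSize() in src/backend/executor/nodeHash.c and handles skew buckets, pointer limits and other details); the helper and constants below are simplified stand-ins, and degenerate inputs are not handled.

    /*
     * Simplified sketch of the sizing flow described above, not the actual
     * ExecChooseHashTableSize().  Skew optimization, the limits on nbuckets
     * and guards against degenerate inputs are all omitted.
     */
    #include <math.h>

    #define NTUP_PER_BUCKET 1
    #define BUCKET_PTR_SIZE ((long) sizeof(void *))   /* one pointer per bucket */

    /* round up to the next power of two (illustrative helper) */
    static long
    next_pow2(double v)
    {
        long        result = 1;

        while (result < v)
            result *= 2;
        return result;
    }

    static void
    choose_hash_table_size(double ntuples, int tupsize, long hash_table_bytes,
                           int *nbuckets, int *nbatch)
    {
        double      inner_rel_bytes = ntuples * tupsize;
        long        bucket_bytes;

        /* Work out the single-batch case exactly once. */
        *nbuckets = (int) next_pow2(ntuples / NTUP_PER_BUCKET);
        bucket_bytes = *nbuckets * BUCKET_PTR_SIZE;

        if (inner_rel_bytes + bucket_bytes <= hash_table_bytes)
        {
            *nbatch = 1;                /* data plus bucket array fits: done */
            return;
        }

        /*
         * Multiple batches: size the bucket array for a work_mem-full of
         * tuples, then estimate nbatch against the memory left over after
         * subtracting the bucket array (rather than against work_mem alone),
         * so nbatch does not have to be bumped again afterwards.
         */
        *nbuckets = (int) next_pow2((hash_table_bytes / (double) tupsize) /
                                    NTUP_PER_BUCKET);
        bucket_bytes = *nbuckets * BUCKET_PTR_SIZE;
        *nbatch = (int) next_pow2(ceil(inner_rel_bytes /
                                       (hash_table_bytes - bucket_bytes)));
    }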
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 12.9.2014 18:49, Robert Haas wrote:
> On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>> Attached is the patch split as suggested:
>>>
>>> (a) hashjoin-nbuckets-v14a-size.patch
>>>
>>> * NTUP_PER_BUCKET=1
>>> * counting buckets towards work_mem
>>> * changes in ExecChooseHashTableSize (due to the other changes)
>>
>> OK, I'm going to work on this one now. I have some ideas about how to simplify this a bit and will post an update in a few hours once I see how that pans out.
>
> Here's what I came up with. I believe this has the same semantics as your patch for less code, but please check, because I might be wrong. The main simplifications I made were:
>
> (1) In master, we decide whether or not to batch based on the size of the data, without knowing the number of buckets. If we want to account for the bucket overhead, that doesn't work. But in your patch, we basically computed the number of buckets we'd need for the single-batch case, then decide whether to do a single batch, and if yes, computed the same values over again with the same formula phrased differently. I revised that so that the single-batch case is calculated only once. If everything fits, we're all set, and don't need to recalculate anything.
>
> (2) In your patch, once we decide we need to batch, you set nbatch as now, and then check whether the computed value is still adequate after subtracting bucket_bytes from hash_table_bytes. I revised that so that we make the initial batch estimate of dbatch by dividing inner_rel_bytes by hash_table_bytes - bucket_bytes rather than hash_table_bytes. I think that avoids the need to maybe increase nbatch again afterwards.

Yes, I like those changes and I think your reasoning is correct in both cases. It certainly makes the method shorter and more readable - I was too "stuck" in the original logic, so thanks for improving this.

So +1 from me to those changes.

> I'm comfortable with this version if you are, but (maybe as a follow-on commit) I think we could make this even a bit smarter. If inner_rel_bytes + bucket_bytes > hash_table_bytes but inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce the number of buckets by 2x or 4x to make it fit. That would provide a bit of additional safety margin against cases where this patch might unluckily lose.

I don't think that's a good idea. That essentially says 'we're shooting for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where batching (because of smaller work_mem) actually significantly improves performance. If we want to make this reasoning properly, deciding whether to add batches or reduce buckets, we need a better heuristic - that's quite complex and I'd expect it to result in a quite complex patch.

So -1 from me to this at this moment (it certainly needs much more thought than a mere follow-on commit).

regards
Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 12.9.2014 18:49, Robert Haas wrote: >> On Fri, Sep 12, 2014 at 8:28 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Thu, Sep 11, 2014 at 5:57 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>>> Attached is the patch split as suggested: >>>> >>>> (a) hashjoin-nbuckets-v14a-size.patch >>>> >>>> * NTUP_PER_BUCKET=1 >>>> * counting buckets towards work_mem >>>> * changes in ExecChooseHashTableSize (due to the other changes) >>> >>> OK, I'm going to work on this one now. I have some ideas about how to >>> simplify this a bit and will post an update in a few hours once I see >>> how that pans out. >> >> Here's what I came up with. I believe this has the same semantics as >> your patch for less code, but please check, because I might be >> wrong. The main simplifications I made were: >> >> (1) In master, we decide whether or not to batch based on the size >> of the data, without knowing the number of buckets. If we want to >> account for the bucket overhead, that doesn't work. But in your >> patch, we basically computed the number of buckets we'd need for the >> single-batch case, then decide whether to do a single batch, and if >> yes, computed the same values over again with the same formula >> phrased differently. I revised that so that the single-batch case is >> calculated only once. If everything fits, we're all set, and don't >> need to recalculate anything. >> >> (2) In your patch, once we decide we need to batch, you set nbatch >> as now, and then check whether the computed value is still adequate >> after subtracting buckets_byte from hash_table_bytes. I revised that >> so that we make the initial batch estimate of dbatch by dividing >> inner_rel_bytes by hash_table_bytes - bucket_bytes rather than >> hash_table_bytes. I think that avoids the need to maybe increase >> nbatch again afterwards. > > Yes, I like those changes and I think your reasoning is correct in both > cases. It certainly makes the method shorter and more readable - I was > too "stuck" in the original logic, so thanks for improving this. > > So +1 from me to those changes. OK, committed. Please rebase the rest. >> I'm comfortable with this version if you are, but (maybe as a >> follow-on commit) I think we could make this even a bit smarter. If >> inner_rel_bytes + bucket_bytes > hash_table_bytes but >> inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce >> the number of buckets by 2x or 4x to make it fit. That would provide >> a bit of additional safety margin against cases where this patch >> might unluckily lose. > > I don't think that's a good idea. That essentially says 'we're shooting > for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where > batching (because of smaller work_mem) actually significantly improves > performance. If we want to make this reasoning properly, deciding > whether to add batches or reduce buckets, we need a better heuristics - > that's quite complex and I'd expect it to result ain a quite complex patch. > > So -1 from me to this at this moment (it certainly needs much more > thought than a mere follow-on commit). OK, but let's discuss further. You make it sound like treating NTUP_PER_BUCKET as a soft limit is a bad thing, but I'm not convinced. I think it's perfectly reasonable to treat NTUP_PER_BUCKET as a range: if we can get NTUP_PER_BUCKET=1, great, but if not, NTUP_PER_BUCKET=2 or NTUP_PER_BUCKET=4 may be perfectly reasonable. 
I'm actually quite surprised that you find batching to be a better strategy than skimping on buckets, because I would have expected the opposite, almost categorically. Batching means having to write out the tuples we can't process right away and read them back in. If that involves disk I/O, I think the cost of that I/O is going to be FAR more than the overhead we would have incurred by searching slightly longer bucket chains. If it didn't, then you could've set work_mem higher and avoided batching in the first place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
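For concreteness, the "reduce the number of buckets instead of batching" idea being debated above amounts to roughly the following. Nothing like this was committed at this point in the thread; the function name and signature are invented for illustration.

    #include <stdbool.h>

    /*
     * Sketch of the proposed follow-on tweak (not committed code): if the
     * build side would fit in work_mem with a bucket array up to 4x smaller,
     * halve nbuckets (NTUP_PER_BUCKET effectively becomes 2, then 4) instead
     * of switching to multiple batches.
     */
    static bool
    try_fewer_buckets(double inner_rel_bytes, long hash_table_bytes,
                      int *nbuckets, long *bucket_bytes)
    {
        /* Hopeless even with a 4x smaller bucket array?  Then batch. */
        if (inner_rel_bytes + *bucket_bytes / 4 > hash_table_bytes)
            return false;

        /* Otherwise halve the bucket count until everything fits. */
        while (*nbuckets > 1 &&
               inner_rel_bytes + *bucket_bytes > hash_table_bytes)
        {
            *nbuckets /= 2;
            *bucket_bytes /= 2;
        }
        return true;                    /* stay at one batch, with longer chains */
    }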
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 12.9.2014 22:24, Robert Haas wrote: > On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> On 12.9.2014 18:49, Robert Haas wrote: >>> I'm comfortable with this version if you are, but (maybe as a >>> follow-on commit) I think we could make this even a bit smarter. If >>> inner_rel_bytes + bucket_bytes > hash_table_bytes but >>> inner_rel_bytes + bucket_bytes / 4 < hash_table_bytes, then reduce >>> the number of buckets by 2x or 4x to make it fit. That would provide >>> a bit of additional safety margin against cases where this patch >>> might unluckily lose. >> >> I don't think that's a good idea. That essentially says 'we're shooting >> for NTUP_PER_BUCKET but not really'. Moreover, I often see cases where >> batching (because of smaller work_mem) actually significantly improves >> performance. If we want to make this reasoning properly, deciding >> whether to add batches or reduce buckets, we need a better heuristics - >> that's quite complex and I'd expect it to result ain a quite complex patch. >> >> So -1 from me to this at this moment (it certainly needs much more >> thought than a mere follow-on commit). > > OK, but let's discuss further. You make it sound like treating > NTUP_PER_BUCKET as a soft limit is a bad thing, but I'm not convinced. > I think it's perfectly reasonable to treat NTUP_PER_BUCKET as a range: > if we can get NTUP_PER_BUCKET=1, great, but if not, NTUP_PER_BUCKET=2 > or NTUP_PER_BUCKET=4 may be perfectly reasonable. I think this really depends on various factors. For example what may work for small work_mem (fitting into L2 cache) may not work for large values, etc. > I'm actually quite surprised that you find batching to be a better > strategy than skimping on buckets, because I would have expect the > opposite, almost categorically. Batching means having to write out > the tuples we can't process right away and read them back in. If > that involves disk I/O, I think the cost of that I/O is going to be > FAR more than the overhead we would have incurred by searching > slightly longer bucket chains. If it didn't, then you could've set > work_mem higher and avoided batching in the first place. No, I don't find batching to be a better strategy. I just think this really needs more discussion than a simple "let's use NTUP_PER_BUCKET=4 to avoid batching" follow-up patch. For example, let's say we switch to NTUP_PER_BUCKET=4 to avoid batching, and then discover we need to start batching anyway. Should we keep the settings, or should we revert NTUP_PER_BUCKET=1? Or maybe not doing that for nbatch=2, but for nbatch=16? I'm not saying it's wrong, but I think it's more complicated than it seems from your description. regards Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 12.9.2014 22:24, Robert Haas wrote: > On Fri, Sep 12, 2014 at 3:39 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> >> Yes, I like those changes and I think your reasoning is correct in both >> cases. It certainly makes the method shorter and more readable - I was >> too "stuck" in the original logic, so thanks for improving this. >> >> So +1 from me to those changes. > > OK, committed. Please rebase the rest. Attached is v15 of the patch, rebased to current HEAD. Tomas
Attachment
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Robert Haas
Date:
On Fri, Sep 12, 2014 at 4:55 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> I'm actually quite surprised that you find batching to be a better >> strategy than skimping on buckets, because I would have expect the >> opposite, almost categorically. Batching means having to write out >> the tuples we can't process right away and read them back in. If >> that involves disk I/O, I think the cost of that I/O is going to be >> FAR more than the overhead we would have incurred by searching >> slightly longer bucket chains. If it didn't, then you could've set >> work_mem higher and avoided batching in the first place. > > No, I don't find batching to be a better strategy. I just think this > really needs more discussion than a simple "let's use NTUP_PER_BUCKET=4 > to avoid batching" follow-up patch. > > For example, let's say we switch to NTUP_PER_BUCKET=4 to avoid batching, > and then discover we need to start batching anyway. Should we keep the > settings, or should we revert NTUP_PER_BUCKET=1? Or maybe not doing that > for nbatch=2, but for nbatch=16? My first thought is to revert to NTUP_PER_BUCKET=1, but it's certainly arguable. Either method, though, figures to be better than doing nothing, so let's do something. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 12.9.2014 23:22, Robert Haas wrote: > On Fri, Sep 12, 2014 at 4:55 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>> I'm actually quite surprised that you find batching to be a >>> better strategy than skimping on buckets, because I would have >>> expect the opposite, almost categorically. Batching means having >>> to write out the tuples we can't process right away and read them >>> back in. If that involves disk I/O, I think the cost of that I/O >>> is going to be FAR more than the overhead we would have incurred >>> by searching slightly longer bucket chains. If it didn't, then >>> you could've set work_mem higher and avoided batching in the >>> first place. >> >> No, I don't find batching to be a better strategy. I just think >> this really needs more discussion than a simple "let's use >> NTUP_PER_BUCKET=4 to avoid batching" follow-up patch. >> >> For example, let's say we switch to NTUP_PER_BUCKET=4 to avoid >> batching, and then discover we need to start batching anyway. >> Should we keep the settings, or should we revert NTUP_PER_BUCKET=1? >> Or maybe not doing that for nbatch=2, but for nbatch=16? > > My first thought is to revert to NTUP_PER_BUCKET=1, but it's > certainly arguable. Either method, though, figures to be better than > doing nothing, so let's do something. OK, but can we commit the remaining part first? Because it'll certainly interact with this proposed part, and it's easier to tweak when the code is already there. Instead of rebasing over and over. Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Kevin Grittner
Date:
Tomas Vondra <tv@fuzzy.cz> wrote:
> On 12.9.2014 23:22, Robert Haas wrote:
>> My first thought is to revert to NTUP_PER_BUCKET=1, but it's certainly arguable. Either method, though, figures to be better than doing nothing, so let's do something.
>
> OK, but can we commit the remaining part first? Because it'll certainly interact with this proposed part, and it's easier to tweak when the code is already there. Instead of rebasing over and over.

The patch applied and built without problem, and passes `make check-world`. The only thing that caught my eye was the addition of debug code using printf() instead of logging at a DEBUG level. Is there any reason for that?

I still need to work through the (rather voluminous) email threads and the code to be totally comfortable with all the issues, but if Robert and Heikki are comfortable with it, I'm not gonna argue.

Preliminary benchmarks look outstanding! I tried to control carefully to ensure consistent results (by dropping, recreating, vacuuming, analyzing, and checkpointing before each test), but there were surprising outliers in the unpatched version. It turned out that it was because slight differences in the random samples caused different numbers of buckets for both unpatched and patched, but the patched version dealt with the difference gracefully while the unpatched version's performance fluctuated randomly. My thinking is that we should get this much committed and then discuss further optimizations.

I tried to throw something at it that would be something of a "worst case" because with the new code it would start with one batch and go to two. Five runs per test, alternating between unpatched and patched. First I tried with the default work_mem size of 4MB:

Planning time / Execution time (ms)

unpatched, work_mem = 4MB
0.694 / 10392.930
0.327 / 10388.787
0.412 / 10533.036
0.650 / 17186.719
0.338 / 10707.423
Fast: Buckets: 2048 Batches: 1 Memory Usage: 3516kB
Slow: Buckets: 1024 Batches: 1 Memory Usage: 3516kB

patched, work_mem = 4MB
0.764 / 5021.792
0.655 / 4991.887
0.436 / 5068.777
0.410 / 5057.902
0.444 / 5021.543
1st & 5th: Buckets: 131072 (originally 1024) Batches: 2 (originally 1) Memory Usage: 3073kB
all others: Buckets: 131072 (originally 2048) Batches: 2 (originally 1) Memory Usage: 3073kB

Then, just to see how both dealt with extra memory I did it again with 1GB:

unpatched, work_mem = 1GB
0.328 / 10407.836
0.319 / 10465.348
0.324 / 16606.948
0.569 / 10408.671
0.542 / 16996.209
Memory usage same as before. Guess which runs started with 1024 buckets. ;-)

patched, work_mem = 1GB
0.556 / 3974.741
0.325 / 4012.613
0.334 / 3941.734
0.834 / 3894.391
0.603 / 3959.440
2nd & 4th: Buckets: 131072 (originally 2048) Batches: 1 (originally 1) Memory Usage: 4540kB
all others: Buckets: 131072 (originally 1024) Batches: 1 (originally 1) Memory Usage: 4540kB

My only concern from the benchmarks is that it seemed like there was a statistically significant increase in planning time:

unpatched plan time average: 0.450 ms
patched plan time average: 0.536 ms

That *might* just be noise, but it seems likely to be real. For the improvement in run time, I'd put up with an extra 86us in planning time per hash join; but if there's any way to shave some of that off, all the better.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Heikki Linnakangas
Date:
On 10/02/2014 03:20 AM, Kevin Grittner wrote: > My only concern from the benchmarks is that it seemed like there > was a statistically significant increase in planning time: > > unpatched plan time average: 0.450 ms > patched plan time average: 0.536 ms > > That *might* just be noise, but it seems likely to be real. For > the improvement in run time, I'd put up with an extra 86us in > planning time per hash join; but if there's any way to shave some > of that off, all the better. The patch doesn't modify the planner at all, so it would be rather surprising if it increased planning time. I'm willing to just write that off as noise. - Heikki
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
"Tomas Vondra"
Date:
On 2 October 2014, 2:20, Kevin Grittner wrote:
> Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 12.9.2014 23:22, Robert Haas wrote:
>
>>> My first thought is to revert to NTUP_PER_BUCKET=1, but it's certainly arguable. Either method, though, figures to be better than doing nothing, so let's do something.
>>
>> OK, but can we commit the remaining part first? Because it'll certainly interact with this proposed part, and it's easier to tweak when the code is already there. Instead of rebasing over and over.
>
> The patch applied and built without problem, and passes `make check-world`. The only thing that caught my eye was the addition of debug code using printf() instead of logging at a DEBUG level. Is there any reason for that?

Not really. IIRC the main reason is that the other code in nodeHash.c uses the same approach.

> I still need to work through the (rather voluminous) email threads and the code to be totally comfortable with all the issues, but if Robert and Heikki are comfortable with it, I'm not gonna argue.

;-)

> Preliminary benchmarks look outstanding! I tried to control carefully to ensure consistent results (by dropping, recreating, vacuuming, analyzing, and checkpointing before each test), but there were surprising outliers in the unpatched version. It turned out that it was because slight differences in the random samples caused different numbers of buckets for both unpatched and patched, but the patched version dealt with the difference gracefully while the unpatched version's performance fluctuated randomly.

Can we get a bit more detail on the test cases? I did a number of tests while working on the patch, and I generally tested two cases:

(a) well-estimated queries (i.e. with nbucket matching NTUP_PER_BUCKET)
(b) mis-estimated queries, with various levels of accuracy (say, 10x - 1000x misestimates)

From the description, it seems you only tested (b) - is that correct?

The thing is that while the resize is very fast and happens only once, it's not perfectly free. In my tests, this was always more than compensated by the speedups (even in the weird cases reported by Stephen Frost), so I think we're safe here.

Also, I definitely recommend testing with different hash table sizes (say, work_mem=256MB and the hash table just enough to fit in without batching). The thing is the effect of CPU caches is very different for small and large hash tables. (This is not about work_mem alone, but about how much memory is used by the hash table - according to the results you posted it never gets over ~4.5MB)

You tested against current HEAD, right? This patch was split into three parts, two of which were already committed (45f6240a and 8cce08f1). The logic of the patch was "this might take a bit of time/memory, but it's compensated by these other changes". Maybe running the tests on the original code would be interesting? Although, if this last part of the patch actually improves the performance on its own, we're fine - it'll improve it even more compared to the old code (especially before lowering NTUP_PER_BUCKET from 10 to 1).

> My only concern from the benchmarks is that it seemed like there was a statistically significant increase in planning time:
>
> unpatched plan time average: 0.450 ms
> patched plan time average: 0.536 ms
>
> That *might* just be noise, but it seems likely to be real. For the improvement in run time, I'd put up with an extra 86us in planning time per hash join; but if there's any way to shave some of that off, all the better.
I agree with Heikki that this is probably noise, because the patch does not mess with the planner at all. The only thing I can think of is adding a few fields to HashJoinTableData. Maybe this makes the structure too large to fit into a cacheline, or something?

Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Kevin Grittner
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 10/02/2014 03:20 AM, Kevin Grittner wrote:
>> My only concern from the benchmarks is that it seemed like there was a statistically significant increase in planning time:
>>
>> unpatched plan time average: 0.450 ms
>> patched plan time average: 0.536 ms
>>
>> That *might* just be noise, but it seems likely to be real. For the improvement in run time, I'd put up with an extra 86us in planning time per hash join; but if there's any way to shave some of that off, all the better.
>
> The patch doesn't modify the planner at all, so it would be rather surprising if it increased planning time. I'm willing to just write that off as noise.

Fair enough. I have seen much larger variations caused by how the executable code happened to line up relative to cacheline boundaries, and we've never made any effort to manage that.

I've tried various other tests using \timing rather than EXPLAIN, and the patched version looks even better in those cases. I have seen up to 4x the performance for a query using the patched version, higher variability in run time without the patch, and have yet to devise a benchmark where the patched version came out slower (although I admit to not being as good at creating such cases as some here).

When given a generous work_mem setting the patched version often uses more of what it is allowed than the unpatched version (which is probably one of the reasons it tends to do better). If someone has set a high work_mem and has gotten by only because the configured amount is not all being used when it would benefit performance, they may need to adjust work_mem down to avoid memory problems. That doesn't seem to me to be a reason to reject the patch.

This is in "Waiting on Author" status only because I never got an answer about why the debug code used printf() rather than elog() at a DEBUG level. Other than that, I would say this patch is Ready for Committer. Tomas? You there?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
On 9.10.2014 16:55, Kevin Grittner wrote: > > I've tried various other tests using \timing rather than EXPLAIN, and > the patched version looks even better in those cases. I have seen up > to 4x the performance for a query using the patched version, higher > variability in run time without the patch, and have yet to devise a > benchmark where the patched version came out slower (although I admit > to not being as good at creating such cases as some here). Nice. Thanks for the testing! The only case I've been able to come up with is when the hash table fits into work_mem only thanks to not counting the buckets. The new code will start batching in this case. That is mostly luck, however, because it depends on the cardinality estimates, and the best way to fix it is increasing work_mem (which is safer thanks to reducing the overhead). Also, Robert proposed a way to mitigate this, if we realize we'd have to do batching during the initial sizing, we can peek whether reducing the number of buckets (to 1/2 or maybe 1/4) would help. I believe this is a good approach, and will look into that after pgconf.eu (i.e. early November), unless someone else is interested. > When given a generous work_mem setting the patched version often uses > more of what it is allowed than the unpatched version (which is > probably one of the reasons it tends to do better). If someone has > set a high work_mem and has gotten by only because the configured > amount is not all being used when it would benefit performance, they > may need to adjust work_mem down to avoid memory problems. That > doesn't seem to me to be a reason to reject the patch. I'm not entirely sure I understand this paragraph. What do you mean by "configured amount is not all being used when it would benefit performance"? Can you give an example? The only thing I can think of is the batching behavior described above. > This is in "Waiting on Author" status only because I never got an > answer about why the debug code used printf() rather the elog() at > a DEBUG level. Other than that, I would say this patch is Ready > for Committer. Tomas? You there? I think I responded to that on October 2, quoting: =================================================================== On 2.10.2014 09:50, Tomas Vondra wrote: > On 2.10.2014, 2:20, Kevin Grittner wrote: >> >> The patch applied and built without problem, and pass `make >> check-world`. The only thing that caught my eye was the addition of >> debug code using printf() instead of logging at a DEBUG level. Is >> there any reason for that? > > Not really. IIRC the main reason it that the other code in > nodeHash.c uses the same approach. =================================================================== I believe it's safe to switch the logging to elog(). IMHO the printf logging is there from some very early version of the code, before elog was introduced. Or something like that. Tomas
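For illustration, the printf-to-elog change being discussed would have roughly this shape; the message text and variable name are hypothetical, not taken from the patch:

    /* before: debug output written straight to the server's stdout
     * (message text and nbuckets_new are hypothetical) */
    printf("hashjoin: increasing nbuckets to %d\n", nbuckets_new);

    /* after: the same message routed through the regular logging machinery,
     * emitted only when log_min_messages is DEBUG1 or lower; elog() adds
     * the trailing newline itself */
    elog(DEBUG1, "hashjoin: increasing nbuckets to %d", nbuckets_new);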
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Kevin Grittner
Date:
Tomas Vondra <tv@fuzzy.cz> wrote: > On 9.10.2014 16:55, Kevin Grittner wrote: >> I've tried various other tests using \timing rather than EXPLAIN, and >> the patched version looks even better in those cases. I have seen up >> to 4x the performance for a query using the patched version, higher >> variability in run time without the patch, and have yet to devise a >> benchmark where the patched version came out slower (although I admit >> to not being as good at creating such cases as some here). > > Nice. Thanks for the testing! > > The only case I've been able to come up with is when the hash table fits > into work_mem only thanks to not counting the buckets. The new code will > start batching in this case. Hmm. If you look at the timings in my posting from 2014-10-01 I had cases where the patched version started with one batch and went to two, while the unpatched used just one batch, and the patched version was still more than twice as fast. I'm sure the "on disk" batch was fully cached, however; it might work out differently if disk speed actually came into the picture. > That is mostly luck, however, because it depends on the cardinality > estimates, and the best way to fix it is increasing work_mem (which is > safer thanks to reducing the overhead). > > Also, Robert proposed a way to mitigate this, if we realize we'd have to > do batching during the initial sizing, we can peek whether reducing the > number of buckets (to 1/2 or maybe 1/4) would help. I believe this is a > good approach, and will look into that after pgconf.eu (i.e. early > November), unless someone else is interested. Sure, but it would be good to confirm that it's actually needed first. >> When given a generous work_mem setting the patched version often uses >> more of what it is allowed than the unpatched version (which is >> probably one of the reasons it tends to do better). If someone has >> set a high work_mem and has gotten by only because the configured >> amount is not all being used when it would benefit performance, they >> may need to adjust work_mem down to avoid memory problems. That >> doesn't seem to me to be a reason to reject the patch. > > I'm not entirely sure I understand this paragraph. What do you mean by > "configured amount is not all being used when it would benefit > performance"? Can you give an example? Again, the earlier post showed that while the unpatched used 3516kB whether it had work_mem set to 4MB or 1GB, the patched version used 3073kB when work_mem was set to 4MB and 4540kB when work_mem was set to 1GB. The extra memory allowed the patched version to stay at 1 batch, improving performance over the other setting. > The only thing I can think of is the batching behavior described above. > >> This is in "Waiting on Author" status only because I never got an >> answer about why the debug code used printf() rather the elog() at >> a DEBUG level. Other than that, I would say this patch is Ready >> for Committer. Tomas? You there? > > I think I responded to that on October 2, quoting: > > =================================================================== > On 2.10.2014 09:50, Tomas Vondra wrote: >> On 2.10.2014, 2:20, Kevin Grittner wrote: >>> >>> The patch applied and built without problem, and pass `make >>> check-world`. The only thing that caught my eye was the addition of >>> debug code using printf() instead of logging at a DEBUG level. Is >>> there any reason for that? >> >> Not really. IIRC the main reason it that the other code in >> nodeHash.c uses the same approach. 
>=================================================================== Ah, I never received that email. That tends to happen every now and then. :-( > I believe it's safe to switch the logging to elog(). IMHO the printf > logging is there from some very early version of the code, before elog > was introduced. Or something like that. I guess it might be best to remain consistent in this patch and change that in a separate patch. I just wanted to make sure you didn't see any reason not to do so. With that addressed I will move this to Ready for Committer. Since both Heikki and Robert spent time on this patch earlier, I'll give either of them a shot at committing it if they want; otherwise I'll do it. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Tomas Vondra
Date:
Hi! On 9.10.2014 22:28, Kevin Grittner wrote: > Tomas Vondra <tv@fuzzy.cz> wrote: >> >> The only case I've been able to come up with is when the hash table >> fits into work_mem only thanks to not counting the buckets. The new >> code will start batching in this case. > > Hmm. If you look at the timings in my posting from 2014-10-01 I had > cases where the patched version started with one batch and went to > two, while the unpatched used just one batch, and the patched version > was still more than twice as fast. I'm sure the "on disk" batch was > fully cached, however; it might work out differently if disk speed > actually came into the picture. I think the case with large outer table and memory pressure, forcing the batches to be actually written to disk (instead of just being in the page cache) is the main concern here. >> That is mostly luck, however, because it depends on the cardinality >> estimates, and the best way to fix it is increasing work_mem (which is >> safer thanks to reducing the overhead). >> >> Also, Robert proposed a way to mitigate this, if we realize we'd >> have to do batching during the initial sizing, we can peek whether >> reducing the number of buckets (to 1/2 or maybe 1/4) would help. I >> believe this is a good approach, and will look into that after >> pgconf.eu (i.e. early November), unless someone else is interested. > > Sure, but it would be good to confirm that it's actually needed first. Yeah. But with cleverly crafted test case it's possible to confirm almost everything ;-) >>> When given a generous work_mem setting the patched version often uses >>> more of what it is allowed than the unpatched version (which is >>> probably one of the reasons it tends to do better). If someone has >>> set a high work_mem and has gotten by only because the configured >>> amount is not all being used when it would benefit performance, they >>> may need to adjust work_mem down to avoid memory problems. That >>> doesn't seem to me to be a reason to reject the patch. >> >> I'm not entirely sure I understand this paragraph. What do you mean by >> "configured amount is not all being used when it would benefit >> performance"? Can you give an example? > > Again, the earlier post showed that while the unpatched used 3516kB > whether it had work_mem set to 4MB or 1GB, the patched version used > 3073kB when work_mem was set to 4MB and 4540kB when work_mem was > set to 1GB. The extra memory allowed the patched version to stay > at 1 batch, improving performance over the other setting. OK, so with 4MB the new version was batching, while the original code keeps a single batch (likely due to not counting buckets into work_mem). I believe that's expected behavior. Also, in the e-mail from Oct 2 that got lost [1], I pointed out that by testing only with small hash tables the results are likely influenced by high CPU cache hit ratio. I think benchmarking with larger inner relation (thus increasing the memory occupied by the hash table) might be interesting. [1] http://www.postgresql.org/message-id/c84680e818f6a0f4a26223cd750ff252.squirrel@2.emaily.eu > >> The only thing I can think of is the batching behavior described above. >> >>> This is in "Waiting on Author" status only because I never got an >>> answer about why the debug code used printf() rather the elog() at >>> a DEBUG level. Other than that, I would say this patch is Ready >>> for Committer. Tomas? You there? >> >> I think I responded to that on October 2, quoting: ... > Ah, I never received that email. 
That tends to happen every now > and then. :-( > >> I believe it's safe to switch the logging to elog(). IMHO the printf >> logging is there from some very early version of the code, before elog >> was introduced. Or something like that. > > I guess it might be best to remain consistent in this patch and > change that in a separate patch. I just wanted to make sure you > didn't see any reason not to do so. OK, I'll put that on my TODO list for the next CF. > With that addressed I will move this to Ready for Committer. Since > both Heikki and Robert spent time on this patch earlier, I'll give > either of them a shot at committing it if they want; otherwise I'll > do it. Great, thanks! Tomas
Re: bad estimation together with large work_mem generates terrible slow hash joins
From
Kevin Grittner
Date:
Kevin Grittner <kgrittn@ymail.com> wrote: > Since both Heikki and Robert spent time on this patch earlier, > I'll give either of them a shot at committing it if they want; > otherwise I'll do it. Done. Thanks, Tomas! -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company