
From: Andrew Gierth
Subject: Really bad blowups with hash outer join and nulls
Msg-id: 87y4nzlmeq.fsf@news-spur.riddles.org.uk
List: pgsql-hackers
This came up today on IRC, though I suspect the general problem has been
seen before:

create table m3 (id uuid, status integer);
create table q3 (id uuid);
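-- note: uuid_generate_v4() comes from the uuid-ossp extension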
insert into m3 select uuid_generate_v4(), floor(random() * 4)::integer
  from generate_series(1,1000000);
insert into q3 select id
  from (select id, row_number() over () as rnum from m3) s
 where (rnum % 4)=0;
insert into q3 select null from generate_series(1,250000);

-- at this point m3 has a million unique rows, while q3 has
-- a quarter of the ids of m3, plus as many nulls

-- note that work_mem is at the default 4MB

explain analyze select status, count(*) from q3 left join m3 on (m3.id=q3.id) group by status;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=51553.62..51553.67 rows=4 width=4) (actual time=259580.168..259580.179 rows=5 loops=1)
   Group Key: m3.status
   ->  Hash Right Join  (cost=12927.31..49596.89 rows=391347 width=4) (actual time=18804.502..258498.897 rows=500000 loops=1)
         Hash Cond: (m3.id = q3.id)
         ->  Seq Scan on m3  (cost=0.00..16753.10 rows=1038310 width=20) (actual time=0.011..1618.762 rows=1000000 loops=1)
         ->  Hash  (cost=6124.47..6124.47 rows=391347 width=16) (actual time=1742.340..1742.340 rows=500000 loops=1)
               Buckets: 131072 (originally 131072)  Batches: 65536 (originally 8)  Memory Usage: 8837kB
               ->  Seq Scan on q3  (cost=0.00..6124.47 rows=391347 width=16) (actual time=0.033..774.732 rows=500000 loops=1)
 Planning time: 0.178 ms
 Execution time: 259583.185 ms

The memory usage of this query runs to hundreds of megs despite the
small work_mem.

On 9.3.5, which the user was running, the problem was much more extreme
(this is with the default 1MB work_mem, and a data set only a quarter of
the size):

explain analyze select status, count(*) from q3 left join m3 on (m3.id=q3.id) group by status;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=15020.11..15022.11 rows=200 width=4) (actual time=3733245.942..3733245.952 rows=5 loops=1)
   ->  Hash Right Join  (cost=3976.50..14395.11 rows=125000 width=4) (actual time=227870.000..3732686.692 rows=125000 loops=1)
         Hash Cond: (m3.id = q3.id)
         ->  Seq Scan on m3  (cost=0.00..4189.59 rows=259659 width=20) (actual time=0.024..807.162 rows=250000 loops=1)
         ->  Hash  (cost=1803.00..1803.00 rows=125000 width=16) (actual time=437.991..437.991 rows=125000 loops=1)
               Buckets: 4096  Batches: 262144 (originally 8)  Memory Usage: 1954kB
               ->  Seq Scan on q3  (cost=0.00..1803.00 rows=125000 width=16) (actual time=0.011..191.208 rows=125000 loops=1)
 Total runtime: 3733269.725 ms
(8 rows)

This query on 9.3.5 allocated over 2.7GB of RAM on my system.

Note the explosion in the number of batches. Tracing execution showed
that, as the memory limit was reached while processing the large number
of nulls, the number of batches would be doubled, but only a tiny
number of rows could then be written out as no longer belonging to the
current batch, so the limit was quickly reached again.

A key part of the problem here may be the fact that nulls hash to
exactly 0, and are therefore always part of the initial in-memory batch
regardless of how much splitting happens. If a large subset of the data
had some other constant value, it would likely have some non-zero bits
in its hash value and therefore stand a good chance of being written out
to a batch file before things get too desperate, avoiding the mass
explosion.
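
To make the arithmetic concrete, here is a small standalone sketch of
the bucket/batch split as I understand it from nodeHash.c (the function,
names and constants below are mine, not the actual PostgreSQL code): the
batch number is taken from the hash bits above the bucket bits, so a
tuple whose hash value is exactly 0 is assigned to batch 0 no matter how
many times the batch count doubles, and can never be written out of the
in-memory batch.

#include <stdio.h>
#include <stdint.h>

/*
 * Simplified model of how the hash join maps a 32-bit hash value to a
 * bucket and a batch (loosely based on nodeHash.c; illustrative only).
 * nbuckets and nbatch are powers of two, so modulo is a mask and
 * division is a shift.
 */
static void
bucket_and_batch(uint32_t hashvalue, uint32_t nbuckets, int log2_nbuckets,
                 uint32_t nbatch, uint32_t *bucketno, uint32_t *batchno)
{
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (nbatch > 1) ? ((hashvalue >> log2_nbuckets) & (nbatch - 1)) : 0;
}

int
main(void)
{
    uint32_t bucket, batch;

    /* hash value 0 (the null-key case): batch 0 on every iteration, even
     * as the batch count doubles all the way to 65536 as in the plan above */
    for (uint32_t nbatch = 8; nbatch <= 65536; nbatch *= 2)
    {
        bucket_and_batch(0, 131072, 17, nbatch, &bucket, &batch);
        printf("hash=0x00000000  nbatch=%6u  ->  batch %u\n",
               (unsigned) nbatch, (unsigned) batch);
    }
    return 0;
}

(131072 buckets / log2 = 17 are taken from the plan above; the printed
batch number is 0 on every iteration.)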

A quick test suggests that initializing the hash value to ~0 rather than
0 has a curious effect: the number of batches still explodes, but the
performance does not suffer the same way. (I think because almost all
the batches end up empty.) I think this is worth doing even in the
absence of a more general solution; nulls are common enough and
important enough that they shouldn't be the worst-case value if it can
be avoided.
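
Under the same toy model, seeding the hash with ~0 moves the null tuples
out of batch 0 as soon as there is more than one batch, so they can be
dumped to a batch file instead of staying pinned in memory; I assume
that is roughly why the still-exploding batch count stops hurting. The
sketch below is mine, not the actual change:

#include <stdio.h>
#include <stdint.h>

/* Same simplified batch computation as in the previous sketch. */
static uint32_t
batch_of(uint32_t hashvalue, int log2_nbuckets, uint32_t nbatch)
{
    return (nbatch > 1) ? ((hashvalue >> log2_nbuckets) & (nbatch - 1)) : 0;
}

int
main(void)
{
    /* With the hash seeded to ~0 (all bits set), the batch number is
     * nonzero for every nbatch > 1, so the nulls are written out to a
     * batch file the first time the table spills, unlike hash value 0. */
    for (uint32_t nbatch = 8; nbatch <= 65536; nbatch *= 2)
        printf("hash=0xffffffff  nbatch=%6u  ->  batch %u\n",
               (unsigned) nbatch,
               (unsigned) batch_of(0xFFFFFFFFu, 17, nbatch));
    return 0;
}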

(Testing this idea revealed that a regression test for rowsecurity is
missing an ORDER BY clause, so its result changes depending on hash values.)

-- 
Andrew (irc:RhodiumToad)


