Really bad blowups with hash outer join and nulls - Mailing list pgsql-hackers
From | Andrew Gierth
---|---
Subject | Really bad blowups with hash outer join and nulls
Date |
Msg-id | 87y4nzlmeq.fsf@news-spur.riddles.org.uk
Responses | Re: Really bad blowups with hash outer join and nulls
List | pgsql-hackers
This came up today on IRC, though I suspect the general problem has been
seen before:

create table m3 (id uuid, status integer);
create table q3 (id uuid);

insert into m3
  select uuid_generate_v4(), floor(random() * 4)::integer
    from generate_series(1,1000000);

insert into q3
  select id from (select id, row_number() over () as rnum from m3) s
   where (rnum % 4)=0;

insert into q3 select null from generate_series(1,250000);

-- at this point m3 has a million unique rows, while q3 has
-- a quarter of the ids of m3, plus as many nulls
-- note that work_mem is at the default 4MB

explain analyze select status, count(*) from q3 left join m3 on (m3.id=q3.id) group by status;

                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=51553.62..51553.67 rows=4 width=4) (actual time=259580.168..259580.179 rows=5 loops=1)
   Group Key: m3.status
   ->  Hash Right Join  (cost=12927.31..49596.89 rows=391347 width=4) (actual time=18804.502..258498.897 rows=500000 loops=1)
         Hash Cond: (m3.id = q3.id)
         ->  Seq Scan on m3  (cost=0.00..16753.10 rows=1038310 width=20) (actual time=0.011..1618.762 rows=1000000 loops=1)
         ->  Hash  (cost=6124.47..6124.47 rows=391347 width=16) (actual time=1742.340..1742.340 rows=500000 loops=1)
               Buckets: 131072 (originally 131072)  Batches: 65536 (originally 8)  Memory Usage: 8837kB
               ->  Seq Scan on q3  (cost=0.00..6124.47 rows=391347 width=16) (actual time=0.033..774.732 rows=500000 loops=1)
 Planning time: 0.178 ms
 Execution time: 259583.185 ms

The memory usage of this query runs to hundreds of megs despite the
small work_mem.

On 9.3.5, which the user was running, the problem was much more extreme
(this is with the default 1MB work_mem, and a data set only a quarter of
the size):

explain analyze select status, count(*) from q3 left join m3 on (m3.id=q3.id) group by status;

                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=15020.11..15022.11 rows=200 width=4) (actual time=3733245.942..3733245.952 rows=5 loops=1)
   ->  Hash Right Join  (cost=3976.50..14395.11 rows=125000 width=4) (actual time=227870.000..3732686.692 rows=125000 loops=1)
         Hash Cond: (m3.id = q3.id)
         ->  Seq Scan on m3  (cost=0.00..4189.59 rows=259659 width=20) (actual time=0.024..807.162 rows=250000 loops=1)
         ->  Hash  (cost=1803.00..1803.00 rows=125000 width=16) (actual time=437.991..437.991 rows=125000 loops=1)
               Buckets: 4096  Batches: 262144 (originally 8)  Memory Usage: 1954kB
               ->  Seq Scan on q3  (cost=0.00..1803.00 rows=125000 width=16) (actual time=0.011..191.208 rows=125000 loops=1)
 Total runtime: 3733269.725 ms
(8 rows)

This query on 9.3.5 allocated over 2.7GB of RAM on my system.

Note the explosion in the number of batches. Tracing execution showed
that as the memory limit was reached while processing the large number
of nulls, the number of batches would be doubled; then only a tiny
number of rows could be written out as being no longer in the current
batch, so the limit was quickly reached again.

A key part of the problem here may be the fact that nulls hash to
exactly 0, and are therefore always part of the initial in-memory batch
regardless of how much splitting happens. If a large subset of the data
had some other constant value, it would likely have some non-zero bits
in its hash value and therefore stand a good chance of being written out
to a batch file before things get too desperate, avoiding the mass
explosion.
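To make the batching behaviour concrete, here is a standalone sketch
modeled loosely on the executor's bucket/batch mapping (in the style of
ExecHashGetBucketAndBatch; the function and parameter names here are
illustrative, not the real source). Buckets come from the low bits of
the hash value and batches from the bits above them, so a hash value of
0 yields batch 0 no matter how many times the batch count doubles:

#include <stdio.h>
#include <stdint.h>

/* Split a hash value into a bucket number and a batch number,
 * assuming nbuckets and nbatch are powers of two. */
static void
get_bucket_and_batch(uint32_t hashvalue,
                     uint32_t nbuckets, int log2_nbuckets,
                     uint32_t nbatch,
                     uint32_t *bucketno, uint32_t *batchno)
{
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (nbatch > 1)
        ? (hashvalue >> log2_nbuckets) & (nbatch - 1)
        : 0;
}

int
main(void)
{
    uint32_t nbuckets = 131072;     /* 2^17, as in the plan above */
    int      log2_nbuckets = 17;
    uint32_t bucketno, batchno;

    /* A hash value of 0 (what a null key gets) stays in batch 0 as
     * nbatch doubles, so those tuples are never written out to a
     * batch file and the in-memory batch never shrinks. */
    for (uint32_t nbatch = 8; nbatch <= 65536; nbatch *= 2)
    {
        get_bucket_and_batch(0, nbuckets, log2_nbuckets, nbatch,
                             &bucketno, &batchno);
        printf("nbatch=%6u -> batchno=%u\n",
               (unsigned) nbatch, (unsigned) batchno);
    }
    return 0;
}

Running this prints batchno=0 for every nbatch, which is exactly the
pinned-in-memory behaviour described above.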
A quick test suggests that initializing the hash value to ~0 rather than
0 has a curious effect: the number of batches still explodes, but the
performance does not suffer the same way. (I think this is because
almost all the batches end up empty.) A sketch of what that change
amounts to follows below.

I think this is worth doing even in the absence of a more general
solution; nulls are common enough and important enough that they
shouldn't be the worst-case value if it can be avoided.

(Testing this idea revealed that a regression test for rowsecurity is
missing an order by clause, so its results change depending on hash
values.)
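For concreteness, here is a simplified standalone sketch, not a patch
against the actual ExecHashGetHashValue; the helper types and names are
invented for illustration. With the seed changed from 0 to ~0, a row
whose key columns are all null hashes to 0xFFFFFFFF instead of 0, and
that value maps to the current highest-numbered batch, which changes
every time the batch count doubles, so the null rows can actually be
spilled:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-column hash function type, for illustration only. */
typedef uint32_t (*key_hash_fn) (const void *keyval);

static uint32_t
row_hash(const void **keyvals, const bool *isnull,
         key_hash_fn *hashfns, int nkeys)
{
    uint32_t hashkey = ~(uint32_t) 0;   /* was 0 before the tweak */

    for (int i = 0; i < nkeys; i++)
    {
        /* rotate hashkey left 1 bit for each key column */
        hashkey = (hashkey << 1) | (hashkey >> 31);

        /* nulls contribute nothing to the hash */
        if (!isnull[i])
            hashkey ^= hashfns[i](keyvals[i]);
    }
    return hashkey;     /* all-null key => 0xFFFFFFFF with this seed */
}

--
Andrew (irc:RhodiumToad)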