Thread: Really bad blowups with hash outer join and nulls
This came up today on IRC, though I suspect the general problem has been
seen before:

create table m3 (id uuid, status integer);
create table q3 (id uuid);

insert into m3
  select uuid_generate_v4(), floor(random() * 4)::integer
    from generate_series(1,1000000);

insert into q3
  select id
    from (select id, row_number() over () as rnum from m3) s
   where (rnum % 4)=0;

insert into q3 select null from generate_series(1,250000);

-- at this point m3 has a million unique rows, while q3 has
-- a quarter of the ids of m3, plus as many nulls
-- note that work_mem is at the default 4MB

explain analyze
  select status, count(*)
    from q3 left join m3 on (m3.id=q3.id)
   group by status;

                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=51553.62..51553.67 rows=4 width=4) (actual time=259580.168..259580.179 rows=5 loops=1)
   Group Key: m3.status
   ->  Hash Right Join  (cost=12927.31..49596.89 rows=391347 width=4) (actual time=18804.502..258498.897 rows=500000 loops=1)
         Hash Cond: (m3.id = q3.id)
         ->  Seq Scan on m3  (cost=0.00..16753.10 rows=1038310 width=20) (actual time=0.011..1618.762 rows=1000000 loops=1)
         ->  Hash  (cost=6124.47..6124.47 rows=391347 width=16) (actual time=1742.340..1742.340 rows=500000 loops=1)
               Buckets: 131072 (originally 131072)  Batches: 65536 (originally 8)  Memory Usage: 8837kB
               ->  Seq Scan on q3  (cost=0.00..6124.47 rows=391347 width=16) (actual time=0.033..774.732 rows=500000 loops=1)
 Planning time: 0.178 ms
 Execution time: 259583.185 ms

The memory usage of this query runs to hundreds of megs despite the small
work_mem.

On 9.3.5, which the user was running, the problem was much more extreme
(this is with the default 1MB work_mem, and a data set only a quarter of
the size):

explain analyze
  select status, count(*)
    from q3 left join m3 on (m3.id=q3.id)
   group by status;

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=15020.11..15022.11 rows=200 width=4) (actual time=3733245.942..3733245.952 rows=5 loops=1)
   ->  Hash Right Join  (cost=3976.50..14395.11 rows=125000 width=4) (actual time=227870.000..3732686.692 rows=125000 loops=1)
         Hash Cond: (m3.id = q3.id)
         ->  Seq Scan on m3  (cost=0.00..4189.59 rows=259659 width=20) (actual time=0.024..807.162 rows=250000 loops=1)
         ->  Hash  (cost=1803.00..1803.00 rows=125000 width=16) (actual time=437.991..437.991 rows=125000 loops=1)
               Buckets: 4096  Batches: 262144 (originally 8)  Memory Usage: 1954kB
               ->  Seq Scan on q3  (cost=0.00..1803.00 rows=125000 width=16) (actual time=0.011..191.208 rows=125000 loops=1)
 Total runtime: 3733269.725 ms
(8 rows)

This query on 9.3.5 allocated over 2.7GB of RAM on my system.

Note the explosion in the number of batches. Tracing execution showed that
as the memory limit was reached while processing the large number of
nulls, the number of batches would be doubled, then only a tiny number of
rows could be written out as being no longer in the current batch, so the
limit was then quickly reached again.

A key part of the problem here may be the fact that nulls hash to exactly
0, and are therefore always part of the initial in-memory batch regardless
of how much splitting happens. If a large subset of the data had some
other constant value, it would likely have some non-zero bits in its hash
value and therefore stand a good chance of being written out to a batch
file before things get too desperate, avoiding the mass explosion.
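[To see why a hash value of exactly 0 is pinned in place, here is a
minimal, self-contained sketch of how a hash value is mapped to a bucket
and batch number, loosely modeled on ExecHashGetBucketAndBatch() in
nodeHash.c; the function name, constants and exact bit layout below are
illustrative, not a copy of the source. The low bits pick the bucket and
the next bits pick the batch, so a value of 0 lands in bucket 0 of batch 0
no matter how many times nbatch doubles.]

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Sketch of the bucket/batch mapping: nbuckets and nbatch are powers
     * of two, the low log2(nbuckets) bits choose the bucket, and the next
     * bits choose the batch.  A hash value of 0 therefore always maps to
     * bucket 0 of batch 0, however often the batch count is doubled.
     */
    static void
    get_bucket_and_batch(uint32_t hashvalue, int nbuckets, int log2_nbuckets,
                         int nbatch, int *bucketno, int *batchno)
    {
        *bucketno = (int) (hashvalue & (uint32_t) (nbuckets - 1));
        if (nbatch > 1)
            *batchno = (int) ((hashvalue >> log2_nbuckets) & (uint32_t) (nbatch - 1));
        else
            *batchno = 0;
    }

    int
    main(void)
    {
        int bucketno, batchno;

        /* NULL keys hash to 0, so they never move out of batch 0 */
        for (int nbatch = 8; nbatch <= 65536; nbatch *= 2)
        {
            get_bucket_and_batch(0, 131072, 17, nbatch, &bucketno, &batchno);
            printf("nbatch=%6d -> bucket %d, batch %d\n",
                   nbatch, bucketno, batchno);
        }
        return 0;
    }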
A quick test suggests that initializing the hash value to ~0 rather than 0
has a curious effect: the number of batches still explodes, but the
performance does not suffer the same way. (I think because almost all the
batches end up empty.)

I think this is worth doing even in the absence of a more general
solution; nulls are common enough and important enough that they shouldn't
be the worst-case value if it can be avoided.

(Testing this idea revealed that a regression test for rowsecurity is
missing an order by clause and changes result based on hash values)

-- 
Andrew (irc:RhodiumToad)
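[For readers not familiar with the executor code, the tweak being
described amounts to changing the seed used when the per-column hashes are
combined in ExecHashGetHashValue() in nodeHash.c. A simplified sketch of
that combining step follows; combine_hash_keys is an invented stand-alone
helper, and the rotate/mix step is paraphrased rather than copied.]

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Sketch of how a multi-column hash key is combined.  With the
     * current seed of 0, a key whose columns are all NULL ends up with
     * hash value 0; seeding with ~0 instead avoids that collision with
     * the "always batch 0, bucket 0" value.
     */
    static uint32_t
    combine_hash_keys(const uint32_t *key_hashes, const bool *key_isnull,
                      int nkeys)
    {
        uint32_t hashkey = ~(uint32_t) 0;   /* proposed seed; currently 0 */

        for (int i = 0; i < nkeys; i++)
        {
            /* rotate left one bit, then mix in this key's hash */
            hashkey = (hashkey << 1) | (hashkey >> 31);
            if (!key_isnull[i])
                hashkey ^= key_hashes[i];   /* a NULL key contributes nothing */
        }
        return hashkey;
    }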
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> A quick test suggests that initializing the hash value to ~0 rather than
> 0 has a curious effect: the number of batches still explodes, but the
> performance does not suffer the same way. (I think because almost all
> the batches end up empty.) I think this is worth doing even in the
> absence of a more general solution; nulls are common enough and
> important enough that they shouldn't be the worst-case value if it can
> be avoided.

I think that's unlikely to be a path to a good solution. At least part of
the problem here is that estimate_hash_bucketsize() supposes that nulls
can be ignored --- which is normally true, and invalidates your claim that
they're common. But in a RIGHT JOIN situation, they need to be considered
as if they were regular keys. That would probably be sufficient to
dissuade the planner from choosing a hash join in this example.

There may also be something we can do in the executor, but it would take
closer analysis to figure out what's going wrong. I don't think kluging
the behavior for NULL in particular is the answer.

			regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 >> A quick test suggests that initializing the hash value to ~0 rather
 >> than 0 has a curious effect: the number of batches still explodes,
 >> but the performance does not suffer the same way. (I think because
 >> almost all the batches end up empty.) I think this is worth doing
 >> even in the absence of a more general solution; nulls are common
 >> enough and important enough that they shouldn't be the worst-case
 >> value if it can be avoided.

 Tom> I think that's unlikely to be a path to a good solution.

It wasn't really intended to be.

 Tom> At least part of the problem here is that
 Tom> estimate_hash_bucketsize() supposes that nulls can be ignored ---
 Tom> which is normally true, and invalidates your claim that they're
 Tom> common. But in a RIGHT JOIN situation, they need to be considered
 Tom> as if they were regular keys. That would probably be sufficient
 Tom> to dissuade the planner from choosing a hash join in this example.

I've now tried the attached patch to correct the bucketsize estimates, and
it does indeed stop the planner from considering the offending path (in
this case it just does the join the other way round).

One thing I couldn't immediately see how to do was account for the case
where there are a lot of nulls in the table but a strict qual (or an IS
NOT NULL) filters them out; this patch will be overly pessimistic about
such cases. Do estimates normally try and take things like this into
account? I didn't find any other relevant examples.

 Tom> There may also be something we can do in the executor, but it
 Tom> would take closer analysis to figure out what's going wrong. I
 Tom> don't think kluging the behavior for NULL in particular is the
 Tom> answer.

The point with nulls is that a hash value of 0 is currently special in two
distinct ways: it's always in batch 0 and bucket 0 regardless of how many
batches and buckets there are, and it's the result of hashing a null.
These two special cases interact in a worst-case manner, so it seems
worthwhile to avoid that.

-- 
Andrew (irc:RhodiumToad)
Attachment
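[The attached patch is not reproduced in this archive. As a rough
illustration of the estimator-side idea only, not the selfuncs.c code or
the patch itself: when the join type keeps NULL keys in the hash table
(RIGHT/FULL), they all hash to the same value, so the null fraction from
pg_statistic can be treated like one more most-common value when deciding
how skewed the buckets will be. The function and parameter names below are
invented for this sketch.]

    #include <stdbool.h>

    /*
     * Hypothetical sketch: fold the null fraction into the expected
     * worst-case bucket fraction when NULL keys are kept.
     */
    static double
    sketch_bucket_fraction(double mcv_freq,      /* frequency of most common value */
                           double null_frac,     /* fraction of NULL join keys */
                           double even_spread,   /* roughly 1/ndistinct */
                           bool nulls_are_kept)  /* true for RIGHT/FULL hash joins */
    {
        double  worst = mcv_freq;

        if (nulls_are_kept && null_frac > worst)
            worst = null_frac;          /* the NULLs all pile into one bucket */

        /* never estimate better than a perfectly uniform spread of keys */
        return (worst > even_spread) ? worst : even_spread;
    }

[As noted above, a strict qual or IS NOT NULL that removes the nulls
before the join would make this kind of estimate overly pessimistic.]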
Hi,

On 16.2.2015 00:50, Andrew Gierth wrote:
>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> I've now tried the attached patch to correct the bucketsize
> estimates, and it does indeed stop the planner from considering the
> offending path (in this case it just does the join the other way
> round).
>
> One thing I couldn't immediately see how to do was account for the
> case where there are a lot of nulls in the table but a strict qual
> (or an IS NOT NULL) filters them out; this patch will be overly
> pessimistic about such cases. Do estimates normally try and take
> things like this into account? I didn't find any other relevant
> examples.

Improving the estimates is always good, but it's not going to fix the case
of non-NULL values (it shouldn't be all that difficult to create such
examples with a value whose hash starts with a bunch of zeroes).

> Tom> There may also be something we can do in the executor, but it
> Tom> would take closer analysis to figure out what's going wrong. I
> Tom> don't think kluging the behavior for NULL in particular is the
> Tom> answer.
>
> The point with nulls is that a hash value of 0 is currently special
> in two distinct ways: it's always in batch 0 and bucket 0 regardless
> of how many batches and buckets there are, and it's the result of
> hashing a null. These two special cases interact in a worst-case
> manner, so it seems worthwhile to avoid that.

I think you're right this is a flaw in general - all it takes is a
sufficiently common value with a hash value falling into the first batch
(i.e. either 0 or starting with a lot of 0 bits, thus giving batchno==0).

I think this might be solved by relaxing the check a bit. Currently we do
this (see nodeHash.c:735):

    if (nfreed == 0 || nfreed == ninmemory)
    {
        hashtable->growEnabled = false;
    }

which only disables nbatch growth if *all* the tuples remain in the batch
(or all of them move out). That's rather strict, and it takes a single
tuple to break this.

With each nbatch increase we expect a 50:50 split, i.e. 1/2 the tuples
staying in the batch, 1/2 moved to the new one. So a much higher ratio is
suspicious and most likely means the same hash value, so what if we did
something like this:

    if ((nfreed >= 0.9 * (nfreed + ninmemory)) ||
        (nfreed <= 0.1 * (nfreed + ninmemory)))
    {
        hashtable->growEnabled = false;
    }

which disables nbatch growth if either >= 90% of the tuples remained in
the first batch, or >= 90% of the tuples were moved from it. The exact
thresholds might be set a bit differently, but I think 10:90 sounds about
right.
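[As background for the check being discussed, here is a simplified,
self-contained sketch of the repartitioning step it lives in, loosely
modeled on ExecHashIncreaseNumBatches() in nodeHash.c. The struct and
helper names (HashTableSketch, TupleSketch, get_bucket_and_batch,
write_to_batch_file) are stand-ins invented here, and memory-chunk
handling, temp-file management and error paths are omitted. The point is
where nfreed and ninmemory come from: when the memory limit is hit, nbatch
is doubled and every in-memory tuple is rehashed, either staying put or
being written out to a later batch's temp file.]

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical, stripped-down stand-ins for the executor structures. */
    typedef struct TupleSketch
    {
        uint32_t            hashvalue;
        struct TupleSketch *next;
    } TupleSketch;

    typedef struct
    {
        int          nbatch;
        int          nbuckets;
        int          log2_nbuckets;
        int          curbatch;
        bool         growEnabled;
        TupleSketch *first_tuple;
    } HashTableSketch;

    /* same low-bits-bucket / next-bits-batch scheme as the earlier sketch */
    static void
    get_bucket_and_batch(const HashTableSketch *ht, uint32_t hashvalue,
                         int *bucketno, int *batchno)
    {
        *bucketno = (int) (hashvalue & (uint32_t) (ht->nbuckets - 1));
        *batchno = (ht->nbatch > 1)
            ? (int) ((hashvalue >> ht->log2_nbuckets) & (uint32_t) (ht->nbatch - 1))
            : 0;
    }

    /* stand-in for dumping a tuple to a per-batch temp file */
    static void
    write_to_batch_file(HashTableSketch *ht, int batchno, TupleSketch *tup)
    {
        (void) ht; (void) batchno; (void) tup;      /* omitted in this sketch */
    }

    static void
    increase_num_batches(HashTableSketch *ht)
    {
        long ninmemory = 0;     /* tuples examined during the split */
        long nfreed = 0;        /* tuples written out to later batches */

        if (!ht->growEnabled)
            return;

        ht->nbatch *= 2;        /* double the number of batches */

        for (TupleSketch *tup = ht->first_tuple; tup != NULL; tup = tup->next)
        {
            int bucketno, batchno;

            ninmemory++;
            get_bucket_and_batch(ht, tup->hashvalue, &bucketno, &batchno);

            if (batchno == ht->curbatch)
                continue;                           /* stays in memory */

            write_to_batch_file(ht, batchno, tup);  /* no longer our batch */
            nfreed++;
        }

        /*
         * The check under discussion: if the split moved nothing (or
         * everything), doubling again cannot help, so stop growing nbatch.
         */
        if (nfreed == 0 || nfreed == ninmemory)
            ht->growEnabled = false;
    }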
Trivial patch implementing this attached - with it the explain analyze
looks like this:

test=# explain analyze select status, count(*) from q3 left join m3 on
       (m3.id=q3.id) group by status;

                              QUERY PLAN
----------------------------------------------------------------------
 HashAggregate  (cost=64717.63..64717.67 rows=4 width=4)
                (actual time=514.619..514.630 rows=5 loops=1)
   Group Key: m3.status
   ->  Hash Right Join  (cost=18100.00..62217.63 rows=500000 width=4)
                        (actual time=75.260..467.911 rows=500108 loops=1)
         Hash Cond: (m3.id = q3.id)
         ->  Seq Scan on m3  (cost=0.00..18334.00 rows=1000000 width=37)
                             (actual time=0.003..91.799 rows=1000000 loops=1)
         ->  Hash  (cost=7943.00..7943.00 rows=500000 width=33)
                   (actual time=74.916..74.916 rows=500000 loops=1)
               Buckets: 65536 (originally 65536)
               Batches: 32 (originally 16)  Memory Usage: 8824kB
               ->  Seq Scan on q3  (cost=0.00..7943.00 rows=500000 width=33)
                                   (actual time=0.005..27.395 rows=500000 loops=1)
 Planning time: 0.172 ms
 Execution time: 514.663 ms
(10 rows)

Without the patch this runs in ~240 seconds and the number of batches
explodes to ~131k.

In theory it might happen that there are just a few hash values and all of
them are exactly the same within the first N bits (thus falling into the
first batch), but differ at bit N+1, so the next split would actually
help. But I think we can leave that up to probability, just like all the
hash code.

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
>>>>> "Tomas" == Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

 Tomas> Improving the estimates is always good, but it's not going to
 Tomas> fix the case of non-NULL values (it shouldn't be all that
 Tomas> difficult to create such examples with a value whose hash starts
 Tomas> with a bunch of zeroes).

Right now, I can't get it to plan such an example, because (a) if there
are no stats to work from then the planner makes fairly pessimistic
assumptions about hash bucket filling, and (b) if there _are_ stats to
work from, then a frequently-occurring non-null value shows up as an MCV
and the planner takes that into account to calculate bucketsize.

The problem could only be demonstrated for NULLs because the planner was
ignoring NULL for the purposes of estimating bucketsize, which is correct
for all join types except RIGHT and FULL (which, iirc, are more recent
additions to the hashjoin repertoire).

If you want to try testing it, you may find this useful:

select i, hashint8(i)
  from unnest(array[1474049294, -1779024306, -1329041947]) u(i);
      i      | hashint8 
-------------+----------
  1474049294 |        0
 -1779024306 |        0
 -1329041947 |        0
(3 rows)

(those are the only three int4 values that hash to exactly 0)

It's probably possible to construct pathological cases by finding a lot of
different values with zeros in the high bits of the hash, but that's
something that wouldn't be likely to happen by chance.

 Tomas> I think this might be solved by relaxing the check a bit.

Yeah, that looks potentially useful.

-- 
Andrew (irc:RhodiumToad)
On 16.2.2015 03:38, Andrew Gierth wrote:
>>>>>> "Tomas" == Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>
> Tomas> Improving the estimates is always good, but it's not going to
> Tomas> fix the case of non-NULL values (it shouldn't be all that
> Tomas> difficult to create such examples with a value whose hash starts
> Tomas> with a bunch of zeroes).
>
> Right now, I can't get it to plan such an example, because (a) if
> there are no stats to work from then the planner makes fairly
> pessimistic assumptions about hash bucket filling, and (b) if there
> _are_ stats to work from, then a frequently-occurring non-null value
> shows up as an MCV and the planner takes that into account to
> calculate bucketsize.
>
> The problem could only be demonstrated for NULLs because the planner
> was ignoring NULL for the purposes of estimating bucketsize, which is
> correct for all join types except RIGHT and FULL (which, iirc, are
> more recent additions to the hashjoin repertoire).

Oh, right, the estimate fix is probably sufficient then.

> If you want to try testing it, you may find this useful:
>
> select i, hashint8(i)
>   from unnest(array[1474049294, -1779024306, -1329041947]) u(i);
>       i      | hashint8 
> -------------+----------
>   1474049294 |        0
>  -1779024306 |        0
>  -1329041947 |        0
> (3 rows)
>
> (those are the only three int4 values that hash to exactly 0)
>
> It's probably possible to construct pathological cases by finding a
> lot of different values with zeros in the high bits of the hash, but
> that's something that wouldn't be likely to happen by chance.

Yeah, it's probably possible, but it's admittedly considerably harder than
I initially thought. For example it could be possible to create the table
with no MCV values but sorted so that all the initial values have
hashvalue=0, triggering (growEnabled=false). But that's rather unlikely to
happen in practice I guess.

A more likely failure scenario is a hash join higher up the plan,
processing results of other joins etc. In that case the estimates will be
tricky, although the planner chooses quite pessimistic defaults in those
cases.

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/15/15 7:16 PM, Tomas Vondra wrote:
> Hi,
>
> On 16.2.2015 00:50, Andrew Gierth wrote:
>>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>>
>> I've now tried the attached patch to correct the bucketsize
>> estimates, and it does indeed stop the planner from considering the
>> offending path (in this case it just does the join the other way
>> round).
>>
>> One thing I couldn't immediately see how to do was account for the
>> case where there are a lot of nulls in the table but a strict qual
>> (or an IS NOT NULL) filters them out; this patch will be overly
>> pessimistic about such cases. Do estimates normally try and take
>> things like this into account? I didn't find any other relevant
>> examples.
>
> Improving the estimates is always good, but it's not going to fix the
> case of non-NULL values (it shouldn't be all that difficult to create
> such examples with a value whose hash starts with a bunch of zeroes).
>
>> Tom> There may also be something we can do in the executor, but it
>> Tom> would take closer analysis to figure out what's going wrong. I
>> Tom> don't think kluging the behavior for NULL in particular is the
>> Tom> answer.
>>
>> The point with nulls is that a hash value of 0 is currently special
>> in two distinct ways: it's always in batch 0 and bucket 0 regardless
>> of how many batches and buckets there are, and it's the result of
>> hashing a null. These two special cases interact in a worst-case
>> manner, so it seems worthwhile to avoid that.
>
> I think you're right this is a flaw in general - all it takes is a
> sufficiently common value with a hash value falling into the first
> batch (i.e. either 0 or starting with a lot of 0 bits, thus giving
> batchno==0).
>
> I think this might be solved by relaxing the check a bit. Currently we
> do this (see nodeHash.c:735):
>
>     if (nfreed == 0 || nfreed == ninmemory)
>     {
>         hashtable->growEnabled = false;
>     }
>
> which only disables nbatch growth if *all* the tuples remain in the
> batch (or all of them move out). That's rather strict, and it takes a
> single tuple to break this.
>
> With each nbatch increase we expect a 50:50 split, i.e. 1/2 the tuples
> staying in the batch, 1/2 moved to the new one. So a much higher ratio
> is suspicious and most likely means the same hash value, so what if we
> did something like this:
>
>     if ((nfreed >= 0.9 * (nfreed + ninmemory)) ||
>         (nfreed <= 0.1 * (nfreed + ninmemory)))
>     {
>         hashtable->growEnabled = false;
>     }
>
> which disables nbatch growth if either >= 90% of the tuples remained in
> the first batch, or >= 90% of the tuples were moved from it. The exact
> thresholds might be set a bit differently, but I think 10:90 sounds
> about right.
> Trivial patch implementing this attached - with it the explain analyze
> looks like this:
>
> test=# explain analyze select status, count(*) from q3 left join m3 on
>        (m3.id=q3.id) group by status;
>
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  HashAggregate  (cost=64717.63..64717.67 rows=4 width=4)
>                 (actual time=514.619..514.630 rows=5 loops=1)
>    Group Key: m3.status
>    ->  Hash Right Join  (cost=18100.00..62217.63 rows=500000 width=4)
>                         (actual time=75.260..467.911 rows=500108 loops=1)
>          Hash Cond: (m3.id = q3.id)
>          ->  Seq Scan on m3  (cost=0.00..18334.00 rows=1000000 width=37)
>                              (actual time=0.003..91.799 rows=1000000 loops=1)
>          ->  Hash  (cost=7943.00..7943.00 rows=500000 width=33)
>                    (actual time=74.916..74.916 rows=500000 loops=1)
>                Buckets: 65536 (originally 65536)
>                Batches: 32 (originally 16)  Memory Usage: 8824kB
>                ->  Seq Scan on q3  (cost=0.00..7943.00 rows=500000 width=33)
>                                    (actual time=0.005..27.395 rows=500000 loops=1)
>  Planning time: 0.172 ms
>  Execution time: 514.663 ms
> (10 rows)
>
> Without the patch this runs in ~240 seconds and the number of batches
> explodes to ~131k.
>
> In theory it might happen that there are just a few hash values and all
> of them are exactly the same within the first N bits (thus falling into
> the first batch), but differ at bit N+1, so the next split would
> actually help. But I think we can leave that up to probability, just
> like all the hash code.

Anything happen with this, or the patch Andrew posted?

-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
>>>>> "Jim" == Jim Nasby <Jim.Nasby@BlueTreble.com> writes:

 Jim> Anything happen with this, or the patch Andrew posted?

No. And my attention has just been drawn to this, which looks like the
same issue:

http://www.postgresql.org/message-id/52B47B47-0926-4E15-B25E-212DF52FE695@oseberg.io

-- 
Andrew (irc:RhodiumToad)