Re: Really bad blowups with hash outer join and nulls - Mailing list pgsql-hackers

From:           Jim Nasby
Subject:        Re: Really bad blowups with hash outer join and nulls
Msg-id:         5524B395.4010303@BlueTreble.com
In response to: Re: Really bad blowups with hash outer join and nulls
                (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses:      Re: Really bad blowups with hash outer join and nulls
List:           pgsql-hackers
On 2/15/15 7:16 PM, Tomas Vondra wrote:
> Hi,
>
> On 16.2.2015 00:50, Andrew Gierth wrote:
>>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>>
>> I've now tried the attached patch to correct the bucketsize
>> estimates, and it does indeed stop the planner from considering the
>> offending path (in this case it just does the join the other way
>> round).
>>
>> One thing I couldn't immediately see how to do was account for the
>> case where there are a lot of nulls in the table but a strict qual
>> (or an IS NOT NULL) filters them out; this patch will be overly
>> pessimistic about such cases. Do estimates normally try to take
>> things like this into account? I didn't find any other relevant
>> examples.
>
> Improving the estimates is always good, but it's not going to fix the
> case of non-NULL values (it shouldn't be all that difficult to create
> such examples with a value whose hash starts with a bunch of zeroes).
>
>> Tom> There may also be something we can do in the executor, but it
>> Tom> would take closer analysis to figure out what's going wrong. I
>> Tom> don't think kluging the behavior for NULL in particular is the
>> Tom> answer.
>>
>> The point with nulls is that a hash value of 0 is currently special
>> in two distinct ways: it's always in batch 0 and bucket 0 regardless
>> of how many batches and buckets there are, and it's the result of
>> hashing a null. These two special cases interact in a worst-case
>> manner, so it seems worthwhile to avoid that.
>
> I think you're right that this is a flaw in general - all it takes is
> a sufficiently common value with a hash falling into the first batch
> (i.e. either 0, or starting with enough 0 bits to give batchno == 0).
>
> I think this might be solved by relaxing the check a bit. Currently we
> do this (see nodeHash.c:735):
>
>     if (nfreed == 0 || nfreed == ninmemory)
>     {
>         hashtable->growEnabled = false;
>     }
>
> which disables nbatch growth only if *all* the tuples stay in the
> batch (or all of them move out). That's rather strict - a single tuple
> landing on the other side of the split is enough to keep growth
> enabled.
>
> With each nbatch increase we expect a 50:50 split, i.e. half the
> tuples staying in the batch and half moving to the new one. A much
> more skewed ratio is suspicious and most likely means the tuples share
> the same hash value, so what if we did something like this:
>
>     if ((nfreed >= 0.9 * (nfreed + ninmemory)) ||
>         (nfreed <= 0.1 * (nfreed + ninmemory)))
>     {
>         hashtable->growEnabled = false;
>     }
>
> which disables nbatch growth if either >= 90% of the tuples remained
> in the first batch, or >= 90% of the tuples were moved from it. The
> exact thresholds might be set a bit differently, but I think 10:90
> sounds about right.
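To see concretely why batch 0 keeps splitting when it is dominated by
nulls, here is a minimal standalone sketch of how a hash value is split
into bucket and batch numbers, modeled on ExecHashGetBucketAndBatch()
in nodeHash.c. The helper name get_bucket_and_batch and the standalone
framing are illustrative, not from the thread; the 65536 buckets and
the 16 -> 131072 batch range are taken from the plans discussed here.

    #include <stdio.h>
    #include <stdint.h>

    /*
     * Sketch of the hash-value-to-(bucket, batch) mapping used by the
     * hash join code: the low log2(nbuckets) bits select the bucket,
     * and the next bits up select the batch. Doubling nbatch consumes
     * one more bit, which is why each split is expected to move
     * roughly half the tuples of a well-distributed batch.
     */
    static void
    get_bucket_and_batch(uint32_t hashvalue,
                         uint32_t nbuckets, uint32_t nbatch,
                         int log2_nbuckets,
                         int *bucketno, int *batchno)
    {
        *bucketno = (int) (hashvalue & (nbuckets - 1));
        *batchno = (nbatch > 1)
            ? (int) ((hashvalue >> log2_nbuckets) & (nbatch - 1))
            : 0;
    }

    int
    main(void)
    {
        uint32_t nbatch;
        int bucketno, batchno;

        /*
         * A null hashes to 0, so such tuples always land in bucket 0
         * of batch 0 and never move when nbatch doubles. As long as
         * even one other tuple does move out, nfreed is neither 0 nor
         * ninmemory, so the strict check never fires and batch 0 keeps
         * splitting.
         */
        for (nbatch = 16; nbatch <= 131072; nbatch <<= 1)
        {
            get_bucket_and_batch(0, 65536, nbatch, 16,
                                 &bucketno, &batchno);
            printf("nbatch = %6u -> bucket %d, batch %d\n",
                   (unsigned) nbatch, bucketno, batchno);
        }
        return 0;
    }

The same mapping shows the non-NULL failure mode Tomas mentions above:
any sufficiently common value whose hash is zero in the bits above
log2_nbuckets behaves exactly the same way.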
> Trivial patch implementing this attached - with it the EXPLAIN ANALYZE
> output looks like this:
>
> test=# explain analyze select status, count(*) from q3 left join m3
>        on (m3.id = q3.id) group by status;
>
>                              QUERY PLAN
> ----------------------------------------------------------------------
>  HashAggregate  (cost=64717.63..64717.67 rows=4 width=4)
>    (actual time=514.619..514.630 rows=5 loops=1)
>    Group Key: m3.status
>    ->  Hash Right Join  (cost=18100.00..62217.63 rows=500000 width=4)
>          (actual time=75.260..467.911 rows=500108 loops=1)
>          Hash Cond: (m3.id = q3.id)
>          ->  Seq Scan on m3
>                (cost=0.00..18334.00 rows=1000000 width=37)
>                (actual time=0.003..91.799 rows=1000000 loops=1)
>          ->  Hash  (cost=7943.00..7943.00 rows=500000 width=33)
>                (actual time=74.916..74.916 rows=500000 loops=1)
>                Buckets: 65536 (originally 65536)
>                Batches: 32 (originally 16)  Memory Usage: 8824kB
>                ->  Seq Scan on q3
>                      (cost=0.00..7943.00 rows=500000 width=33)
>                      (actual time=0.005..27.395 rows=500000 loops=1)
>  Planning time: 0.172 ms
>  Execution time: 514.663 ms
> (10 rows)
>
> Without the patch this runs in ~240 seconds and the number of batches
> explodes to ~131k.
>
> In theory it might happen that there are just a few hash values, all
> of them exactly the same within the first N bits (thus falling into
> the first batch) but differing in bit N+1, so that the next split
> would actually help. But I think we can leave that up to probability,
> just like all the hash code.

Anything happen with this, or the patch Andrew posted?

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble!  http://BlueTreble.com