Re: Really bad blowups with hash outer join and nulls - Mailing list pgsql-hackers

From Jim Nasby
Subject Re: Really bad blowups with hash outer join and nulls
Date
Msg-id 5524B395.4010303@BlueTreble.com
In response to Re: Really bad blowups with hash outer join and nulls  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Really bad blowups with hash outer join and nulls
List pgsql-hackers
On 2/15/15 7:16 PM, Tomas Vondra wrote:
> Hi,
>
> On 16.2.2015 00:50, Andrew Gierth wrote:
>>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>>
>> I've now tried the attached patch to correct the bucketsize
>> estimates, and it does indeed stop the planner from considering the
>> offending path (in this case it just does the join the other way
>> round).
>>
>> One thing I couldn't immediately see how to do was account for the
>> case where there are a lot of nulls in the table but a strict qual
>> (or an IS NOT NULL) filters them out; this patch will be overly
>> pessimistic about such cases. Do estimates normally try to take
>> things like this into account? I didn't find any other relevant
>> examples.
>
> Improving the estimates is always good, but it's not going to fix the
> case of non-NULL values (it shouldn't be all that difficult to create
> such examples with a value whose hash starts with a bunch of zeroes).
>
>>   Tom> There may also be something we can do in the executor, but it
>>   Tom> would take closer analysis to figure out what's going wrong.  I
>>   Tom> don't think kluging the behavior for NULL in particular is the
>>   Tom> answer.
>>
>> The point with nulls is that a hash value of 0 is currently special
>> in two distinct ways: it's always in batch 0 and bucket 0 regardless
>> of how many batches and buckets there are, and it's the result of
>> hashing a null. These two special cases interact in a worst-case
>> manner, so it seems worthwhile to avoid that.
>
> I think you're right this is a flaw in general - all it takes is a
> sufficiently common value with a hash value falling into the first batch
> (i.e. either 0 or starting with a lot of 0 bits, thus giving batchno == 0).
>
> I think this might be solved by relaxing the check a bit. Currently we
> do this (see nodeHash.c:735):
>
>      if (nfreed == 0 || nfreed == ninmemory)
>      {
>          hashtable->growEnabled = false;
>      }
>
> which only disables nbatch growth if *all* the tuples remain in the
> batch (or all of them are moved out). That's rather strict, and a single
> tuple on the wrong side is enough to break it.
>
> With each nbatch increase we expect a 50:50 split, i.e. half the tuples
> staying in the batch and half moving to the new one. A much more skewed
> ratio is suspicious and most likely means many tuples share the same
> hash value, so what if we did something like this:
>
>      if ((nfreed >= 0.9 * (nfreed + ninmemory)) ||
>          (nfreed <= 0.1 * (nfreed + ninmemory)))
>      {
>          hashtable->growEnabled = false;
>      }
>
> which disables nbatch growth if either >= 90% of the tuples were moved
> out of the first batch, or >= 90% of them remained in it. The exact
> thresholds might be set a bit differently, but I think 10:90 sounds
> about right.
>
> Trivial patch implementing this attached - with it the explain analyze
> looks like this:
>
> test=# explain analyze select status, count(*) from q3 left join m3 on
> (m3.id=q3.id) group by status;
>
>                                 QUERY PLAN
> ----------------------------------------------------------------------
>   HashAggregate  (cost=64717.63..64717.67 rows=4 width=4) (actual
> time=514.619..514.630 rows=5 loops=1)
>     Group Key: m3.status
>     ->  Hash Right Join  (cost=18100.00..62217.63 rows=500000 width=4)
>                      (actual time=75.260..467.911 rows=500108 loops=1)
>           Hash Cond: (m3.id = q3.id)
>           ->  Seq Scan on m3  (cost=0.00..18334.00 rows=1000000 width=37)
>                          (actual time=0.003..91.799 rows=1000000 loops=1)
>           ->  Hash  (cost=7943.00..7943.00 rows=500000 width=33)
>                     (actual time=74.916..74.916 rows=500000 loops=1)
>                 Buckets: 65536 (originally 65536)
>                 Batches: 32 (originally 16)  Memory Usage: 8824kB
>                 ->  Seq Scan on q3
>                     (cost=0.00..7943.00 rows=500000 width=33)
>                     (actual time=0.005..27.395 rows=500000 loops=1)
>   Planning time: 0.172 ms
>   Execution time: 514.663 ms
> (10 rows)
>
> Without the patch this runs in ~240 seconds and the number of batches
> explodes to ~131k.
>
> In theory it might happen that there are just a few hash values, all of
> them exactly the same within the first N bits (thus falling into the
> first batch) but different in the following bits, so the next split
> would actually help. But I think we can leave that up to probability,
> just like the rest of the hash code.

Anything happen with this, or the patch Andrew posted?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


