Thread: Really bad blowups with hash outer join and nulls
This came up today on IRC, though I suspect the general problem has been
seen before:

create table m3 (id uuid, status integer);
create table q3 (id uuid);

insert into m3
  select uuid_generate_v4(), floor(random() * 4)::integer
    from generate_series(1,1000000);

insert into q3
  select id
    from (select id, row_number() over () as rnum from m3) s
   where (rnum % 4)=0;

insert into q3 select null from generate_series(1,250000);

-- at this point m3 has a million unique rows, while q3 has
-- a quarter of the ids of m3, plus as many nulls
-- note that work_mem is at the default 4MB

explain analyze
  select status, count(*)
    from q3 left join m3 on (m3.id=q3.id)
   group by status;

                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=51553.62..51553.67 rows=4 width=4) (actual time=259580.168..259580.179 rows=5 loops=1)
   Group Key: m3.status
   ->  Hash Right Join  (cost=12927.31..49596.89 rows=391347 width=4) (actual time=18804.502..258498.897 rows=500000 loops=1)
         Hash Cond: (m3.id = q3.id)
         ->  Seq Scan on m3  (cost=0.00..16753.10 rows=1038310 width=20) (actual time=0.011..1618.762 rows=1000000 loops=1)
         ->  Hash  (cost=6124.47..6124.47 rows=391347 width=16) (actual time=1742.340..1742.340 rows=500000 loops=1)
               Buckets: 131072 (originally 131072)  Batches: 65536 (originally 8)  Memory Usage: 8837kB
               ->  Seq Scan on q3  (cost=0.00..6124.47 rows=391347 width=16) (actual time=0.033..774.732 rows=500000 loops=1)
 Planning time: 0.178 ms
 Execution time: 259583.185 ms

The memory usage of this query runs to hundreds of megs despite the small
work_mem.

On 9.3.5, which the user was running, the problem was much more extreme
(this is with the default 1MB work_mem, and a data set only a quarter of
the size):

explain analyze
  select status, count(*)
    from q3 left join m3 on (m3.id=q3.id)
   group by status;

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=15020.11..15022.11 rows=200 width=4) (actual time=3733245.942..3733245.952 rows=5 loops=1)
   ->  Hash Right Join  (cost=3976.50..14395.11 rows=125000 width=4) (actual time=227870.000..3732686.692 rows=125000 loops=1)
         Hash Cond: (m3.id = q3.id)
         ->  Seq Scan on m3  (cost=0.00..4189.59 rows=259659 width=20) (actual time=0.024..807.162 rows=250000 loops=1)
         ->  Hash  (cost=1803.00..1803.00 rows=125000 width=16) (actual time=437.991..437.991 rows=125000 loops=1)
               Buckets: 4096  Batches: 262144 (originally 8)  Memory Usage: 1954kB
               ->  Seq Scan on q3  (cost=0.00..1803.00 rows=125000 width=16) (actual time=0.011..191.208 rows=125000 loops=1)
 Total runtime: 3733269.725 ms
(8 rows)

This query on 9.3.5 allocated over 2.7GB of RAM on my system.

Note the explosion in the number of batches. Tracing execution showed that
as the memory limit was reached while processing the large number of
nulls, the number of batches would be doubled, then only a tiny number of
rows could be written out as being no longer in the current batch, so the
limit was then quickly reached again.

A key part of the problem here may be the fact that nulls hash to exactly
0, and are therefore always part of the initial in-memory batch regardless
of how much splitting happens. If a large subset of the data had some
other constant value, it would likely have some non-zero bits in its hash
value and therefore stand a good chance of being written out to a batch
file before things get too desperate, avoiding the mass explosion.
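[To see why a hash value of exactly 0 is pinned in place, here is a
minimal, self-contained sketch of how a hash value is mapped to a bucket
and batch number, loosely modeled on ExecHashGetBucketAndBatch() in
nodeHash.c; the function name, constants and exact bit layout below are
illustrative, not a copy of the source. The low bits pick the bucket and
the next bits pick the batch, so a value of 0 lands in bucket 0 of batch 0
no matter how many times nbatch doubles.]

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Sketch of the bucket/batch mapping: nbuckets and nbatch are powers
     * of two, the low log2(nbuckets) bits choose the bucket, and the next
     * bits choose the batch.  A hash value of 0 therefore always maps to
     * bucket 0 of batch 0, however often the batch count is doubled.
     */
    static void
    get_bucket_and_batch(uint32_t hashvalue, int nbuckets, int log2_nbuckets,
                         int nbatch, int *bucketno, int *batchno)
    {
        *bucketno = (int) (hashvalue & (uint32_t) (nbuckets - 1));
        if (nbatch > 1)
            *batchno = (int) ((hashvalue >> log2_nbuckets) & (uint32_t) (nbatch - 1));
        else
            *batchno = 0;
    }

    int
    main(void)
    {
        int bucketno, batchno;

        /* NULL keys hash to 0, so they never move out of batch 0 */
        for (int nbatch = 8; nbatch <= 65536; nbatch *= 2)
        {
            get_bucket_and_batch(0, 131072, 17, nbatch, &bucketno, &batchno);
            printf("nbatch=%6d -> bucket %d, batch %d\n",
                   nbatch, bucketno, batchno);
        }
        return 0;
    }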
A quick test suggests that initializing the hash value to ~0 rather than 0
has a curious effect: the number of batches still explodes, but the
performance does not suffer the same way. (I think because almost all the
batches end up empty.)

I think this is worth doing even in the absence of a more general
solution; nulls are common enough and important enough that they shouldn't
be the worst-case value if it can be avoided.

(Testing this idea revealed that a regression test for rowsecurity is
missing an order by clause and changes result based on hash values)

-- 
Andrew (irc:RhodiumToad)
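[For readers not familiar with the executor code, the tweak being
described amounts to changing the seed used when the per-column hashes are
combined in ExecHashGetHashValue() in nodeHash.c. A simplified sketch of
that combining step follows; combine_hash_keys is an invented stand-alone
helper, and the rotate/mix step is paraphrased rather than copied.]

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Sketch of how a multi-column hash key is combined.  With the
     * current seed of 0, a key whose columns are all NULL ends up with
     * hash value 0; seeding with ~0 instead avoids that collision with
     * the "always batch 0, bucket 0" value.
     */
    static uint32_t
    combine_hash_keys(const uint32_t *key_hashes, const bool *key_isnull,
                      int nkeys)
    {
        uint32_t hashkey = ~(uint32_t) 0;   /* proposed seed; currently 0 */

        for (int i = 0; i < nkeys; i++)
        {
            /* rotate left one bit, then mix in this key's hash */
            hashkey = (hashkey << 1) | (hashkey >> 31);
            if (!key_isnull[i])
                hashkey ^= key_hashes[i];   /* a NULL key contributes nothing */
        }
        return hashkey;
    }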
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> A quick test suggests that initializing the hash value to ~0 rather than
> 0 has a curious effect: the number of batches still explodes, but the
> performance does not suffer the same way. (I think because almost all
> the batches end up empty.) I think this is worth doing even in the
> absence of a more general solution; nulls are common enough and
> important enough that they shouldn't be the worst-case value if it can
> be avoided.

I think that's unlikely to be a path to a good solution. At least part of
the problem here is that estimate_hash_bucketsize() supposes that nulls
can be ignored --- which is normally true, and invalidates your claim that
they're common. But in a RIGHT JOIN situation, they need to be considered
as if they were regular keys. That would probably be sufficient to
dissuade the planner from choosing a hash join in this example.

There may also be something we can do in the executor, but it would take
closer analysis to figure out what's going wrong. I don't think kluging
the behavior for NULL in particular is the answer.

			regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 >> A quick test suggests that initializing the hash value to ~0 rather
 >> than 0 has a curious effect: the number of batches still explodes,
 >> but the performance does not suffer the same way. (I think because
 >> almost all the batches end up empty.) I think this is worth doing
 >> even in the absence of a more general solution; nulls are common
 >> enough and important enough that they shouldn't be the worst-case
 >> value if it can be avoided.

 Tom> I think that's unlikely to be a path to a good solution.

It wasn't really intended to be.

 Tom> At least part of the problem here is that
 Tom> estimate_hash_bucketsize() supposes that nulls can be ignored ---
 Tom> which is normally true, and invalidates your claim that they're
 Tom> common. But in a RIGHT JOIN situation, they need to be considered
 Tom> as if they were regular keys. That would probably be sufficient
 Tom> to dissuade the planner from choosing a hash join in this example.

I've now tried the attached patch to correct the bucketsize estimates, and
it does indeed stop the planner from considering the offending path (in
this case it just does the join the other way round).

One thing I couldn't immediately see how to do was account for the case
where there are a lot of nulls in the table but a strict qual (or an IS
NOT NULL) filters them out; this patch will be overly pessimistic about
such cases. Do estimates normally try and take things like this into
account? I didn't find any other relevant examples.

 Tom> There may also be something we can do in the executor, but it
 Tom> would take closer analysis to figure out what's going wrong. I
 Tom> don't think kluging the behavior for NULL in particular is the
 Tom> answer.

The point with nulls is that a hash value of 0 is currently special in two
distinct ways: it's always in batch 0 and bucket 0 regardless of how many
batches and buckets there are, and it's the result of hashing a null.
These two special cases interact in a worst-case manner, so it seems
worthwhile to avoid that.

-- 
Andrew (irc:RhodiumToad)
Attachment
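[The attached patch is not reproduced in this archive. As a rough
illustration of the estimator-side idea only, not the selfuncs.c code or
the patch itself: when the join type keeps NULL keys in the hash table
(RIGHT/FULL), they all hash to the same value, so the null fraction from
pg_statistic can be treated like one more most-common value when deciding
how skewed the buckets will be. The function and parameter names below are
invented for this sketch.]

    #include <stdbool.h>

    /*
     * Hypothetical sketch: fold the null fraction into the expected
     * worst-case bucket fraction when NULL keys are kept.
     */
    static double
    sketch_bucket_fraction(double mcv_freq,      /* frequency of most common value */
                           double null_frac,     /* fraction of NULL join keys */
                           double even_spread,   /* roughly 1/ndistinct */
                           bool nulls_are_kept)  /* true for RIGHT/FULL hash joins */
    {
        double  worst = mcv_freq;

        if (nulls_are_kept && null_frac > worst)
            worst = null_frac;          /* the NULLs all pile into one bucket */

        /* never estimate better than a perfectly uniform spread of keys */
        return (worst > even_spread) ? worst : even_spread;
    }

[As noted above, a strict qual or IS NOT NULL that removes the nulls
before the join would make this kind of estimate overly pessimistic.]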
Hi,

On 16.2.2015 00:50, Andrew Gierth wrote:
>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> I've now tried the attached patch to correct the bucketsize
> estimates, and it does indeed stop the planner from considering the
> offending path (in this case it just does the join the other way
> round).
>
> One thing I couldn't immediately see how to do was account for the
> case where there are a lot of nulls in the table but a strict qual
> (or an IS NOT NULL) filters them out; this patch will be overly
> pessimistic about such cases. Do estimates normally try and take
> things like this into account? I didn't find any other relevant
> examples.

Improving the estimates is always good, but it's not going to fix the case
of non-NULL values (it shouldn't be all that difficult to create such
examples with a value whose hash starts with a bunch of zeroes).

> Tom> There may also be something we can do in the executor, but it
> Tom> would take closer analysis to figure out what's going wrong. I
> Tom> don't think kluging the behavior for NULL in particular is the
> Tom> answer.
>
> The point with nulls is that a hash value of 0 is currently special
> in two distinct ways: it's always in batch 0 and bucket 0 regardless
> of how many batches and buckets there are, and it's the result of
> hashing a null. These two special cases interact in a worst-case
> manner, so it seems worthwhile to avoid that.

I think you're right this is a flaw in general - all it takes is a
sufficiently common value with a hash value falling into the first batch
(i.e. either 0 or starting with a lot of 0 bits, thus giving batchno==0).

I think this might be solved by relaxing the check a bit. Currently we do
this (see nodeHash.c:735):

    if (nfreed == 0 || nfreed == ninmemory)
    {
        hashtable->growEnabled = false;
    }

which only disables nbatch growth if *all* the tuples remain in the batch
(or all of them move out). That's rather strict, and it takes a single
tuple to break this.

With each nbatch increase we expect a 50:50 split, i.e. 1/2 the tuples
staying in the batch, 1/2 moved to the new one. So a much higher ratio is
suspicious and most likely means the same hash value, so what if we did
something like this:

    if ((nfreed >= 0.9 * (nfreed + ninmemory)) ||
        (nfreed <= 0.1 * (nfreed + ninmemory)))
    {
        hashtable->growEnabled = false;
    }

which disables nbatch growth if either >= 90% of the tuples remained in
the first batch, or >= 90% of the tuples were moved from it. The exact
thresholds might be set a bit differently, but I think 10:90 sounds about
right.
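[As background for the check being discussed, here is a simplified,
self-contained sketch of the repartitioning step it lives in, loosely
modeled on ExecHashIncreaseNumBatches() in nodeHash.c. The struct and
helper names (HashTableSketch, TupleSketch, get_bucket_and_batch,
write_to_batch_file) are stand-ins invented here, and memory-chunk
handling, temp-file management and error paths are omitted. The point is
where nfreed and ninmemory come from: when the memory limit is hit, nbatch
is doubled and every in-memory tuple is rehashed, either staying put or
being written out to a later batch's temp file.]

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical, stripped-down stand-ins for the executor structures. */
    typedef struct TupleSketch
    {
        uint32_t            hashvalue;
        struct TupleSketch *next;
    } TupleSketch;

    typedef struct
    {
        int          nbatch;
        int          nbuckets;
        int          log2_nbuckets;
        int          curbatch;
        bool         growEnabled;
        TupleSketch *first_tuple;
    } HashTableSketch;

    /* same low-bits-bucket / next-bits-batch scheme as the earlier sketch */
    static void
    get_bucket_and_batch(const HashTableSketch *ht, uint32_t hashvalue,
                         int *bucketno, int *batchno)
    {
        *bucketno = (int) (hashvalue & (uint32_t) (ht->nbuckets - 1));
        *batchno = (ht->nbatch > 1)
            ? (int) ((hashvalue >> ht->log2_nbuckets) & (uint32_t) (ht->nbatch - 1))
            : 0;
    }

    /* stand-in for dumping a tuple to a per-batch temp file */
    static void
    write_to_batch_file(HashTableSketch *ht, int batchno, TupleSketch *tup)
    {
        (void) ht; (void) batchno; (void) tup;      /* omitted in this sketch */
    }

    static void
    increase_num_batches(HashTableSketch *ht)
    {
        long ninmemory = 0;     /* tuples examined during the split */
        long nfreed = 0;        /* tuples written out to later batches */

        if (!ht->growEnabled)
            return;

        ht->nbatch *= 2;        /* double the number of batches */

        for (TupleSketch *tup = ht->first_tuple; tup != NULL; tup = tup->next)
        {
            int bucketno, batchno;

            ninmemory++;
            get_bucket_and_batch(ht, tup->hashvalue, &bucketno, &batchno);

            if (batchno == ht->curbatch)
                continue;                           /* stays in memory */

            write_to_batch_file(ht, batchno, tup);  /* no longer our batch */
            nfreed++;
        }

        /*
         * The check under discussion: if the split moved nothing (or
         * everything), doubling again cannot help, so stop growing nbatch.
         */
        if (nfreed == 0 || nfreed == ninmemory)
            ht->growEnabled = false;
    }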
Trivial patch implementing this attached - with it the explain analyze
looks like this:

test=# explain analyze select status, count(*) from q3 left join m3 on
       (m3.id=q3.id) group by status;

                              QUERY PLAN
----------------------------------------------------------------------
 HashAggregate  (cost=64717.63..64717.67 rows=4 width=4)
                (actual time=514.619..514.630 rows=5 loops=1)
   Group Key: m3.status
   ->  Hash Right Join  (cost=18100.00..62217.63 rows=500000 width=4)
                        (actual time=75.260..467.911 rows=500108 loops=1)
         Hash Cond: (m3.id = q3.id)
         ->  Seq Scan on m3  (cost=0.00..18334.00 rows=1000000 width=37)
                             (actual time=0.003..91.799 rows=1000000 loops=1)
         ->  Hash  (cost=7943.00..7943.00 rows=500000 width=33)
                   (actual time=74.916..74.916 rows=500000 loops=1)
               Buckets: 65536 (originally 65536)
               Batches: 32 (originally 16)  Memory Usage: 8824kB
               ->  Seq Scan on q3  (cost=0.00..7943.00 rows=500000 width=33)
                                   (actual time=0.005..27.395 rows=500000 loops=1)
 Planning time: 0.172 ms
 Execution time: 514.663 ms
(10 rows)

Without the patch this runs in ~240 seconds and the number of batches
explodes to ~131k.

In theory it might happen that there are just a few hash values and all of
them are exactly the same within the first N bits (thus falling into the
first batch), but differ at bit N+1, so the next split would actually
help. But I think we can leave that up to probability, just like all the
hash code.

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
>>>>> "Tomas" == Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

 Tomas> Improving the estimates is always good, but it's not going to
 Tomas> fix the case of non-NULL values (it shouldn't be all that
 Tomas> difficult to create such examples with a value whose hash starts
 Tomas> with a bunch of zeroes).

Right now, I can't get it to plan such an example, because (a) if there
are no stats to work from then the planner makes fairly pessimistic
assumptions about hash bucket filling, and (b) if there _are_ stats to
work from, then a frequently-occurring non-null value shows up as an MCV
and the planner takes that into account to calculate bucketsize.

The problem could only be demonstrated for NULLs because the planner was
ignoring NULL for the purposes of estimating bucketsize, which is correct
for all join types except RIGHT and FULL (which, iirc, are more recent
additions to the hashjoin repertoire).

If you want to try testing it, you may find this useful:

select i, hashint8(i)
  from unnest(array[1474049294, -1779024306, -1329041947]) u(i);
      i      | hashint8 
-------------+----------
  1474049294 |        0
 -1779024306 |        0
 -1329041947 |        0
(3 rows)

(those are the only three int4 values that hash to exactly 0)

It's probably possible to construct pathological cases by finding a lot of
different values with zeros in the high bits of the hash, but that's
something that wouldn't be likely to happen by chance.

 Tomas> I think this might be solved by relaxing the check a bit.

Yeah, that looks potentially useful.

-- 
Andrew (irc:RhodiumToad)
On 16.2.2015 03:38, Andrew Gierth wrote:
>>>>>> "Tomas" == Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>
> Tomas> Improving the estimates is always good, but it's not going to
> Tomas> fix the case of non-NULL values (it shouldn't be all that
> Tomas> difficult to create such examples with a value whose hash starts
> Tomas> with a bunch of zeroes).
>
> Right now, I can't get it to plan such an example, because (a) if
> there are no stats to work from then the planner makes fairly
> pessimistic assumptions about hash bucket filling, and (b) if there
> _are_ stats to work from, then a frequently-occurring non-null value
> shows up as an MCV and the planner takes that into account to
> calculate bucketsize.
>
> The problem could only be demonstrated for NULLs because the planner
> was ignoring NULL for the purposes of estimating bucketsize, which is
> correct for all join types except RIGHT and FULL (which, iirc, are
> more recent additions to the hashjoin repertoire).

Oh, right, the estimate fix is probably sufficient then.

> If you want to try testing it, you may find this useful:
>
> select i, hashint8(i)
>   from unnest(array[1474049294, -1779024306, -1329041947]) u(i);
>       i      | hashint8 
> -------------+----------
>   1474049294 |        0
>  -1779024306 |        0
>  -1329041947 |        0
> (3 rows)
>
> (those are the only three int4 values that hash to exactly 0)
>
> It's probably possible to construct pathological cases by finding a
> lot of different values with zeros in the high bits of the hash, but
> that's something that wouldn't be likely to happen by chance.

Yeah, it's probably possible, but it's admittedly considerably harder than
I initially thought. For example it could be possible to create the table
with no MCV values but sorted so that all the initial values have
hashvalue=0, triggering (growEnabled=false). But that's rather unlikely to
happen in practice I guess.

A more likely failure scenario is a hash join higher up the plan,
processing results of other joins etc. In that case the estimates will be
tricky, although the planner chooses quite pessimistic defaults in those
cases.

-- 
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/15/15 7:16 PM, Tomas Vondra wrote:
> Hi,
>
> On 16.2.2015 00:50, Andrew Gierth wrote:
>>>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>>
>> I've now tried the attached patch to correct the bucketsize
>> estimates, and it does indeed stop the planner from considering the
>> offending path (in this case it just does the join the other way
>> round).
>>
>> One thing I couldn't immediately see how to do was account for the
>> case where there are a lot of nulls in the table but a strict qual
>> (or an IS NOT NULL) filters them out; this patch will be overly
>> pessimistic about such cases. Do estimates normally try and take
>> things like this into account? I didn't find any other relevant
>> examples.
>
> Improving the estimates is always good, but it's not going to fix the
> case of non-NULL values (it shouldn't be all that difficult to create
> such examples with a value whose hash starts with a bunch of zeroes).
>
>> Tom> There may also be something we can do in the executor, but it
>> Tom> would take closer analysis to figure out what's going wrong. I
>> Tom> don't think kluging the behavior for NULL in particular is the
>> Tom> answer.
>>
>> The point with nulls is that a hash value of 0 is currently special
>> in two distinct ways: it's always in batch 0 and bucket 0 regardless
>> of how many batches and buckets there are, and it's the result of
>> hashing a null. These two special cases interact in a worst-case
>> manner, so it seems worthwhile to avoid that.
>
> I think you're right this is a flaw in general - all it takes is a
> sufficiently common value with a hash value falling into the first
> batch (i.e. either 0 or starting with a lot of 0 bits, thus giving
> batchno==0).
>
> I think this might be solved by relaxing the check a bit. Currently we
> do this (see nodeHash.c:735):
>
>     if (nfreed == 0 || nfreed == ninmemory)
>     {
>         hashtable->growEnabled = false;
>     }
>
> which only disables nbatch growth if *all* the tuples remain in the
> batch (or all of them move out). That's rather strict, and it takes a
> single tuple to break this.
>
> With each nbatch increase we expect a 50:50 split, i.e. 1/2 the tuples
> staying in the batch, 1/2 moved to the new one. So a much higher ratio
> is suspicious and most likely means the same hash value, so what if we
> did something like this:
>
>     if ((nfreed >= 0.9 * (nfreed + ninmemory)) ||
>         (nfreed <= 0.1 * (nfreed + ninmemory)))
>     {
>         hashtable->growEnabled = false;
>     }
>
> which disables nbatch growth if either >= 90% of the tuples remained in
> the first batch, or >= 90% of the tuples were moved from it. The exact
> thresholds might be set a bit differently, but I think 10:90 sounds
> about right.
> Trivial patch implementing this attached - with it the explain analyze
> looks like this:
>
> test=# explain analyze select status, count(*) from q3 left join m3 on
>        (m3.id=q3.id) group by status;
>
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  HashAggregate  (cost=64717.63..64717.67 rows=4 width=4)
>                 (actual time=514.619..514.630 rows=5 loops=1)
>    Group Key: m3.status
>    ->  Hash Right Join  (cost=18100.00..62217.63 rows=500000 width=4)
>                         (actual time=75.260..467.911 rows=500108 loops=1)
>          Hash Cond: (m3.id = q3.id)
>          ->  Seq Scan on m3  (cost=0.00..18334.00 rows=1000000 width=37)
>                              (actual time=0.003..91.799 rows=1000000 loops=1)
>          ->  Hash  (cost=7943.00..7943.00 rows=500000 width=33)
>                    (actual time=74.916..74.916 rows=500000 loops=1)
>                Buckets: 65536 (originally 65536)
>                Batches: 32 (originally 16)  Memory Usage: 8824kB
>                ->  Seq Scan on q3  (cost=0.00..7943.00 rows=500000 width=33)
>                                    (actual time=0.005..27.395 rows=500000 loops=1)
>  Planning time: 0.172 ms
>  Execution time: 514.663 ms
> (10 rows)
>
> Without the patch this runs in ~240 seconds and the number of batches
> explodes to ~131k.
>
> In theory it might happen that there are just a few hash values and all
> of them are exactly the same within the first N bits (thus falling into
> the first batch), but differ at bit N+1, so the next split would
> actually help. But I think we can leave that up to probability, just
> like all the hash code.

Anything happen with this, or the patch Andrew posted?

-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
>>>>> "Jim" == Jim Nasby <Jim.Nasby@BlueTreble.com> writes:

 Jim> Anything happen with this, or the patch Andrew posted?

No. And my attention has just been drawn to this, which looks like the
same issue:

http://www.postgresql.org/message-id/52B47B47-0926-4E15-B25E-212DF52FE695@oseberg.io

-- 
Andrew (irc:RhodiumToad)