Re: bad estimation together with large work_mem generates terrible slow hash joins - Mailing list pgsql-hackers
From:           Tomas Vondra
Subject:        Re: bad estimation together with large work_mem generates terrible slow hash joins
Date:
Msg-id:         53E61E7C.3050301@fuzzy.cz
In response to: Re: bad estimation together with large work_mem generates terrible slow hash joins (Tomas Vondra <tv@fuzzy.cz>)
Responses:      Re: bad estimation together with large work_mem generates terrible slow hash joins
List:           pgsql-hackers
On 20.7.2014 18:29, Tomas Vondra wrote:
> Attached v9 of the patch. Aside from a few minor fixes, the main change
> is that this is assumed to be combined with the "dense allocation" patch.
>
> It also rewrites the ExecHashIncreaseNumBuckets to follow the same
> pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly,
> instead of buckets). It's cleaner and more consistent.
>
> hashjoin-nbuckets-growth-v9.patch contains only this patch, so you need
> to grab the hashjoin-alloc-v4.patch from a different thread and apply it
> first)
>
> hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined

I just noticed that the changes made to ExecHashGetBucketAndBatch may
not be perfectly fine - they don't "break" the hash join (i.e. the
results are still correct), but I believe they may cause more batches.
I'm not entirely sure about it, though, or how to fix it.

First, an intro into ExecHashGetBucketAndBatch internals, and how the
patch changes them. If not interested, skip to the "Problem" section
below.

The "old" ExecHashGetBucketAndBatch did this:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

i.e. given the 32-bit hash value, the lowest log2_nbuckets bits are
used to determine the bucket number, and the rest of the hash (after
removing the bits used for buckets) is used for the batch number. I.e.
something like this:

     31               23               15               7               0
    |----------------|----------------|----------------|----------------|
    |    <- batches                                    |     buckets    |

This worked fine, because the number of bits for buckets was fixed, and
only the number of batches was growing. That was done by adding
most-significant bits (as illustrated by the tiny arrow). So when there
were 4 bits for the batch number, after adding another bit (doubling
nbatches) batch '0000' would be split into either '00000' or '10000'.

With a dynamic bucket count this does not work, because the batches and
buckets need to be independent (i.e. derived from non-overlapping parts
of the hash). The simplest approach of 'moving batches around' does not
work, because that would break this assert:

    Assert(batchno > curbatch);

In other words, "don't move tuples to already-processed batches". So
the batch number of a tuple needs to be sufficiently stable, and may
only ever increase (never decrease). So what I did is this:

     31               23               15               7               0
    |----------------|----------------|----------------|----------------|
    | batches ->     |                                 |   <- buckets   |

and this is what happens in ExecHashGetBucketAndBatch:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

So the "bucketno" expression is exactly the same (except that nbuckets
may now change dynamically), while "batchno" is now computed from the
other end of the hash (the highest bits), and grows by adding a
least-significant bit.

Problem
-------

The original implementation had the nice feature that each increase of
the number of batches (nbatches *= 2) split each batch in half - half
the tuples stayed in the batch, half the tuples moved to one of the
newly created batches, thanks to adding a most-significant bit. The
tuples getting a 0 bit stayed in the same batch, tuples getting a 1
moved to the new one (and it was guaranteed to be one of the new ones).

It's slightly more complicated because of the lazy behavior when
repeatedly incrementing the number of batches, but this is the
principle: keep 1/2 the tuples, move 1/2 to another batch.
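For illustration, here's a tiny standalone C program (a toy, not
PostgreSQL code - the constants and the hash mixing are made up)
demonstrating that property: exposing one more most-significant bit of
the batch part means the new batch number is either the old one or
old + nbatch, i.e. one of the newly created batches.

    #include <stdio.h>

    int main(void)
    {
        unsigned log2_nbuckets = 10;    /* bucket bits, fixed in the old scheme */
        unsigned nbatch = 16;           /* 4 batch bits before doubling */
        unsigned i;

        for (i = 0; i < 8; i++)
        {
            unsigned hash = i * 2654435761u;    /* crude mixing, toy hash values */
            unsigned oldbatch = (hash >> log2_nbuckets) & (nbatch - 1);
            unsigned newbatch = (hash >> log2_nbuckets) & (2 * nbatch - 1);

            /* newbatch is always oldbatch (new MSB = 0) or
             * oldbatch + nbatch (new MSB = 1), never anything else */
            printf("hash %08x: batch %2u -> %2u\n", hash, oldbatch, newbatch);
        }
        return 0;
    }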
The new implementation changes this, because the batch number grows in
the opposite direction - it adds a least-significant bit, so for
example batch '1111' gets split into '11111' and '11110'. That's
roughly equal to

    (batchno << 1) + K

where K is either 0 or 1, which is always >= the old batchno. So it
does not violate the Assert (which is why I haven't noticed this
earlier). But it breaks the desirable behavior that 1/2 the tuples
remain in the current batch - instead, it moves a lot of them to
batches further down the road, and needlessly increases the number of
batches. At least that's how I understand it ...

Possible solutions
------------------

Adding a least-significant bit does not work, so we need to get back
to adding the most-significant one. I'm not sure what's the least
complex way to do that, though.

I'm thinking about computing the nbuckets limit (how many buckets may
fit into work_mem):

    tupsize = HJTUPLE_OVERHEAD +
              MAXALIGN(sizeof(MinimalTupleData)) +
              MAXALIGN(tupwidth);

    nbuckets_bits_max = my_log2(work_mem / tupsize)

and then starting the batches from this bit, like this:

     31               23     max      15               7               0
    |----------------|----------------|----------------|----------------|
    |    <- batches          |      free      |       <- buckets        |

That might work, I guess. But maybe there's something better (using
some secret bitwise-operator-fu ...).

Further concerns
----------------

I'm wondering whether we should deploy some safeguards against
exceeding the 32 bits of the hash. It probably was not a big issue
before, but with large work_mem values and NTUP_PER_BUCKET=1 this
might become a problem. With work_mem=1GB and 50B tuples (including
overhead):

    buckets = 21474836

which means ~25 bits reserved for nbucketno. That leaves only 7 bits
for the batch number. Yeah, that's ~128 batches (so a ~128GB hash
table), but maybe that's not that far away. (The P.S. below replays
this arithmetic.)

Also, what if the tupwidth estimate is too low? In that case
nbuckets_bits_max would be artificially high (and we'd have to do more
batches).

regards
Tomas
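P.S. For anyone who wants to replay the bit-budget arithmetic above,
here's a small standalone program (toy code, not from the patch; the
my_log2 here mimics the round-up loop of my_log2 in nodeHash.c):

    #include <stdio.h>

    /* round up to the nearest power-of-2 exponent */
    static int
    my_log2(long num)
    {
        int     i;
        long    limit;

        for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
            ;
        return i;
    }

    int main(void)
    {
        long work_mem = 1024L * 1024L * 1024L;  /* 1GB, in bytes */
        long tupsize = 50;                      /* tuple size incl. overhead */

        int bucket_bits = my_log2(work_mem / tupsize);  /* -> 25 */
        int batch_bits = 32 - bucket_bits;              /* ->  7 */

        printf("bucket bits: %d, batch bits: %d (at most %d batches)\n",
               bucket_bits, batch_bits, 1 << batch_bits);
        return 0;
    }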