Re: bad estimation together with large work_mem generates terrible slow hash joins - Mailing list pgsql-hackers
From:           Tomas Vondra
Subject:        Re: bad estimation together with large work_mem generates terrible slow hash joins
Date:
Msg-id:         53E61E7C.3050301@fuzzy.cz
In response to: Re: bad estimation together with large work_mem generates terrible slow hash joins (Tomas Vondra <tv@fuzzy.cz>)
Responses:      Re: bad estimation together with large work_mem generates terrible slow hash joins
List:           pgsql-hackers
On 20.7.2014 18:29, Tomas Vondra wrote:
> Attached v9 of the patch. Aside from a few minor fixes, the main change
> is that this is assumed to be combined with the "dense allocation" patch.
>
> It also rewrites the ExecHashIncreaseNumBuckets to follow the same
> pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly,
> instead of buckets). It's cleaner and more consistent.
>
> hashjoin-nbuckets-growth-v9.patch contains only this patch, so you need
> to grab the hashjoin-alloc-v4.patch from a different thread and apply it
> first)
>
> hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined

I just noticed that the changes made to ExecHashGetBucketAndBatch may
not be perfectly fine - they don't "break" the hash join (i.e. the
results are still correct), but I believe they may cause more batches.
I'm not entirely sure about it, though, or how to fix it.

First, an intro into ExecHashGetBucketAndBatch internals, and how the
patch changes them. If not interested, skip to the "Problem" section
below.

The "old" ExecHashGetBucketAndBatch did this:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);

i.e. given the 32-bit hash value, the lowest log2_nbuckets bits are
used to determine the bucket number, and the rest of the hash (after
removing the bits used for buckets) is used for the batch number. I.e.
something like this:

     31               23               15               7               0
    |----------------|----------------|----------------|----------------|
    |    <- batches                                    |     buckets    |

This worked fine, because the number of bits for buckets was fixed, and
only the number of batches was growing. That was done by adding
most-significant bits (as illustrated by the tiny arrow). So when there
were 4 bits for the batch number, after adding another bit (doubling
nbatches) batch '0000' would be split into either '00000' or '10000'.

With a dynamic bucket count this does not work, because the batches and
buckets need to be independent (i.e. derived from non-overlapping parts
of the hash). The simplest approach of 'moving batches around' does not
work, because that would break this assert:

    Assert(batchno > curbatch);

In other words, "don't move tuples to already-processed batches". So
the batch number of a tuple needs to be sufficiently stable, and may
only ever increase (never decrease). So what I did is this:

     31               23               15               7               0
    |----------------|----------------|----------------|----------------|
    | batches ->     |                                 |   <- buckets   |

and this is what happens in ExecHashGetBucketAndBatch:

    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> (32 - hashtable->log2_nbatch));

So the "bucketno" expression is exactly the same (except that nbuckets
may now change dynamically), while "batchno" is now computed from the
other end of the hash (the highest bits), and grows by adding a
least-significant bit.

Problem
-------

The original implementation had the nice feature that each increase of
the number of batches (nbatches *= 2) split each batch in half - half
the tuples stayed in the batch, half the tuples moved to one of the
newly created batches, thanks to adding a most-significant bit. The
tuples getting a 0 bit stayed in the same batch, tuples getting a 1
moved to the new one (and it was guaranteed to be one of the new ones).

It's slightly more complicated because of the lazy behavior when
repeatedly incrementing the number of batches, but this is the
principle: keep 1/2 the tuples, move 1/2 to another batch.
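For illustration, here's a tiny standalone C program (a toy, not
PostgreSQL code - the constants and the hash mixing are made up)
demonstrating that property: exposing one more most-significant bit of
the batch part means the new batch number is either the old one or
old + nbatch, i.e. one of the newly created batches.

    #include <stdio.h>

    int main(void)
    {
        unsigned log2_nbuckets = 10;    /* bucket bits, fixed in the old scheme */
        unsigned nbatch = 16;           /* 4 batch bits before doubling */
        unsigned i;

        for (i = 0; i < 8; i++)
        {
            unsigned hash = i * 2654435761u;    /* crude mixing, toy hash values */
            unsigned oldbatch = (hash >> log2_nbuckets) & (nbatch - 1);
            unsigned newbatch = (hash >> log2_nbuckets) & (2 * nbatch - 1);

            /* newbatch is always oldbatch (new MSB = 0) or
             * oldbatch + nbatch (new MSB = 1), never anything else */
            printf("hash %08x: batch %2u -> %2u\n", hash, oldbatch, newbatch);
        }
        return 0;
    }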
The new implementation changes this, because the batch number grows in
the opposite direction - it adds a least-significant bit, so for
example batch '1111' gets split into '11111' and '11110'. That's
roughly equal to

    (batchno << 1) + K

where K is either 0 or 1, which is always >= the old batchno. So it
does not violate the Assert (which is why I haven't noticed this
earlier). But it breaks the desirable behavior that 1/2 the tuples
remain in the current batch - instead, it moves a lot of them to
batches further down the road, and needlessly increases the number of
batches. At least that's how I understand it ...

Possible solutions
------------------

Adding a least-significant bit does not work, so we need to get back
to adding the most-significant one. I'm not sure what's the least
complex way to do that, though.

I'm thinking about computing the nbuckets limit (how many buckets may
fit into work_mem):

    tupsize = HJTUPLE_OVERHEAD +
              MAXALIGN(sizeof(MinimalTupleData)) +
              MAXALIGN(tupwidth);

    nbuckets_bits_max = my_log2(work_mem / tupsize)

and then starting the batches from this bit, like this:

     31               23     max      15               7               0
    |----------------|----------------|----------------|----------------|
    |    <- batches          |      free      |       <- buckets        |

That might work, I guess. But maybe there's something better (using
some secret bitwise-operator-fu ...).

Further concerns
----------------

I'm wondering whether we should deploy some safeguards against
exceeding the 32 bits of the hash. It probably was not a big issue
before, but with large work_mem values and NTUP_PER_BUCKET=1 this
might become a problem. With work_mem=1GB and 50B tuples (including
overhead):

    buckets = 21474836

which means ~25 bits reserved for nbucketno. That leaves only 7 bits
for the batch number. Yeah, that's ~128 batches (so a ~128GB hash
table), but maybe that's not that far away. (The P.S. below replays
this arithmetic.)

Also, what if the tupwidth estimate is too low? In that case
nbuckets_bits_max would be artificially high (and we'd have to do more
batches).

regards
Tomas
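P.S. For anyone who wants to replay the bit-budget arithmetic above,
here's a small standalone program (toy code, not from the patch; the
my_log2 here mimics the round-up loop of my_log2 in nodeHash.c):

    #include <stdio.h>

    /* round up to the nearest power-of-2 exponent */
    static int
    my_log2(long num)
    {
        int     i;
        long    limit;

        for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
            ;
        return i;
    }

    int main(void)
    {
        long work_mem = 1024L * 1024L * 1024L;  /* 1GB, in bytes */
        long tupsize = 50;                      /* tuple size incl. overhead */

        int bucket_bits = my_log2(work_mem / tupsize);  /* -> 25 */
        int batch_bits = 32 - bucket_bits;              /* ->  7 */

        printf("bucket bits: %d, batch bits: %d (at most %d batches)\n",
               bucket_bits, batch_bits, 1 << batch_bits);
        return 0;
    }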