Re: bad estimation together with large work_mem generates terrible slow hash joins - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Date:
Msg-id: 53F39983.2020303@fuzzy.cz
In response to: Re: bad estimation together with large work_mem generates terrible slow hash joins (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: bad estimation together with large work_mem generates terrible slow hash joins
List: pgsql-hackers
On 19.8.2014 19:05, Robert Haas wrote:
> On Sat, Aug 16, 2014 at 9:31 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 12.8.2014 00:30, Tomas Vondra wrote:
>>> On 11.8.2014 20:25, Robert Haas wrote:
>>>> It also strikes me that when there's only 1 batch, the set of bits
>>>> that map onto the batch number is zero-width, and one zero-width bit
>>>> range is as good as another. In other words, if you're only planning
>>>> to do one batch, you can easily grow the number of buckets on the
>>>> fly. Growing the number of buckets only becomes difficult once you
>>>> have more than one batch.
>>
>> ...
>>
>>> I was considering reversing the bits of the hash, because that's
>>> pretty much the simplest solution. But I think you're right it might
>>> actually work like this:
>>>
>>> * Are more batches needed?
>>>
>>> (yes) => just use nbuckets = my_log2(work_mem / tuple_size)
>>>
>>> (no) => go ahead, until processing all tuples or hitting work_mem
>>>
>>> (work_mem) => meh, use the same nbuckets above
>>>
>>> (all tuples) => compute optimal nbuckets / resize
>>>
>>> But I need to think about this a bit. So far it seems to me there's
>>> no way additional batches might benefit from increasing nbuckets
>>> further.
>>
>> I think this is a simple and solid solution, solving the batchno
>> computation issues quite nicely. Attached is v10 of the patch (bare
>> and combined with the dense allocation), which does this:
>>
>> 1) when we know we'll need batching, buckets are sized for full
>>    work_mem (using the estimated tuple width, etc.)
>>
>> 2) without batching, we estimate the 'right number of buckets' for
>>    the estimated number of tuples, and keep track of the optimal
>>    number as tuples are added to the hash table
>>
>>    - if we discover we need to start batching, we keep the current
>>      optimal value (which should be the same as the max number of
>>      buckets) and don't mess with it anymore (making it possible to
>>      compute batch IDs just like before)
>>
>>    - also, on the first rebatch (nbatch=1 => nbatch=2) the hash table
>>      is resized as part of the rebatch
>>
>>    - if the hash build completes without batching, we do the resize
>>
>> I believe the patch is pretty much perfect. I plan to do more
>> thorough testing on a wide range of queries in the next few days.
>>
>> I also removed the 'enable_hash_resize' GUC, because it would be more
>> complex to implement this properly after doing the resize as part of
>> rebatch etc. So either it would make the patch more complex, or it
>> wouldn't do what the name promises.
>
> A variety of trivial comments on this:
>
> PostgreSQL style is un-cuddled curly braces. Also, multi-line
> comments need to start with a line containing only /* and end with a
> line containing only */. In one place you've added curly braces
> around a single-line block that is otherwise unmodified; please don't
> do that. In one place, you have "becase" instead of "because". In
> another place, you write "add if after it" but it should say "add it
> after it" or maybe better "add the new one after it". Avoid using
> punctuation like "=>" in comments to illustrate the connection
> between sentences; instead, use a connecting word like "then" or
> "therefore" or whatever is appropriate; in this instance, a period
> followed by the start of a new sentence seems sufficient. Revert the
> removal of a single line of whitespace near the top of nodeHash.c.
>
> There are too many things marked XXX in this patch. They should
> either be fixed, if they are real problems, or they should be
> commented in a way that doesn't give rise to the idea that they're
> problems if they aren't.

OK, thanks for pointing this out. Attached is v11 of the patch (both
separate and combined with the dense allocation, as before). I fixed as
many of those issues as possible. All the XXX items were obsolete,
except for one in the chunk_alloc function. I have also removed one
constant.

> OK, now on to some more substantive stuff:
>
> 1. It's not clear to me what the overall effect of this patch on
> memory utilization is. Reducing NTUP_PER_BUCKET from 10 to 1 is going
> to use, on the average, 10x as much bucket-header memory per tuple.
> Specifically, I think it means we'll use about 8 bytes of
> bucket-header memory per tuple instead of 0.8 bytes per tuple. If the
> tuples are narrow, that could be significant; concerns have been
> expressed about that here in the past. Increasing the number of
> buckets could also increase memory usage. On the other hand, the
> dense allocation stuff probably saves a ton of memory, so maybe we
> end up ahead overall, but I'm not sure. Your thoughts, and maybe some
> test results with narrow and wide tuples, would be appreciated.

The effect of the dense allocation was briefly discussed in this
thread, along with some quick measurements:

http://www.postgresql.org/message-id/53BEEA9E.2080009@fuzzy.cz

The dense allocation removes pretty much all the palloc overhead. For a
40B tuple, I got this before the dense allocation

    HashBatchContext: 1451221040 total in 182 blocks; 2826592 free (11 chunks); 1448394448 used

and this with the dense allocation patch applied

    HashBatchContext: 729651520 total in 21980 blocks; 0 free (0 chunks); 729651520 used

So it's pretty much a 50% reduction in memory consumption. Now, the
palloc header is >=16B, and removing it more than compensates for the
increased number of buckets (in the worst case there may be ~2 buckets
per tuple, i.e. 2x 8B pointers). I did a number of tests, and the
results are quite consistent with this.
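To spell out that back-of-envelope arithmetic (a rough sketch only: it
assumes a 40B tuple, a 16B palloc chunk header and 8B pointers, and it
ignores the allocator's power-of-2 chunk rounding, which is why the
measured saving above is larger than this naive estimate):

    /* Rough per-tuple memory, before and after the patch (assumptions above). */
    #include <stdio.h>

    int
    main(void)
    {
        double tuple_width   = 40.0;   /* assumed tuple size in bytes */
        double palloc_header = 16.0;   /* >= 16B chunk header per palloc'd tuple */
        double ptr_size      = 8.0;    /* one bucket-header pointer */

        /* master: NTUP_PER_BUCKET = 10, one palloc per tuple */
        double before = tuple_width + palloc_header + ptr_size / 10.0;

        /* patched: NTUP_PER_BUCKET = 1 (worst case ~2 buckets/tuple), dense allocation */
        double after = tuple_width + 2.0 * ptr_size;

        printf("per-tuple bytes, master:  ~%.1f\n", before);   /* ~56.8 */
        printf("per-tuple bytes, patched: ~%.1f\n", after);    /* ~56.0 */
        return 0;
    }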
> 2. But, on the positive side, modulo the memory utilization questions
> mentioned above, I would expect the impact on hash join performance
> to be positive. Going from 10 tuples per bucket to just 1 should
> help, and on cases where the actual load factor would have ended up
> much higher because of poor estimation, increasing the number of
> buckets on the fly should help even more. I haven't tested this,
> though.

Yes, those are the results I was getting. I've done a number of tests,
and not a single one showed a slowdown so far. Most well-estimated
queries were 2-3x faster, and the poorly estimated ones were orders of
magnitude faster. We're working on testing this on a range of
workloads; I'll post some results once I have them.

> I haven't had a chance to completely go through this yet, so these
> are just some initial thoughts.

Thanks!

Tomas
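P.S. For anyone skimming the thread, here is a rough sketch of the
bucket-count handling described above. It is illustrative only: the
names (SketchHashTable, sketch_insert, next_pow2, ...) are made up for
this example and it is not the actual nodeHash.c code.

    /*
     * Sketch: while there is a single batch, keep updating the "optimal"
     * number of buckets as tuples arrive; once batching starts the value
     * is frozen (so batch numbers stay stable), and the bucket array is
     * resized on the first rebatch or when the build finishes unbatched.
     */
    #include <stdlib.h>

    #define NTUP_PER_BUCKET 1

    static int
    next_pow2(double v)                 /* smallest power of two >= v */
    {
        int     n = 1;

        while (n < v)
            n *= 2;
        return n;
    }

    typedef struct SketchHashTable
    {
        int     nbuckets;               /* current size of the bucket array */
        int     nbuckets_optimal;       /* best size for the tuples seen so far */
        int     nbatch;                 /* number of batches (1 = no batching) */
        double  ntuples;                /* tuples inserted so far */
        void  **buckets;                /* bucket array (tuple chains elided) */
    } SketchHashTable;

    /* called for every tuple added to the in-memory hash table */
    static void
    sketch_insert(SketchHashTable *ht)
    {
        ht->ntuples += 1;

        /* with a single batch we are still free to grow the bucket count */
        if (ht->nbatch == 1)
            ht->nbuckets_optimal = next_pow2(ht->ntuples / NTUP_PER_BUCKET);
    }

    /* called when the build finishes unbatched, or on the first rebatch */
    static void
    sketch_resize(SketchHashTable *ht)
    {
        if (ht->nbuckets_optimal > ht->nbuckets)
        {
            free(ht->buckets);
            ht->buckets = calloc(ht->nbuckets_optimal, sizeof(void *));
            ht->nbuckets = ht->nbuckets_optimal;
            /* ...walk the dense-allocated chunks and re-insert every tuple... */
        }
    }

    int
    main(void)
    {
        SketchHashTable ht = {1024, 1024, 1, 0, NULL};

        ht.buckets = calloc(ht.nbuckets, sizeof(void *));
        for (int i = 0; i < 1000000; i++)
            sketch_insert(&ht);
        sketch_resize(&ht);             /* build done, batching never kicked in */
        free(ht.buckets);
        return 0;
    }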