Hi,
attached is v2 of the patch, with a bugfix and two significant improvements:
1) bugfix: forgotten memset() in ExecHashIncreaseNumBatches()
The missing memset() caused segfaults whenever we started with a
single batch and then had to increase the number of batches.
2) 0002: postpone the batching (not just the build of buckets)
When ExecChooseHashTableSize finds out we'll probably need batching,
we start with nbatch=1 anyway, and only switch to batching (with the
estimated number of batches) after filling work_mem.
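To illustrate the effect with a toy standalone simulation (the numbers
and names below are made up, this is not code from 0002): we always
start loading into a single batch, remember the estimated batch count,
and only act on it once work_mem actually overflows - so with an
over-estimated row count the switch simply never happens:

#include <stdio.h>

int
main(void)
{
    long        work_mem_bytes = 4L * 1024 * 1024;  /* pretend work_mem = 4MB */
    long        tuple_size = 64;        /* made-up average tuple width */
    long        nbatch_estimated = 16;  /* what ExecChooseHashTableSize predicted */
    long        ntuples_actual = 40000; /* reality: the table fits in memory */

    long        nbatch = 1;             /* always start unbatched */
    long        space_used = 0;

    for (long i = 0; i < ntuples_actual; i++)
    {
        space_used += tuple_size;
        if (nbatch == 1 && space_used > work_mem_bytes)
            nbatch = nbatch_estimated;  /* only now do we pay for batching */
    }

    printf("estimated nbatch = %ld, nbatch actually used = %ld\n",
           nbatch_estimated, nbatch);   /* prints 16 vs. 1 here */
    return 0;
}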
This helps in two basic cases. Firstly, when we over-estimate we may
end up batching even when a single batch would do (i.e. without
writing to temp files at all). That's really expensive, and this
change helps with it - not entirely, because we can only distinguish
the "no batching vs. batching" case and not the exact number of
batches, but that's the more interesting case anyway (the difference
between batching with 16 or 32 batches is not large, at least in my
experience).
The other use case is the patch adding bloom filters, because this
change allows properly sizing the bloom filter, which needs the number
of distinct values. The filter could be sized using the estimated
number of tuples passed to the Hash, but then it's likely to be larger
than needed and thus more expensive. So sizing the bloom filter
properly is crucial, and this change allows accumulating enough data
to estimate its size first. More discussion in the other patch.
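For a rough idea of what's at stake, this is just the standard
textbook sizing formula (m = -n * ln(p) / ln(2)^2) applied to made-up
numbers - it's not taken from the bloom filter patch, but it shows why
feeding it the number of distinct values instead of an inflated tuple
estimate matters:

#include <math.h>
#include <stdio.h>

/* bits needed for n elements at false-positive rate p (textbook formula) */
static double
bloom_bits(double n, double p)
{
    return -n * log(p) / (log(2) * log(2));
}

int
main(void)
{
    double      p = 0.01;           /* made-up target false-positive rate */
    double      ndistinct = 1e6;    /* what we actually need to size for */
    double      ntuples = 1e7;      /* estimate, inflated by duplicates etc. */

    printf("sized by ndistinct: %.1f MB\n",
           bloom_bits(ndistinct, p) / 8 / (1024 * 1024));  /* ~1.1 MB */
    printf("sized by ntuples:   %.1f MB\n",
           bloom_bits(ntuples, p) / 8 / (1024 * 1024));    /* ~11.4 MB */
    return 0;
}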
3) 0003: eliminate the MaxAlloc limit
Currently the size of the bucket array is limited by MaxAlloc, which
means it can't exceed 512MB (assuming I've done my math correctly),
i.e. ~67M buckets. This removes the MaxAlloc limit and just keeps the
(INT_MAX/2) limit, so ~2B rows. I don't think it's worth extending
further at this point.
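For anyone who wants to check that math, a trivial standalone snippet
(it assumes 8-byte bucket pointers and a power-of-two bucket count,
with MaxAllocSize as defined in memutils.h):

#include <stdio.h>

#define MaxAllocSize ((size_t) 0x3fffffff)  /* 1GB - 1, from memutils.h */

int
main(void)
{
    size_t      ptr_size = 8;   /* sizeof(HashJoinTuple) on 64-bit */
    size_t      nbuckets = 1;

    /* largest power-of-two bucket count whose array still fits */
    while (nbuckets * 2 * ptr_size <= MaxAllocSize)
        nbuckets *= 2;

    printf("max nbuckets = %zu (%zu MB array)\n",
           nbuckets, nbuckets * ptr_size / (1024 * 1024));
    /* prints: max nbuckets = 67108864 (512 MB array) */
    return 0;
}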
There was a discussion about this, quoting the f2fc98fb message:
    Limit the size of the hashtable pointer array to not more than
    MaxAllocSize, per reports from Kouhei Kaigai and others of
    "invalid memory alloc request size" failures. There was
    discussion of allowing the array to get larger than that by using
    the "huge" palloc API, but so far no proof that that is actually
    a good idea, and at this point in the 9.5 cycle major changes
    from old behavior don't seem like the way to go.
I believe the objections were correct, because it'd really waste a
lot of memory in case of over-estimates. I do however think that
doing (1) and (2) fixes this issue, because with those changes we
never allocate the buckets based on the initial estimate. We pretty
much get the exact number of buckets necessary.
I haven't done any performance testing yet (well, I did, but the
results are not reliable enough to share here). I'll post more data in
early January, once the machine completes the other tests it's running
right now.
I've also been thinking about what other optimizations might be
possible, and the one thing I'm wondering about is adding a
HyperLogLog counter. The thing is that we size the buckets based on
the number of rows, but that may be nonsense - to size the hash table,
we actually need the number of distinct values. So when the Hash
contains duplicate rows (say 10 rows for each value), we end up with
only ~10% of the buckets containing any data (each with a list of
about 10 tuples).
If we knew the number of distinct values, we could make the bucket
array smaller and thus reduce the memory requirements significantly.
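As a toy illustration (made-up numbers, not code from any of the
attached patches), assuming roughly one tuple per bucket as the
sizing target:

#include <stdio.h>

/* round up to the next power of two, like nodeHash.c does for nbuckets */
static long
next_pow2(long n)
{
    long        v = 1;

    while (v < n)
        v *= 2;
    return v;
}

int
main(void)
{
    long        ntuples = 10000000;         /* rows fed to the Hash node */
    long        ndistinct = ntuples / 10;   /* say, 10 duplicate rows per value */

    printf("buckets sized from row count: %ld\n", next_pow2(ntuples));
    printf("buckets sized from ndistinct: %ld\n", next_pow2(ndistinct));
    return 0;
}

With 8-byte pointers that's the difference between a 128MB and an 8MB
bucket array in this example.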
I mention this because the patch with bloom filters actually uses HLL
to size the bloom filters, so this would not really be introducing any
new code.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services