Re: PATCH: postpone building buckets to the end of Hash (in HashJoin) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PATCH: postpone building buckets to the end of Hash (in HashJoin)
Date
Msg-id 56858BF6.2000002@2ndquadrant.com
In response to PATCH: postpone building buckets to the end of Hash (in HashJoin)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
Hi,

Attached is v2 of the patch, with a bugfix and two significant improvements:

1) bugfix: forgotten memset() in ExecHashIncreaseNumBatches()

    Caused segfaults whenever we started with a single batch and then
    had to increase the number of batches.
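
    For illustration, here's a minimal standalone C sketch of the
    failure mode (not the actual patch - the Tuple struct and names are
    made up for the example):

        #include <stdlib.h>
        #include <string.h>

        typedef struct Tuple
        {
            struct Tuple *next;     /* chain within a bucket */
            int           value;
        } Tuple;

        int
        main(void)
        {
            int     nbuckets = 1024;
            Tuple **buckets = malloc(nbuckets * sizeof(Tuple *));

            memset(buckets, 0, nbuckets * sizeof(Tuple *));

            /* ... hash phase fills the buckets with tuple chains ... */

            /*
             * When the number of batches grows, tuples that now belong
             * to other batches are written out and the in-memory table
             * is rebuilt. Without this memset the buckets still point
             * at the old chains, and walking them later dereferences
             * dangling tuples - hence the segfaults.
             */
            memset(buckets, 0, nbuckets * sizeof(Tuple *));

            /* ... surviving tuples get re-inserted into clean buckets ... */

            free(buckets);
            return 0;
        }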

2) 0002: postpone the batching (not just the build of buckets)

    When ExecChooseHashTableSize finds out we'll probably need batching,
    we start with nbatch=1 anyway, and only switch to batching (with the
    estimated number of batches) after filling work_mem - see the sketch
    at the end of this item.

    This helps in two basic cases. First, when we over-estimate, we may
    end up batching even though a single batch would do (i.e. without
    writing to temp files at all). Batching is really expensive, and
    this helps with that - not entirely, because we can only distinguish
    the "no batching vs. batching" case and not correct the number of
    batches, but that's the more interesting case anyway (the difference
    between 16 and 32 batches is not large, at least in my experience).

    The other use case is the patch adding bloom filters, because this
    change allows sizing the bloom filter properly, which requires
    knowing the number of distinct values. The filter could be sized
    using the estimated number of tuples passed to the Hash, but then
    it's likely to be larger than needed and thus more expensive. Sizing
    the bloom filter well is crucial, and this change allows first
    accumulating enough data to estimate that size. More discussion in
    the other patch.
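
    A rough standalone sketch of that decision (hypothetical names and a
    simplified state struct - the real logic lives in nodeHash.c and is
    more involved):

        #include <stdio.h>
        #include <stddef.h>

        typedef struct
        {
            size_t  space_used;     /* memory used by in-memory tuples */
            size_t  space_allowed;  /* work_mem, in bytes */
            int     nbatch;         /* current number of batches */
            int     nbatch_est;     /* planner's estimate, kept aside */
        } HashState;

        /* called after each tuple is added to the in-memory table */
        static void
        maybe_start_batching(HashState *hs)
        {
            /*
             * Even if the planner predicted batching, start optimistic
             * with nbatch = 1. Only once work_mem actually fills do we
             * switch to the estimated number of batches, so inputs that
             * were over-estimated but fit in memory never touch temp
             * files.
             */
            if (hs->nbatch == 1 && hs->space_used > hs->space_allowed)
                hs->nbatch = (hs->nbatch_est > 1) ? hs->nbatch_est : 2;
        }

        int
        main(void)
        {
            HashState hs = {0, 4L * 1024 * 1024, 1, 16};

            hs.space_used = 5L * 1024 * 1024;   /* work_mem exceeded */
            maybe_start_batching(&hs);
            printf("nbatch = %d\n", hs.nbatch); /* prints 16 */
            return 0;
        }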

3) 0003: eliminate the MaxAlloc limit

    Currently the size of the bucket array is limited by MaxAllocSize,
    which means it can't exceed 512MB (assuming I've done my math
    correctly), i.e. ~67M buckets with 8-byte pointers. This removes the
    MaxAllocSize limit and keeps just the (INT_MAX/2) limit on the
    number of buckets, so ~2B rows - the arithmetic is spelled out after
    this item. I don't think it's worth extending further at this point.

    There was a discussion about this, quoting the f2fc98fb message:

       Limit the size of the hashtable pointer array to not more than
       MaxAllocSize, per reports from Kouhei Kaigai and others of
       "invalid memory alloc request size" failures.  There was
       discussion of allowing the array to get larger than that by using
       the "huge" palloc API, but so far no proof that that is actually
       a good idea, and at this point in the 9.5 cycle major changes
       from old behavior don't seem like the way to go.

    I believe the objections were correct, because it'd really waste a
    lot of memory in case of over-estimates. I do however think that
    doing (1) and (2) fixes this issue, because with those changes we
    never allocate the buckets based on the initial estimate. We pretty
    much get the exact number of buckets necessary.
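
    Spelling out that arithmetic as a quick standalone check (my
    numbers, assuming 64-bit pointers - not code from the patch):

        #include <stdio.h>
        #include <limits.h>

        int
        main(void)
        {
            /* the bucket array is an array of pointers */
            size_t  ptr_size = sizeof(void *);          /* 8 on 64-bit */

            /* old cap: the array itself limited to 512MB */
            size_t  old_bytes = 512UL * 1024 * 1024;

            /* 536870912 / 8 = 67108864, i.e. ~67M buckets */
            printf("old max buckets: %zu\n", old_bytes / ptr_size);

            /* new cap: nbuckets itself limited to INT_MAX/2 */
            printf("new max buckets: %d\n", INT_MAX / 2);
            return 0;
        }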

I haven't done any performance testing yet (well, I did, but nothing
reliable enough to share here). I'll post more data in early January,
once the machine completes the other tests it's running right now.

I've also been thinking about what other optimizations might be
possible, and one thing I'm wondering about is adding a HyperLogLog
counter. The thing is, we currently size the buckets based on the
number of rows, but that may be nonsense - to size the hash table, we
actually need the number of distinct values. So when the Hash contains
duplicate rows (say, 10 rows for each value), we end up with only ~10%
of the buckets containing any data (each with a list of about 10
tuples).

If we knew the number of distinct values, we could make the buckets
smaller and thus reduce the memory requirements significantly.

I mention this because the patch with bloom filters already uses HLL to
size the bloom filters, so this would not really introduce any new
code.
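
To make the bucket-sizing argument concrete, here's a small standalone
sketch (hypothetical - in the real thing the estimate would come from
the HLL counter fed with the tuple hash values; here it's a
placeholder):

    #include <stdio.h>

    /*
     * Placeholder for the HLL estimate - pretend the counter saw 1M
     * rows but only ~100k distinct values (10 rows per value).
     */
    static double
    estimate_ndistinct(void)
    {
        return 100000.0;
    }

    /* round up to the next power of two, as the hash table needs */
    static int
    next_pow2(double n)
    {
        int result = 1;

        while (result < n)
            result *= 2;
        return result;
    }

    int
    main(void)
    {
        double  ntuples = 1000000.0;
        double  ndistinct = estimate_ndistinct();

        /*
         * Sizing by row count allocates ~10x more buckets than needed
         * when each value repeats 10 times; sizing by distinct values
         * keeps the table dense and reduces memory use.
         */
        printf("buckets by rows:     %d\n", next_pow2(ntuples));
        printf("buckets by distinct: %d\n", next_pow2(ndistinct));
        return 0;
    }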


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

