Re: WIP: bloom filter in Hash Joins with batches - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: WIP: bloom filter in Hash Joins with batches
Date
Msg-id 5692BF72.5010401@2ndquadrant.com
In response to Re: WIP: bloom filter in Hash Joins with batches  (Peter Geoghegan <pg@heroku.com>)
Responses Re: WIP: bloom filter in Hash Joins with batches  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
Hi,

On 01/10/2016 04:03 AM, Peter Geoghegan wrote:
> On Sat, Jan 9, 2016 at 4:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Also, have you considered Hash join conditions with multiple
>> attributes as a special case? I'm thinking of cases like this:
>
> Sorry, accidentally fat-fingered my enter key before I was finished
> drafting that mail. That example isn't useful, because there is no
> early/cheap elimination of tuples while scanning the outer relation.
>
> My point about bloom filters on only one attribute (or perhaps
> multiple bloom filters on multiple attributes) is that you might be
> able to make the bloom filter more effective, particularly relative
> to the amount of memory available, by only building it for one
> attribute where that distinction (one attribute vs multiple) happens
> to exist.

I'm not sure what you mean. The patch builds the bloom filter on the hashvalue, 
not on the attribute values directly. I find this very efficient, as we 
don't have to hash the original values again and instead work with just 
a short 32-bit hashvalue.
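
Just to illustrate what I mean, the filter only ever works with the 32-bit
hashvalue, roughly like this (a simplified sketch - the function name and
the double-hashing scheme are just illustrative, not the actual patch code):

    static void
    BloomFilterAdd(BloomFilter filter, uint32 hashvalue)
    {
        /* derive a cheap second hash, then compute k bit positions */
        uint32  h2 = hashvalue * 0x9e3779b1;

        for (int i = 0; i < filter->nhashes; i++)
        {
            uint32  bit = (hashvalue + (uint32) i * h2) % (uint32) filter->nbits;

            filter->data[bit / 8] |= (1 << (bit % 8));
        }
    }

The check is the same loop, just testing the bits instead of setting them.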

I really doubt doing some magic by only building the bloom filter on one 
of the attributes is useful - it's way more complicated, requires 
estimating the ndistinct for each attribute, and also somehow estimating 
the impact on the accuracy of the bloom filter. That seems rather 
complex and I don't have a good idea how to do that.

>
> Simon said: "So the objective must be to get a Bloom Filter that is
> small enough that it lives in a higher/faster level of cache than the
> main Hash table". I agree that that's really important here, although
> I did note that Tomas wasn't so sure, emphasizing the importance of
> avoiding serialization and deserialization overhead -- and certainly,
> that's why the patch helps some cases a lot.

I think the impact of the bloom filter size really depends on the size 
of the hash join. I see three cases, as indicated in the benchmarks I posted:

(a) single batch (1M:10M)
    - fitting into CPU cache really important (otherwise almost no
      improvement)
    - impossible to speed up for very small hash tables (that fit into
      L3 cache)

(b) multiple batches w. medium data set (10M:100M)
    - important, but not as much as for (a), as we still eliminate the
      (de)serialization overhead

(c) large data set (5M:250M)
    - not important, it's mostly about eliminating (de)serialization
      overhead and/or I/O overhead
> But Simon's point applies more to the worst case than the best or
> average cases, and evidently we have plenty of worrying about the
> worst case left to do here.

So what do you consider to be the worst case?

> So, one attribute could have relatively low cardinality, and thus
> fewer distinct items in the bloom filter's set, making it far slower
> to degrade (i.e. have increased probability of false positives) as
> compared to a composite of two or more attributes, and yet perhaps
> not significantly less useful in terms of its ability to eliminate
> (de)serialization overhead. Or, with two bloom filters (one per
> attribute), one bloom filter could degrade far faster than the other
> (due to having more distinct values), often very unpredictably, and
> yet it wouldn't matter because we'd discard the bad one (maybe
> really early, during the scan of the inner relation, using HLL).

I don't think so.

Firstly, I don't know what you mean by "slower to degrade" as the false 
positive rate of a bloom filter does not depend on the cardinality of 
the column used to build the bloom filter, but rather on the "load 
factor" of the bloom filter (i.e. number of items added to the filter 
compared to the initial capacity). Not to mention that you still have to 
estimate the cardinality of the columns, which is tricky (and one of the 
main issues of the current patch, IMHO).
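
For reference, the usual approximation of the false positive rate for a
filter with m bits, k hash functions and n items added is

    p ~ (1 - e^(-k*n/m))^k

which only depends on the load factor n/m and on k, not on what the
column values themselves look like.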

Secondly, by splitting the "composite" bloom filter into per-column 
filters you make it impossible to track correlation between the columns.

For example, let's say we have two attributes (a,b), with 'a' having 1000 
distinct values while 'b' only has 10. But let's assume there's a 
correlation between 'a' and 'b', so that (mod(a,10) = b). If you build 
the bloom filters on the columns separately, they're going to be entirely 
useless - an outer tuple like (a=7, b=3) has no match in the hash table 
(because mod(7,10) = 7), yet both per-column filters will happily pass it, 
as 7 and 3 each occur in their respective columns. Sure, this is a rather 
trivial (and perhaps naive) example, but it shows how easy it is to break it.

I'd also like to point out that the vast majority of joins are on a single 
column, so perhaps we should try to solve those first, and then perhaps 
improve the multi-column case if possible.

>
> I notice that all these example queries involve less-than operators
> (inner relation tuples are thereby prevented from being entered into
> the hash table). It might not be a terrible idea to hash abbreviated
> keys (or even truncated abbreviated keys) for the bloom filter in
> certain cases, in particular when there is a correlation in the inner
> relation attributes (equi-joined attribute, and attribute that is
> other part of qual). There might not be a correlation, of course, but
> with some care hashing abbreviated keys may have virtually no
> additional overhead, making it worth doing despite the general
> uncertainty about any correlation existing. If you're going to accept
> that the bloom filter can't be very large, which I think you must,
> then hashing an entire value may be no better than hashing an
> abbreviated key (those 8 bytes tend to have a lot of entropy). This
> point is rather hand-wavy, though.

I have to admit that I have zero idea how to do that. Also, there's 
nothing really special about the '<' operator - the reason I used it 
is that it makes it very simple to filter an arbitrary fraction of the 
table (i.e. to specify how many tuples of the outer relation have a 
matching tuple in the hash table).

I'm not saying we can't apply the abbreviated keys somehow, but at this 
point it seems rather like a hammer looking for a nail.

>
> I suspect that BLOOM_ERROR_RATE isn't the most useful constraint
> here. Is the degree to which it helps in sympathetic cases at all
> commensurate with the degree to which it hurts in unsympathetic
> cases? In your "low selectivity" cases (the sympathetic cases), there
> are relatively few values represented in the main hash table, and
> therefore in the bloom filter, and so a smaller bloom filter results
> automatically anyway. But in the bad cases, the BLOOM_ERROR_RATE
> constraint just wastes memory bandwidth for no gain, I think. If the
> number of elements is so large that a reasonable fixed sized bloom
> filter has a poor false positive rate anyway, we may have already
> lost.

I agree that having a fixed error rate is somewhat annoying. What I 
imagined would be something like this:
   (1) estimate the number of distinct values in the bloom filter
   (2) see what error rate would make the bloom filter "small enough"
       (so that it fits into L3)
   (3) see if that error rate makes the bloom filter efficient (some
       simple costing, or perhaps a hard-coded threshold)

This however requires knowledge of the L3 cache size for (2), or rather 
how much of it is available for the process (which is tricky to get). 
And then (3) requires some costing function, which seems tricky too.
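
To make (1)-(3) a bit more concrete, the math could look roughly like this 
(just a sketch - the L3 budget is simply a parameter here, and the helper 
name is made up):

    #include <math.h>

    /*
     * Size the filter for a target error rate, clamp it to the cache
     * budget, and report the error rate we'd actually get.
     */
    static double
    bloom_size_and_error(double ndistinct, double target_error,
                         int l3_budget_bits, int *nbits, int *nhashes)
    {
        /* bits needed for the target rate: m = -n * ln(p) / ln(2)^2 */
        double  m = ceil(-ndistinct * log(target_error) /
                         (log(2.0) * log(2.0)));

        /* step (2): force the filter to fit into the cache budget */
        if (m > l3_budget_bits)
            m = l3_budget_bits;

        *nbits = (int) m;
        *nhashes = (int) round(log(2.0) * m / ndistinct);
        if (*nhashes < 1)
            *nhashes = 1;

        /* step (3): expected error rate, p = (1 - e^(-k*n/m))^k */
        return pow(1.0 - exp(-(*nhashes) * ndistinct / m), *nhashes);
    }

The caller would then compare the returned error rate against some 
threshold (or a costing function) to decide whether building the filter 
is worth it at all.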

>
> My sense is that we should try to make the bloom filter as cheap as
> possible more so than trying to make sure it's useful before much
> work has been done.

Well, yeah. Fail fast. But how?

> Is it okay that you don't treat MCV/skew values as special? Actually,
> looking at this code, I think I notice something:
>
> @@ -238,6 +238,15 @@ ExecHashJoin(HashJoinState *node)
>                                                                   hashvalue);
>                  node->hj_CurTuple = NULL;
>
> +               /* If still in the first batch, we check the bloom filter. */
> +               if ((hashtable->curbatch == 0) &&
> +                   (! ExecHashBloomCheckValue(hashtable, hashvalue)))
> +               {
> +                       /* no matches; check for possible outer-join fill */
> +                       node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
> +                       continue;
> +               }
>
> Haven't we already established by now that the value of
> "node->hj_CurSkewBucketNo" may not be INVALID_SKEW_BUCKET_NO, and so
> the value certainly was found in the inner relation scan? Why bother
> doing anything with the bloom filter in that common case? (Note that
> I'm not suggesting that we don't establish "node->hj_CurSkewBucketNo"
> first).

Hmmmm, maybe. The bloom filter should contain all values, including 
those from skew buckets. So what the code could / should do is check the 
bloom filter first, and only do ExecHashGetSkewBucket() if we still 
think the tuple may be in the hash table. So something like this:

    node->hj_CurHashValue = hashvalue;
    node->hj_CurTuple = NULL;

    /* If still in the first batch, we check the bloom filter. */
    if ((hashtable->curbatch == 0) &&
        (! ExecHashBloomCheckValue(hashtable, hashvalue)))
    {
        /* no matches; check for possible outer-join fill */
        node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
        continue;
    }

    ExecHashGetBucketAndBatch(hashtable, hashvalue,
                              &node->hj_CurBucketNo, &batchno);
    node->hj_CurSkewBucketNo = ExecHashGetSkewBucket(hashtable, hashvalue);

Not sure how expensive those two functions (ExecHashGetBucketAndBatch 
and ExecHashGetSkewBucket) are, but this seems like a good idea. Haven't 
tried that, though, so maybe HJ_FILL_OUTER_TUPLE needs that info or 
something like that.

Makes sense?

>
> Also, are you aware of this?
>
> http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf
>
> It talks about bloom filters for hash joins in PostgreSQL
> specifically. Interestingly, they talk about specific TPC-H queries.

Interesting. The way that paper uses bloom filters is very different 
from what I do in the patch. They build the bloom filters and then 
propagate them into the scan nodes to eliminate the tuples early.

The problem is they'd be facing mostly the same problems with sizing the 
bloom filter that I'm facing in the patch. The only way they managed to get 
around them is that they used a data set of fixed size (1GB, which is a 
bit small) and 1kB or 8kB bloom filters.

I suspect they'd get much worse results with data sets of "reasonable" 
size (say, 100GB or more), because the fixed-size bloom filters would 
stop being effective - with an 8kB filter (65,536 bits) and a few million 
distinct inner keys, the false positive rate is essentially 100%.

>
> BTW, I think code like this needs better comments:

Yeah.

>
> +static BloomFilter
> +BloomFilterInit(double nrows, double error)
> +{
> +   /* perhaps we should round nbits to multiples of 8 ? */
> +   int nbits = ceil((nrows * log(error)) / log(1.0 / (pow(2.0, log(2.0)))));
> +   int nhashes = round(log(2.0) * nbits / nrows);
> +
> +   BloomFilter filter = palloc0(offsetof(BloomFilterData, data) +
> ((nbits + 7) / 8));
> +
> +   filter->nbits = nbits;
> +   filter->nhashes = nhashes;
>
> Have you experimentally verified that you get the expected probability
> of false positives in practice?

I have seen that when the bloom filter is properly sized, and the data 
set is not somehow skewed, the bloom filters have about the right false 
positive rate.
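
FWIW, that's easy to check with a throwaway test along these lines 
(standalone sketch, using the same sizing formula as the quoted 
BloomFilterInit; the hash function and double-hashing scheme here are 
just stand-ins):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <math.h>

    #define NROWS   1000000
    #define NPROBES 1000000
    #define ERROR   0.05

    static int nbits;
    static int nhashes;
    static unsigned char *data;

    /* stand-in for a real hash function (murmur3 finalizer) */
    static uint32_t
    hash_key(uint32_t x)
    {
        x ^= x >> 16;
        x *= 0x85ebca6b;
        x ^= x >> 13;
        x *= 0xc2b2ae35;
        x ^= x >> 16;
        return x;
    }

    static void
    bloom_add(uint32_t h)
    {
        uint32_t h2 = h * 0x9e3779b1;

        for (int i = 0; i < nhashes; i++)
        {
            uint32_t bit = (h + (uint32_t) i * h2) % (uint32_t) nbits;

            data[bit / 8] |= (1 << (bit % 8));
        }
    }

    static int
    bloom_contains(uint32_t h)
    {
        uint32_t h2 = h * 0x9e3779b1;

        for (int i = 0; i < nhashes; i++)
        {
            uint32_t bit = (h + (uint32_t) i * h2) % (uint32_t) nbits;

            if (!(data[bit / 8] & (1 << (bit % 8))))
                return 0;
        }
        return 1;
    }

    int
    main(void)
    {
        int false_positives = 0;

        /* same sizing as the quoted BloomFilterInit() */
        nbits = ceil((NROWS * log(ERROR)) /
                     log(1.0 / (pow(2.0, log(2.0)))));
        nhashes = round(log(2.0) * nbits / NROWS);
        data = calloc((nbits + 7) / 8, 1);

        /* insert NROWS keys, then probe with keys that were never added */
        for (uint32_t i = 0; i < NROWS; i++)
            bloom_add(hash_key(i));
        for (uint32_t i = NROWS; i < NROWS + NPROBES; i++)
            false_positives += bloom_contains(hash_key(i));

        printf("false positive rate: %.4f (target %.4f)\n",
               (double) false_positives / NPROBES, ERROR);
        return 0;
    }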

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


