Thread: WIP: bloom filter in Hash Joins with batches

WIP: bloom filter in Hash Joins with batches

From
Tomas Vondra
Date:
Hi,

while working on the Hash Join improvements, I've been repeatedly
running into the idea of a bloom filter - various papers on hash joins
mention bloom filters as a way to optimize access to the hash table by
doing fewer lookups, etc.

Sadly, I've been unable to actually see any real benefit of using a
bloom filter, which I believe is mostly due to NTUP_PER_BUCKET=1, which
makes the lookups much more efficient (so the room for bloom filter
improvements is narrow).

The one case where a bloom filter might still help is when the bloom
filter fits into L3 cache (a few MB) while the hash table (or more
accurately the buckets) does not. Then there's a chance that the bloom
filter (which needs to do multiple lookups) might help.

But I think there's another case where a bloom filter might be way more
useful in Hash Join - when we do batching. What we currently do is
simply:

     1) build the batches for the hash table (inner relation)

     2) read the outer relation (usually the larger one), and split it
        into batches just like the hash table

     3) while doing (2) we join the first batch, and write the remaining
        batches to disk (temporary files)

     4) we read the batches one by one (for both tables) and do the join

Now, imagine that only some of the rows in the outer table actually
match a row in the hash table. Currently we write even those rows into
the temporary files, but with a bloom filter covering the whole hash
table (all the batches at once) we can skip that for some types of joins.

For an inner join we can immediately discard the outer row; for a left
join we can immediately output it. In both cases we completely eliminate
the overhead of writing the tuple to the temporary file and then reading
it again.
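
To make that concrete, here's a minimal standalone sketch of such a
filter, built over the 32-bit hash values we already compute for the
join (the names and layout are made up for illustration - this is not
the patch code):

    #include <stdint.h>
    #include <stdlib.h>

    /* Minimal bloom filter over 32-bit join hash values (sketch only). */
    typedef struct
    {
        uint32_t    nbits;          /* size of the bit array in bits */
        int         nhashes;        /* number of hash functions (k) */
        uint8_t     data[];         /* the bit array itself */
    } BloomFilter;

    static BloomFilter *
    bloom_create(uint32_t nbits, int nhashes)
    {
        BloomFilter *f = calloc(1, sizeof(BloomFilter) + (nbits + 7) / 8);

        f->nbits = nbits;
        f->nhashes = nhashes;
        return f;
    }

    /* derive the i-th probe position from the single 32-bit hash value
     * (double hashing - a cheap way to get k "hash functions" from one) */
    static uint32_t
    bloom_pos(uint32_t hashvalue, int i, uint32_t nbits)
    {
        uint32_t    h1 = hashvalue;
        uint32_t    h2 = hashvalue * 0x9e3779b1 + 1;

        return (h1 + (uint32_t) i * h2) % nbits;
    }

    static void
    bloom_add(BloomFilter *f, uint32_t hashvalue)
    {
        for (int i = 0; i < f->nhashes; i++)
        {
            uint32_t    pos = bloom_pos(hashvalue, i, f->nbits);

            f->data[pos / 8] |= (1 << (pos % 8));
        }
    }

    static int
    bloom_check(const BloomFilter *f, uint32_t hashvalue)
    {
        for (int i = 0; i < f->nhashes; i++)
        {
            uint32_t    pos = bloom_pos(hashvalue, i, f->nbits);

            if (!(f->data[pos / 8] & (1 << (pos % 8))))
                return 0;       /* definitely no match in the inner relation */
        }
        return 1;               /* maybe a match (false positives possible) */
    }

While building the batches, every inner tuple's hash value would go
through bloom_add(); for each outer tuple we'd call bloom_check() before
deciding whether to write it to a temp file (discard it for an inner
join, output it immediately for a left join).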

The attached patch is a PoC of this approach - I'm pretty sure it's not
perfectly correct (e.g. I only tried it with inner join), but it's good
enough for demonstrating the benefits. It's rather incomplete (see the
end of this e-mail), and I'm mostly soliciting some early feedback at
this point.

The numbers presented here are for a test case like this:

     CREATE TABLE dim (id INT, dval TEXT);
     CREATE TABLE fact (id INT, fval TEXT);

     INSERT INTO dim SELECT i, md5(i::text)
                       FROM generate_series(1,10000000) s(i);

     -- repeat 10x
     INSERT INTO fact SELECT * FROM dim;

and a query like this

     SELECT COUNT(fval) FROM fact JOIN dim USING (id) WHERE dval < 'a';

with different values in the WHERE condition to select a fraction of the
inner 'dim' table - this directly affects what portion of the 'fact'
table has a matching row, and also the size of the hash table (and
number of batches).

Now, some numbers from a machine with 8GB of RAM (the 'fact' table has
~6.5GB of data, so there's actually quite a bit of memory pressure,
forcing the temp files to disk etc.).

With work_mem=16MB, it looks like this:

     batches   filter   select.    bloom   master   bloom/master
     -----------------------------------------------------------
           4        1     6.25%    23871    48631         49.09%
           8        2    12.50%    25752    56692         45.42%
           8        3    18.75%    31273    57455         54.43%
          16        4    25.01%    37430    62325         60.06%
          16        5    31.25%    39005    61143         63.79%
          16        6    37.50%    46157    63533         72.65%
          16        7    43.75%    53500    65483         81.70%
          32        8    49.99%    53952    65730         82.08%
          32        9    56.23%    55187    67521         81.73%
          32        a    62.49%    64454    69448         92.81%
          32        b    68.73%    66937    71692         93.37%
          32        c    74.97%    73323    72060        101.75%
          32        d    81.23%    76703    73513        104.34%
          32        e    87.48%    81970    74890        109.45%
          32        f    93.74%    86102    76257        112.91%

The 'batches' column shows how many batches were used for the join,
'filter' is the value used in the WHERE condition, and 'select.' is the
fraction of the 'dim' table that matches the condition (and thus also of
the 'fact' table). 'bloom' and 'master' are query timings in
milliseconds, and 'bloom/master' compares the runtimes - so for example
49% means the hash join with the bloom filter ran ~2x as fast.

Admittedly, work_mem=16MB is quite low, but that's just a way to force
batching. What really matters is the number of batches and selectivity
(how many tuples we can eliminate using the bloom filter).

For work_mem=64MB it looks like this:

     batches   filter   select.    bloom   master   bloom/master
     -----------------------------------------------------------
           1        1     6.25%    24846    23854        104.16%
           2        2    12.50%    24369    45672         53.36%
           2        3    18.75%    30432    47454         64.13%
           4        4    25.01%    36175    59741         60.55%
           4        5    31.25%    43103    62631         68.82%
           4        6    37.50%    48179    64079         75.19%

So initially it's a bit slower (it's not doing any batching in this
case, and while the code does not build the bloom filter it still does
some checks, which is a bit silly). But once we start batching, it gets
2x as fast again, and then slowly degrades as the selectivity increases.

Attached is a spreadsheet with results for various work_mem values, and
also with a smaller data set (just 30M rows in the fact table), which
easily fits into memory. Yet it shows similar gains, shaving off ~40% in
the best case, suggesting that this is not just thanks to reduction of
I/O when forcing the temp files to disk.

As I mentioned, the patch is incomplete in several ways:

   1) It does not count the bloom filter (which may be quite big) into
      work_mem properly.

   2) It probably does not work for outer joins at this point.

   3) Currently the bloom filter is used whenever we do batching, but it
      should really be driven by selectivity too - it'd be good to
      estimate the fraction of 'fact' tuples having a match in the hash
      table, and not to use the bloom filter if it's over ~60% or so.
      Also, maybe the code should count the matches at runtime, and
      disable the bloom filter if we reach some threshold.

But overall, this seems like a nice optimization opportunity for hash
joins on large data sets, where batching is necessary.

Ideas?

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: bloom filter in Hash Joins with batches

From
"Shulgin, Oleksandr"
Date:
On Tue, Dec 15, 2015 at 11:30 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Attached is a spreadsheet with results for various work_mem values, and also with a smaller data set (just 30M rows in the fact table), which easily fits into memory. Yet it shows similar gains, shaving off ~40% in the best case, suggesting that this is not just thanks to reduction of I/O when forcing the temp files to disk.
 
A neat idea!  Have you possibly tried to also collect statistics about actual false-positive rates and filter allocation sizes at each of the collected data points?

--
Alex

Re: WIP: bloom filter in Hash Joins with batches

From
Simon Riggs
Date:
On 15 December 2015 at 22:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

  3) Currently the bloom filter is used whenever we do batching, but it
     should really be driven by selectivity too - it'd be good to
     estimate the fraction of 'fact' tuples having a match in the hash
     table, and not to use the bloom filter if it's over ~60% or so.
     Also, maybe the code should count the matches at runtime, and
     disable the bloom filter if we reach some threshold.

Cool results.

It seems a good idea to build the bloom filter always, then discard it if it would be ineffective.

My understanding is that the bloom filter would be ineffective in any of these cases
* Hash table is too small
* Bloom filter too large
* Bloom selectivity > 50% - perhaps that can be applied dynamically, so stop using it if it becomes ineffective

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: WIP: bloom filter in Hash Joins with batches

From
Tomas Vondra
Date:
Hi,

On 12/17/2015 10:50 AM, Shulgin, Oleksandr wrote:
> On Tue, Dec 15, 2015 at 11:30 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>> Attached is a spreadsheet with results for various work_mem
>> values, and also with a smaller data set (just 30M rows in the fact
>> table), which easily fits into memory. Yet it shows similar gains,
>> shaving off ~40% in the best case, suggesting that this is not just
>> thanks to reduction of I/O when forcing the temp files to disk.
>
> A neat idea! Have you possibly tried to also collect statistics
> about actual false-positive rates and filter allocation sizes at
> each of the collected data points?

The patch counts and prints the total number of lookups, and the number 
of eliminated rows. The bloom filter is currently sized for a 5% false 
positive rate, and the numbers I've seen match that.
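
For reference, the sizing follows the usual bloom filter formulas,
roughly like this (a sketch - the patch uses equivalent math):

    #include <math.h>

    /* nbits = -n * ln(p) / (ln 2)^2,  k = (nbits / n) * ln 2 */
    static void
    bloom_size(double ndistinct, double fpr, int *nbits, int *nhashes)
    {
        *nbits = (int) ceil(-ndistinct * log(fpr) / (log(2.0) * log(2.0)));
        *nhashes = (int) round((double) *nbits / ndistinct * log(2.0));
    }

    /* e.g. ndistinct = 100000, fpr = 0.05  =>  ~623k bits (~76kB), k = 4 */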

I think ultimately we'll need to measure the false positive rate, so 
that we can use it to dynamically disable the bloom filter if it gets 
inefficient. Also maybe put some of that into EXPLAIN ANALYZE.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: bloom filter in Hash Joins with batches

From
Tomas Vondra
Date:
On 12/17/2015 11:44 AM, Simon Riggs wrote:
>
> My understanding is that the bloom filter would be ineffective in any of
> these cases
> * Hash table is too small

Yes, although it depends what you mean by "too small".

Essentially if we can do with a single batch, then it's cheaper to do a 
single lookup in the hash table instead of multiple lookups in the bloom 
filter. The bloom filter might still win if it fits into L3 cache, but 
that seems rather unlikely.

> * Bloom filter too large

Too large with respect to what?

One obvious problem is that the bloom filter is built for all batches at 
once, i.e. for all tuples, so it may be so big it won't fit into work_mem 
(or takes up a significant part of it). Currently it's not accounted for, 
but that'll need to change.

> * Bloom selectivity > 50% - perhaps that can be applied dynamically,
> so stop using it if it becomes ineffective

Yes. I think doing some preliminary selectivity estimation should not be 
difficult - that's pretty much what calc_joinrel_size_estimate() already 
does.

Doing that dynamically is also possible, but quite tricky. Imagine for 
example that the outer relation is sorted - in that case we may get long 
sequences of the same value (hash), and all of them will either have a 
match in the inner relation, or none of them will. That may easily skew 
the counters used for disabling the bloom filter dynamically.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: bloom filter in Hash Joins with batches

From
Oleg Bartunov
Date:
Tomas,

have you seen
http://www.postgresql.org/message-id/4B4DD67F.9010506@sigaev.ru
I have very limited internet connection (no graphics), so I may miss
something

Oleg



Re: WIP: bloom filter in Hash Joins with batches

From
Tomas Vondra
Date:
Hi,

On 12/20/2015 05:46 AM, Oleg Bartunov wrote:
> Tomas,
>
> have you seen
> http://www.postgresql.org/message-id/4B4DD67F.9010506@sigaev.ru
> I have very limited internet connection (no graphics) , so I may miss
> something

I haven't seen that, but I don't really see how that's related - your 
post is about indexes, mine is about building temporary bloom filters 
when executing hash joins.

FWIW, I think bloom filters should be easy to add to BRIN indexes, as 
another type of 'summary'. That should address most of the missing 
pieces in your implementation (e.g. WAL).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: bloom filter in Hash Joins with batches

From
Aleksander Alekseev
Date:
Hello, Tomas.

Great idea!

Did you consider caching the bloom filter, or at least part(s) of it,
somehow? I think this way we could gain some more TPS. This of course
assumes that creating the bloom filter is really a bottleneck here,
which would be nice to investigate first.

Best regards,
Aleksander



Re: WIP: bloom filter in Hash Joins with batches

From
Simon Riggs
Date:
On 17 December 2015 at 16:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 12/17/2015 11:44 AM, Simon Riggs wrote:

My understanding is that the bloom filter would be ineffective in any of
these cases
* Hash table is too small

Yes, although it depends what you mean by "too small".

Essentially if we can do with a single batch, then it's cheaper to do a single lookup in the hash table instead of multiple lookups in the bloom filter. The bloom filter might still win if it fits into L3 cache, but that seems rather unlikely.

* Bloom filter too large

Too large with respect to what?

One obvious problem is that the bloom filter is built for all batches at once, i.e. for all tuples, so it may be so big it won't fit into work_mem (or takes up a significant part of it). Currently it's not accounted for, but that'll need to change.

The benefit seems to be related to caching, or at least that memory speed is critical. If the hash table is too small, or the bloom filter too large, then there would be no benefit from performing the action (Lookup Bloom then maybe Lookup Hash) compared with just doing (LookupHash).

So the objective must be to get a Bloom Filter that is small enough that it lives in a higher/faster level of cache than the main Hash table. Or possibly that we separate that into a two stage process so that the first level can be applied by a GPU and then later checked against hash outside of a GPU.

I think you also need to consider whether we use a hash bloom filter or just simply apply an additional range predicate. The latter idea is similar to my earlier thoughts here http://www.postgresql.org/message-id/CA+U5nMLYf2cgbq+YUw-ArLBTcPrqanBf5QiFEC-PBRJCFzOngg@mail.gmail.com

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: WIP: bloom filter in Hash Joins with batches

From
Tomas Vondra
Date:

On 12/24/2015 02:51 PM, Simon Riggs wrote:
> On 17 December 2015 at 16:00, Tomas Vondra <tomas.vondra@2ndquadrant.com
> <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     On 12/17/2015 11:44 AM, Simon Riggs wrote:
>
>
>         My understanding is that the bloom filter would be ineffective
>         in any of
>         these cases
>         * Hash table is too small
>
>
>     Yes, although it depends what you mean by "too small".
>
>     Essentially if we can do with a single batch, then it's cheaper to
>     do a single lookup in the hash table instead of multiple lookups in
>     the bloom filter. The bloom filter might still win if it fits into
>     L3 cache, but that seems rather unlikely.
>
>         * Bloom filter too large
>
>
>     Too large with respect to what?
>
>     One obvious problem is that the bloom filter is built for all
>     batches at once, i.e. for all tuples, so it may be so big it won't
>     fit into work_mem (or takes up a significant part of it). Currently
>     it's not accounted for, but that'll need to change.
>
>
> The benefit seems to be related to caching, or at least that memory
> speed is critical. If the hash table is too small, or the bloom
> filter too large, then there would be no benefit from performing the
> action (Lookup Bloom then maybe Lookup Hash) compared with just
> doing (LookupHash).
>
> So the objective must be to get a Bloom Filter that is small enough
> that it lives in a higher/faster level of cache than the main Hash
> table. Or possibly that we separate that into a two stage process so
> that the first level can be applied by a GPU and then later checked
> against hash outside of a GPU.

I don't think that's quite true.

Had the primary goal been to replace expensive hashtable lookups with 
cheap bloom filter checks, then sure - the size of the bloom filter 
would be crucial (namely getting the whole bloom filter into L3, which 
is an order of magnitude faster compared to RAM).

But that's not what the patch does - it aims to reduce the amount of 
data that needs to be serialized/deserialized when batching is necessary, 
trading those costs for the bloom filter overhead. The size of the bloom 
filter may still matter, but it's way less important because the 
(de)serialization overhead is much more expensive than the hash lookup.

FWIW I've been trying to use the bloom filters in the "traditional" way 
(doing "cheap" bloom checks before hashtable lookups), but I've been 
unable to get any measurable benefit. Either I was doing it wrong 
somehow, or perhaps NTUP_PER_BUCKET=1 made the hashtable too fast (it 
essentially means we need a single lookup to see if the entry exists, 
while a bloom filter needs several lookups, depending on the desired 
false positive ratio).

So I concluded that the attempt to use the bloom filter that way is 
futile, and that the batching case seems like the only scenario where 
the bloom filters may still be useful.

There's a number of additional problems with sizing the bloom filter to 
fit into L3:

(1) We have no idea how much L3 there is, although this might be fixed
    by either adding a GUC, reading it from /proc/cpuinfo, or just using
    some reasonable default (say, most Xeon CPUs today have ~20 MB).

(2) The L3 is not per-process, but shared by all processes, so we don't
    really have all the L3, but perhaps just L3/N where N is the number
    of backends.

(3) The size of the bloom filter depends on the number of items it
    should handle and the false positive rate. I've used 5% in the
    patch, but maybe it'd make sense to increase this in order to get a
    smaller filter?
 

Also, the bloom filter does not compete with the whole hash table, but 
"just" with the "buckets" as that's sufficient to determine if there's a 
tuple for a key or not.

Let's do some math for a hash table with 100k rows (so a fairly small
hash table):

   * nbuckets = 131072 (2^17), so 1MB
   * bloom filter (5% false positive rate): 76kB, k=4 (see [1])

So both the buckets and bloom fit into L3, and the bloom filter is 
unlikely to speed things up.

The difference between buckets and bloom filter is about 10x, and both 
sizes scale about linearly (with 1M items it'll be ~10MB vs. 760kB), so 
there clearly is an area where bloom filter fits into L3 while buckets 
don't.

Sadly I haven't been able to exploit that, but I'll try again.

In any case, none of this can help unless the bloom filter really 
eliminates many lookups.

[1] http://hur.st/bloomfilter?n=100000&p=0.05


>
> I think you also need to consider whether we use a hash bloom filter or
> just simply apply an additional range predicate. The latter idea is
> similar to my earlier thoughts here
> http://www.postgresql.org/message-id/CA+U5nMLYf2cgbq+YUw-ArLBTcPrqanBf5QiFEC-PBRJCFzOngg@mail.gmail.com

Interesting, but I don't really see how this is related to bloom filters?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: bloom filter in Hash Joins with batches

From
David Rowley
Date:
On 18 December 2015 at 04:34, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
I think ultimately we'll need to measure the false positive rate, so that we can use it to dynamically disable the bloom filter if it gets inefficient. Also maybe put some of that into EXPLAIN ANALYZE.

I'm not so convinced that will be a good idea. What if the filter does not help much to start with, we disable it because of that, and then we get some different probe values later in the scan which the bloom filter would have helped to eliminate.

Maybe it would be better to, once the filter is built, simply count the number of 1 bits and only use the filter if there are fewer than <threshold> 1 bits compared to the size of the filter in bits. There's functionality in bms_num_members() to do this, and there's also __builtin_popcount() in newer versions of GCC, which we could have some wrapper around, perhaps.
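
For illustration, a rough sketch of that kind of check (hypothetical
names, assuming the filter is stored as a plain bit array; the fill
fraction also gives a quick estimate of the false positive rate, roughly
fill^k):

    #include <stdint.h>
    #include <math.h>

    /* fraction of bits set in the filter (GCC/clang builtin, as above) */
    static double
    bloom_fill_fraction(const uint8_t *data, uint32_t nbits)
    {
        uint64_t    ones = 0;

        for (uint32_t i = 0; i < (nbits + 7) / 8; i++)
            ones += __builtin_popcount(data[i]);

        return (double) ones / nbits;
    }

    /* A filter with fill fraction f and k hash functions answers "maybe"
     * for a random non-member with probability roughly f^k, so we can
     * discard a filter that is too full right after building it. */
    static int
    bloom_worth_using(const uint8_t *data, uint32_t nbits, int nhashes,
                      double max_fpr)
    {
        return pow(bloom_fill_fraction(data, nbits), nhashes) <= max_fpr;
    }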

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: bloom filter in Hash Joins with batches

From
Tomas Vondra
Date:
On 12/28/2015 03:15 AM, David Rowley wrote:
> On 18 December 2015 at 04:34, Tomas Vondra <tomas.vondra@2ndquadrant.com
> <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     I think ultimately we'll need to measure the false positive rate, so
>     that we can use it to dynamically disable the bloom filter if it
>     gets inefficient. Also maybe put some of that into EXPLAIN ANALYZE.
>
>
> I'm not so convinced that will be a good idea. What if the filter does
> not help much to start with, we disable it because of that, then we get
> some different probe values later in the scan which the bloom filter
> would have helped to eliminate earlier.

Yes, that's true. This might happen quite easily, for example when the 
outer table is sorted - in that case what we'll see are (possibly long) 
sequences of the same value, which might easily bias the decision.

I don't know how to fix this perfectly (if at all possible). But maybe 
we don't really need to do that. The goal is to improve overall 
performance, and this check (disabling the bloom filter) is meant to 
limit the loss when the filter returns "true" for most values (making it 
somewhat pointless).

The costs however are asymmetric - once we build the hash table (and 
the filter), the cost of dropping the bloom filter is constant, as we 
simply lose the time we spent building it. But the cost of continuing to 
use an inefficient filter is O(N) where N is the number of lookups (i.e. 
rows of the outer table). So it's probably better to drop the filter a 
bit too early and sacrifice the small amount of time we spent building 
it, so that we have a chance of not being much slower than the current 
implementation.

That is not to say we can't come up with better solutions - for example 
we can count (or estimate) the number of distinct hash values we've 
seen, and only make the decision once we see enough of them. For example 
we could use HyperLogLog to do that, or simply count the number of times 
the value changed (hash value not equal to the previous one). That 
should be better than simply waiting for X lookups, but I'm sure it's 
still possible to come up with counter-examples.
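
As a sketch of what such a heuristic might look like (the thresholds are
made up for illustration; the ~60% cutoff simply mirrors the number
floated earlier in the thread):

    #include <stdint.h>

    /* Runtime kill switch (sketch, not patch code): zero-initialize and
     * feed every lookup through bloom_still_useful(). */
    typedef struct
    {
        uint64_t    lookups;
        uint64_t    maybes;         /* filter did not eliminate the tuple */
        uint64_t    runs;           /* hash value changed since last lookup */
        uint32_t    last_hash;
        int         have_last;
    } BloomStats;

    static int
    bloom_still_useful(BloomStats *s, uint32_t hashvalue, int filter_hit)
    {
        s->lookups++;
        s->maybes += (filter_hit != 0);

        /* count "runs" of identical hash values, so a sorted outer input
         * with long runs of the same value cannot bias the decision */
        if (!s->have_last || hashvalue != s->last_hash)
        {
            s->runs++;
            s->last_hash = hashvalue;
            s->have_last = 1;
        }

        if (s->runs < 1000)         /* arbitrary minimum before judging */
            return 1;

        /* keep the filter only if it eliminates a useful share of lookups */
        return (double) s->maybes / s->lookups < 0.6;
    }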

>
> Maybe it would be better to, once the filter is built, simply count the
> number of 1 bits and only use the filter if there's less than
> <threshold> 1 bits compared to the size of the filter in bits. There's
> functionality in bms_num_members() to do this, and there's
> also __builtin_popcount() in newer version of GCC, which we could have
> some wrapper around, perhaps.

I don't really see how this is relevant to the previous point. The 
number of 1s in the bitmap can tell you the false positive rate of the 
bloom filter, not what fraction of lookups will be answered with "true".

So while this needs to be watched, so that we stop using the bloom 
filter when it gets too full (e.g. due to an under-estimate), it's 
completely unrelated to the previous issue.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: bloom filter in Hash Joins with batches

From
David Rowley
Date:
On 28 December 2015 at 23:23, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 12/28/2015 03:15 AM, David Rowley wrote:
Maybe it would be better to, once the filter is built, simply count the
number of 1 bits and only use the filter if there's less than
<threshold> 1 bits compared to the size of the filter in bits. There's
functionality in bms_num_members() to do this, and there's
also __builtin_popcount() in newer version of GCC, which we could have
some wrapper around, perhaps.

I don't really see how this is relevant with the previous point. The number of 1s in the bitmap can tell you the false positive rate for the bloom filter, not what fraction of lookups will be answered with "true".

So while this needs to be watched, so that we stop using the bloom filter when it gets too full (e.g. due to under-estimate), it's completely unrelated to the previous issue.

Why is it not related? This has got me confused. If a bloom filter has all of the bits set to 1, then it will never filter any Tuples, right?

If so, then a filter with all 1 bits set should be thrown away, as it'll never help us, and the filter should generally become more worthwhile as it contains a higher ratio of 0 bits vs 1 bits. Of course we don't have a count of how many Tuples matched each bit, so this is based on the assumption that each bit matches an equal number of Tuples. Are you saying this is not an assumption that we should make?


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: bloom filter in Hash Joins with batches

From
Tomas Vondra
Date:
On 12/28/2015 11:38 AM, David Rowley wrote:
> On 28 December 2015 at 23:23, Tomas Vondra <tomas.vondra@2ndquadrant.com
> <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     On 12/28/2015 03:15 AM, David Rowley wrote:
>
>         Maybe it would be better to, once the filter is built, simply
>         count the
>
>         number of 1 bits and only use the filter if there's less than
>         <threshold> 1 bits compared to the size of the filter in bits.
>         There's
>         functionality in bms_num_members() to do this, and there's
>         also __builtin_popcount() in newer version of GCC, which we
>         could have
>         some wrapper around, perhaps.
>
>
>     I don't really see how this is relevant with the previous point. The
>     number of 1s in the bitmap can tell you the false positive rate for
>     the bloom filter, not what fraction of lookups will be answered with
>     "true".
>
>     So while this needs to be watched, so that we stop using the bloom
>     filter when it gets too full (e.g. due to under-estimate), it's
>     completely unrelated to the previous issue.
>
>
> Why is it not related? this has got me confused. If a bloom filter has
> all of the bits set to 1, then it will never filter any Tuples right?

Because the false positive rate can be computed without having to look 
at the lookups. So it's inherently independent of the ordering of the 
outer relation, and so on.

> If so, then a filter with all 1 bits set should be thrown away, as
> it'll never help us, and the filter should generally become more
> worthwhile as it contains a higher ratio of 0 bits vs 1 bits. Of
> course we don't have a count of how many Tuples matched each bit, so
> this is based on the assumption that each bit matches an equal number
> of Tuples. Are you saying this is not an assumption that we should
> make?

Sure we should check that. All I'm saying is it has nothing to do with 
the first problem described in the first part of the e-mail.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: bloom filter in Hash Joins with batches

From
David Rowley
Date:
On 28 December 2015 at 23:44, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 12/28/2015 11:38 AM, David Rowley wrote:
If so, then a filter with all 1 bits set should be thrown away, as
it'll never help us, and the filter should generally become more
worthwhile as it contains a higher ratio of 0 bits vs 1 bits. Of
course we don't have a count of how many Tuples matched each bit, so
this is based on the assumption that each bit matches an equal number
of Tuples. Are you saying this is not an assumption that we should
make?

Sure we should check that. All I'm saying is it has nothing to do with the first problem described in the first part of the e-mail.

Okay. I was merely suggesting this method as an alternative to tracking and checking the usefulness of the filter during the hash probe. I assumed that tracking and checking the usefulness during the probe won't be free, and that it may be better if we can estimate or determine the expected usefulness of the filter before the probe stage, and throw it away before we waste cycles using it.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: bloom filter in Hash Joins with batches

From
Tomas Vondra
Date:
On 12/28/2015 11:52 AM, David Rowley wrote:
> On 28 December 2015 at 23:44, Tomas Vondra <tomas.vondra@2ndquadrant.com
> <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     On 12/28/2015 11:38 AM, David Rowley wrote:
>
>         If so, then a filter with all 1 bits set should be thrown away, as
>
>         it'll never help us, and the filter should generally become more
>         worthwhile as it contains a higher ratio of 0 bits vs 1 bits. Of
>         course we don't have a count of how many Tuples matched each bit, so
>         this is based on the assumption that each bit matches an equal
>         number
>         of Tuples. Are you saying this is not an assumption that we should
>         make?
>
>
>     Sure we should check that. All I'm saying is it has nothing to do
>     with the first problem described in the first part of the e-mail.
>
>
> Okay. I was merely suggesting this method as an alternative to
> tracking and checking the usefulness of the filter during the hash
> probe. I assumed that tracking and checking the usefulness during the
> probe won't be free, and that it may be better if we can estimate
> or determine the expected usefulness of the filter before the probe
> stage, and throw it away before we waste cycles using it.

Consider this example:

CREATE TABLE t (id INT);
INSERT INTO t SELECT i FROM generate_series(1,1000000) s(i);
ANALYZE;

SELECT * FROM t AS t1 JOIN t AS t2 ON (t1.id = t2.id) WHERE t1.id < 10000 AND t2.id < 10000;

This will be executed like this:
                              QUERY PLAN
-----------------------------------------------------------------------
 Hash Join  (cost=17046.26..34008.58 rows=94 width=8)
   Hash Cond: (t1.id = t2.id)
   ->  Seq Scan on t t1  (cost=0.00..16925.00 rows=9701 width=4)
         Filter: (id < 10000)
   ->  Hash  (cost=16925.00..16925.00 rows=9701 width=4)
         ->  Seq Scan on t t2  (cost=0.00..16925.00 rows=9701 width=4)
               Filter: (id < 10000)
(7 rows)

But of course the problem is that the two relations are (trivially) 
correlated, which means that in reality it works like this:
                       QUERY PLAN
---------------------------------------------------------
 Hash Join (actual rows=9999 loops=1)
   Hash Cond: (t1.id = t2.id)
   ->  Seq Scan on t t1 (actual rows=9999 loops=1)
         Filter: (id < 10000)
         Rows Removed by Filter: 990001
   ->  Hash (actual rows=9999 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 480kB
         ->  Seq Scan on t t2 (actual rows=9999 loops=1)
               Filter: (id < 10000)
               Rows Removed by Filter: 990001
 Planning time: 0.316 ms
 Execution time: 176.283 ms
(12 rows)

So while we have very good estimates on the scan nodes, the final 
estimate is off - we expect the bloom filter to eliminate ~99% of the 
rows, while in reality 100% of the rows match (due to the correlation). 
And that's even if the bloom filter is "perfect" in the sense that it 
has a very low false positive rate etc.

This example illustrates that such cases can't be really solved before 
actually doing the lookups. Which does not make the checks you propose 
pointless, but they simply address different cases (and I indeed planned 
to implement them).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: bloom filter in Hash Joins with batches

From
Tomas Vondra
Date:
Hi,

attached is v2 of the patch, with a number of improvements:

0) This relies on the other hashjoin patches (delayed build of
    buckets and batching), as it allows sizing the bloom filter.

1) enable_hashjoin_bloom GUC

    This is mostly meant for debugging and testing, not for committing.

2) Outer joins should be working fine now. That is, the results should
    be correct and faster as the outer rows without matches should not
    be batched at all.

3) The bloom filter is now built for all hash joins, not just when
    batching is happening. I've been a bit skeptical about the
    non-batched cases, but it seems that I can get a sizable speedup
    (~20-30%, depending on the selectivity of the join).

4) The code is refactored quite a bit, adding BloomFilterData instead
    of just sprinkling the fields on HashJoinState or HashJoinTableData.

5) To size the bloom filter, we now use a HyperLogLog counter, which we
    now have in core thanks to the sorting improvements done by Peter
    Geoghegan. This allows making the bloom filter much smaller when
    possible.

    The patch also extends the HyperLogLog API a bit (which I'll submit
    to the CF independently).


There's a bunch of comments in the code, mostly with ideas about more
possible improvements.

The main piece missing in the patch (IMHO) is optimizer code making
decisions whether to enable bloom filters for the hash join, based on
cardinality estimates. And also the executor code disabling the bloom
filter if they turn inefficient. I don't think that's a major issue at
this point, though, and I think it'll be easier to do based on testing
the current patch.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: bloom filter in Hash Joins with batches

From
Tomas Vondra
Date:
Hi,

attached are results for some basic performance evaluation of the patch
(and also scripts used for the evaluation). I've used a simple join of
two tables ("fact" and "dim"), with three different dataset sizes.

CREATE TABLE dim (id INT, r INT, val TEXT);
CREATE TABLE fact (id INT, val TEXT);

-- 1M rows into "dim"
INSERT INTO dim
SELECT i, mod(i,100), md5(i::text) FROM generate_series(1,1000000) s(i);

-- 10M rows into "fact"
INSERT INTO fact
SELECT mod(i,1000000)+1, md5((mod(i,1000000)+1)::text)
   FROM generate_series(1,10000000) s(i);

Which means the "dim.r" column has 100 different values (0-99) with
uniform distribution. So e.g. "WHERE r < 15" matches 15%.

There are three dataset sizes:

1) small: dim 1M rows (73MB), fact 10M rows (650MB)

2) medium: dim: 10M rows (730MB), fact 100M rows (6.5GB)

3) large: dim: 5M rows (365MB), fact 250M rows (16GB)

The machine has 8GB of RAM, so "small" easily fits into RAM, "medium" is
just at the border, and "large" is clearly over (especially when batching).

For each dataset size there are two queries - one with inner and one
with outer join, with a filter on the smaller one ("dim") determining
the "selectivity".

-- inner join
SELECT COUNT(dim.val), COUNT(fact.val)
   FROM fact JOIN dim ON (fact.id = dim.id) WHERE r < $1

-- outer join
SELECT COUNT(dim.val), COUNT(fact.val)
   FROM fact LEFT JOIN (SELECT * FROM dim WHERE r < $1) dim
                    ON (fact.id = dim.id)

Those queries were executed with various work_mem sizes (to either use
no batching or different number of batches), and also with different
filter values (to filter to 10%, 20%, 30%, ...., 100% of data).

After collecting data both on master and with all the patches applied,
I've computed the speedup factor as

     (time with patches) / (time on master)

and plotted that as pivot tables with a heat map. Values < 1.0 (red) mean
"the bloom filter made the query faster", while values > 1.0 (blue) mean
it slowed the query down.

Let's quickly skim through the results for each dataset size. The
attached charts only show results for the "inner" queries, but the
results for "outer" queries are pretty much exactly the same.


1) small (hash-joins-bloom-1-10.png)
------------------------------------

There's a clear speedup as long as the hash join requires batching and
the selectivity is below 50-60%. In the best case (10% selectivity,
small work_mem and thus multiple batches) we save up to ~30% of time.

Once we get to non-batching mode, the bloom filter is ineffective no
matter the selectivity. I assume this is because for the small
selectivities (e.g. 10%) it's simply cheaper to check the hash table,
which is small and likely fits into L3 on the CPU (which is ~6MB on the
i5-2500 CPU in the system).


2) medium (hash-joins-bloom-10-100.png)
---------------------------------------

The results are similar to the "small" results, except that the speedup
is somewhat larger (up to 40%), and there's no sudden jump when the
queries stop batching (work_mem=1GB is enough to run the query in a
single batch even with 100% selectivity). The "break-even" selectivity
is again somewhere around 50% (a bit lower in non-batching mode).


3) large (hash-joins-bloom-5-250.png)
-------------------------------------

Pretty much the same as "medium".


So, this seems to bring reasonable speedup, as long as the selectivity
is below 50%, and the data set is sufficiently large.

It's however clear that the patch won't work without some sort of
selectivity estimation - either at planning time or at execution time
(or both). I've speculated about how to do that, but the current patches
don't implement any of that yet and any ideas are welcome.

Another open question is sizing the bloom filter in the multi-batch
case, which turns out to be quite tricky. What the patch does right now
is estimating the number of distinct values to store in the filter based
on the first batch (the other hashjoin patch in this CF makes this
possible). That is, there's a hyperloglog counter that gives us (very
accurate) ndistinct estimate for the first batch, and then we do (nbatch
* ndistinct), to estimate the ndistinct values in the whole data set.
That estimate is of course rather naive, and often comes out too high -
for example we might have already seen all the distinct values in the
first batch. So the result is a bloom filter much larger than necessary.
Not sure how to fix this :-(
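
In code, the naive scaling described above amounts to something like
this (a sketch, not the actual patch code):

    #include <math.h>

    /*
     * Naive multi-batch sizing: take the (accurate) ndistinct estimate for
     * the first batch and scale it by the number of batches, then size the
     * filter for the target false positive rate.  This over-sizes the
     * filter whenever later batches mostly repeat values already seen in
     * the first batch.
     */
    static void
    bloom_size_batched(double ndistinct_batch0, int nbatch, double fpr,
                       int *nbits, int *nhashes)
    {
        double      ndistinct = ndistinct_batch0 * nbatch;  /* upper bound */

        *nbits = (int) ceil(-ndistinct * log(fpr) / (log(2.0) * log(2.0)));
        *nhashes = (int) round((double) *nbits / ndistinct * log(2.0));
    }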


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: bloom filter in Hash Joins with batches

From
Peter Geoghegan
Date:
On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> So, this seems to bring reasonable speedup, as long as the selectivity is
> below 50%, and the data set is sufficiently large.

What about semijoins? Apparently they can use bloom filters
particularly effectively. Have you considered them as a special case?

Also, have you considered Hash join conditions with multiple
attributes as a special case? I'm thinking of cases like this:

regression=# set enable_mergejoin = off;
SET
regression=# explain analyze select * from tenk1 o join tenk2 t on
o.twenty = t.twenty and t.hundred = o.hundred;
                                                      QUERY PLAN
──────────────────────────────────────────────────────────────────────
 Hash Join  (cost=595.00..4103.00 rows=50000 width=488) (actual time=12.086..1026.194 rows=1000000 loops=1)
   Hash Cond: ((o.twenty = t.twenty) AND (o.hundred = t.hundred))
   ->  Seq Scan on tenk1 o  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.017..4.212 rows=10000 loops=1)
   ->  Hash  (cost=445.00..445.00 rows=10000 width=244) (actual time=12.023..12.023 rows=10000 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 2824kB
         ->  Seq Scan on tenk2 t  (cost=0.00..445.00 rows=10000 width=244) (actual time=0.006..3.453 rows=10000 loops=1)
 Planning time: 0.567 ms
 Execution time: 1116.094 ms
(8 rows)

(Note that while the optimizer has a slight preference for a merge
join in this case, the plan I show here is a bit faster on my
machine).


--
Peter Geoghegan



Re: WIP: bloom filter in Hash Joins with batches

From
Peter Geoghegan
Date:
On Sat, Jan 9, 2016 at 4:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Also, have you considered Hash join conditions with multiple
> attributes as a special case? I'm thinking of cases like this:

Sorry, accidentally fat-fingered my enter key before I was finished
drafting that mail. That example isn't useful, because there is no
early/cheap elimination of tuples while scanning the outer relation.

My point about bloom filters on only one attribute (or perhaps
multiple bloom filters on multiple attributes) is that you might be
able to make the bloom filter more effective, particularly relative to
the amount of memory available, by only building it for one attribute
where that distinction (one attribute vs multiple) happens to exist.

Simon said: "So the objective must be to get a Bloom Filter that is
small enough that it lives in a higher/faster level of cache than the
main Hash table". I agree that that's really important here, although
I did note that Tomas wasn't so sure, emphasizing the importance of
avoiding serialization and deserialization overhead -- and certainly,
that's why the patch helps some cases a lot. But Simon's point applies
more to the worst case than the best or average cases, and evidently
we have plenty of worrying about the worst case left to do here.

So, one attribute could have relatively low cardinality, and thus
fewer distinct items in the bloom filter's set, making it far slower
to degrade (i.e. have increased probability of false positives) as
compared to a composite of two or more attributes, and yet perhaps not
significantly less useful in terms of its ability to eliminate
(de)serialization overhead. Or, with two bloom filters (one per
attribute), one bloom filter could degrade far faster than the other
(due to having more distinct values), often very unpredictably, and
yet it wouldn't matter because we'd discard the bad one (maybe really
early, during the scan of the inner relation, using HLL).

I notice that all these example queries involve less-than operators
(inner relation tuples are thereby prevented from being entered into
the hash table). It might not be a terrible idea to hash abbreviated
keys (or even truncated abbreviated keys) for the bloom filter in
certain cases, in particular when there is a correlation in the inner
relation attributes (equi-joined attribute, and attribute that is
other part of qual). There might not be a correlation, of course, but
with some care hashing abbreviated keys may have virtually no
additional overhead, making it worth doing despite the general
uncertainty about any correlation existing. If you're going to accept
that the bloom filter can't be very large, which I think you must,
then hashing an entire value may be no better than hashing an
abbreviated key (those 8 bytes tend to have a lot of entropy). This
point is rather hand-wavy, though.

I suspect that BLOOM_ERROR_RATE isn't the most useful constraint here.
Is the degree to which it helps in sympathetic cases at all
commensurate with the degree to which it hurts in unsympathetic cases?
In your "low selectivity" cases (the sympathetic cases), there are
relatively few values represented in the main hash table, and
therefore in the bloom filter, and so a smaller bloom filter results
automatically anyway. But in the bad cases, the BLOOM_ERROR_RATE
constraint just wastes memory bandwidth for no gain, I think. If the
number of elements is so large that a reasonable fixed sized bloom
filter has a poor false positive rate anyway, we may have already
lost.

My sense is that we should try to make the bloom filter as cheap as
possible more so than trying to make sure it's useful before much work
has been done.

Is it okay that you don't treat MCV/skew values as special? Actually,
looking at this code, I think I notice something:

@@ -238,6 +238,15 @@ ExecHashJoin(HashJoinState *node)
                                                             hashvalue);
                node->hj_CurTuple = NULL;

+               /* If still in the first batch, we check the bloom filter. */
+               if ((hashtable->curbatch == 0) &&
+                   (! ExecHashBloomCheckValue(hashtable, hashvalue)))
+               {
+                       /* no matches; check for possible outer-join fill */
+                       node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
+                       continue;
+               }

Haven't we already established by now that the value of
"node->hj_CurSkewBucketNo" may not be INVALID_SKEW_BUCKET_NO, and so
the value certainly was found in the inner relation scan? Why bother
doing anything with the bloom filter in that common case? (Note that
I'm not suggesting that we don't establish "node->hj_CurSkewBucketNo"
first).

Also, are you aware of this?

http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf

It talks about bloom filters for hash joins in PostgreSQL
specifically. Interestingly, they talk about specific TPC-H queries.

BTW, I think code like this needs better comments:

+static BloomFilter
+BloomFilterInit(double nrows, double error)
+{
+   /* perhaps we should round nbits to multiples of 8 ? */
+   int nbits = ceil((nrows * log(error)) / log(1.0 / (pow(2.0, log(2.0)))));
+   int nhashes = round(log(2.0) * nbits / nrows);
+
+   BloomFilter filter = palloc0(offsetof(BloomFilterData, data) +
+                                ((nbits + 7) / 8));
+
+   filter->nbits = nbits;
+   filter->nhashes = nhashes;

Have you experimentally verified that you get the expected probability
of false positives in practice?
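
A quick way to sanity-check that, reusing the bloom_size() /
bloom_create() / bloom_add() / bloom_check() sketches from earlier in
the thread (so not the actual patch code), might look like this:

    #include <stdio.h>
    #include <stdint.h>

    /* Fill the filter with n values, then probe n values that were not
     * inserted and compare the observed false positive rate to the target. */
    int
    main(void)
    {
        int         nbits,
                    nhashes;
        int         n = 100000;
        int         fp = 0;

        bloom_size(n, 0.05, &nbits, &nhashes);

        BloomFilter *f = bloom_create(nbits, nhashes);

        /* insert n scrambled "hash values", then probe n values not inserted */
        for (uint32_t i = 0; i < (uint32_t) n; i++)
            bloom_add(f, i * 2654435761u);
        for (uint32_t i = (uint32_t) n; i < (uint32_t) (2 * n); i++)
            fp += bloom_check(f, i * 2654435761u);

        printf("observed false positive rate: %.4f (target 0.05)\n",
               (double) fp / n);
        return 0;
    }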

-- 
Peter Geoghegan



Re: WIP: bloom filter in Hash Joins with batches

From
Peter Geoghegan
Date:
On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Which means the "dim.r" column has 100 different values (0-99) with uniform
> distribution. So e.g. "WHERE r < 15" matches 15%.

I think that the use of a uniform distribution to demonstrate this
patch is a bad idea, unless you want to have a conversation about the
worst case.

Look at the use cases for bloom filters in general. They're almost all
some variation on the same theme: checking a bloom filter
inexpensively, to avoid an expensive cache miss (of some fashion). The
Google Chrome web browser uses a bloom filter for malicious URLs. It
usually avoids consulting Google's servers about whether or not any
given URL that is visited is malicious by using the bloom filter. This
is effective in large part because the top 100 websites ranked by
popularity have a huge falloff in popularity as you go down the list.
It looks like a Zipfian distribution, and so I imagine they get pretty
far with a very small bloom filter, even though in theory the chances
of any given valid URL resulting in a cache miss is very high.

Generally, uniform distributions are rare in a non-canonical list of
things, like a hash join outer relation's attribute.

-- 
Peter Geoghegan



Re: WIP: bloom filter in Hash Joins with batches

From
Tomas Vondra
Date:
Hi,

On 01/10/2016 01:08 AM, Peter Geoghegan wrote:
> On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> So, this seems to bring reasonable speedup, as long as the selectivity is
>> below 50%, and the data set is sufficiently large.
>
> What about semijoins? Apparently they can use bloom filters
> particularly effectively. Have you considered them as a special case?

You mean to handle them in a special way in the code, or just to 
benchmark semijoins (and not just regular joins)?

> Also, have you considered Hash join conditions with multiple
> attributes as a special case? I'm thinking of cases like this:
>
> regression=# set enable_mergejoin = off;
> SET
> regression=# explain analyze select * from tenk1 o join tenk2 t on
> o.twenty = t.twenty and t.hundred = o.hundred;
>                                                         QUERY PLAN
> ──────────────────────────────────────────────────────────────────────
>   Hash Join  (cost=595.00..4103.00 rows=50000 width=488) (actual
> time=12.086..1026.194 rows=1000000 loops=1)
>     Hash Cond: ((o.twenty = t.twenty) AND (o.hundred = t.hundred))
>     ->  Seq Scan on tenk1 o  (cost=0.00..458.00 rows=10000 width=244)
> (actual time=0.017..4.212 rows=10000 loops=1)
>     ->  Hash  (cost=445.00..445.00 rows=10000 width=244) (actual
> time=12.023..12.023 rows=10000 loops=1)
>           Buckets: 16384  Batches: 1  Memory Usage: 2824kB
>           ->  Seq Scan on tenk2 t  (cost=0.00..445.00 rows=10000
> width=244) (actual time=0.006..3.453 rows=10000 loops=1)
>   Planning time: 0.567 ms
>   Execution time: 1116.094 ms
> (8 rows)
>
> (Note that while the optimizer has a slight preference for a merge
> join in this case, the plan I show here is a bit faster on my
> machine).

The patch I posted actually does not build the bloom filter on the values 
directly, but on the hashvalue we already use. So it handles hash joins 
with an arbitrary number of attributes just fine.

Or perhaps you're thinking about some special optimization that might 
improve such cases (I can't think of any)?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: bloom filter in Hash Joins with batches

From
Tomas Vondra
Date:
Hi,

On 01/10/2016 04:03 AM, Peter Geoghegan wrote:
> On Sat, Jan 9, 2016 at 4:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Also, have you considered Hash join conditions with multiple
>> attributes as a special case? I'm thinking of cases like this:
>
> Sorry, accidentally fat-fingered my enter key before I was finished
> drafting that mail. That example isn't useful, because there is no
> early/cheap elimination of tuples while scanning the outer relation.
>
> My point about bloom filters on only one attribute (or perhaps
> multiple bloom filters on multiple attributes) is that you might be
> able to make the bloom filter more effective, particularly relative
> to the amount of memory available, by only building it for one
> attribute where that distinction (one attribute vs multiple) happens
> to exist.

I'm not sure what you mean. The patch builds the bloom filter on the 
hashvalue, not on the attribute values directly. I find this very 
efficient, as we don't have to hash the original values again and can 
instead work with just a short 32-bit hashvalue.

I really doubt doing some magic by only building the bloom filter on one 
of the attributes is useful - it's way more complicated, requires 
estimating the ndistinct for each attribute, and also somehow estimating 
the impact on the accuracy of the bloom filter. That seems rather 
complex and I don't have a good idea how to do that.

>
> Simon said: "So the objective must be to get a Bloom Filter that is
> small enough that it lives in a higher/faster level of cache than the
> main Hash table". I agree that that's really important here, although
> I did note that Tomas wasn't so sure, emphasizing the importance of
> avoiding serialization and deserialization overhead -- and certainly,
> that's why the patch helps some cases a lot.

I think the impact of the bloom filter size really depends on the size 
of the hash join. I see three cases, as indicated in the benchmarks I 
posted:

(a) single batch (1M:10M)
    - fitting into CPU cache really important (otherwise almost no
      improvement)
    - impossible to speed up for very small hash tables (that fit into
      L3 cache)

(b) multiple batches w. medium data set (10M:100M)
    - important, but not as much as for (a), as we still eliminate the
      (de)serialization overhead

(c) large data set (5M:250M)
    - not important, it's mostly about eliminating (de)serialization
      overhead and/or I/O overhead

> But Simon's point applies more to the worst case than the best or
> average cases, and evidently we have plenty of worrying about the
> worst case left to do here.

So what do you consider to be the worst case?

> So, one attribute could have relatively low cardinality, and thus
> fewer distinct items in the bloom filter's set, making it far slower
> to degrade (i.e. have increased probability of false positives) as
> compared to a composite of two or more attributes, and yet perhaps
> not significantly less useful in terms of its ability to eliminate
> (de)serialization overhead. Or, with two bloom filters (one per
> attribute), one bloom filter could degrade far faster than the other
> (due to having more distinct values), often very unpredictably, and
> yet it wouldn't matter because we'd discard the bad one (maybe
> really early, during the scan of the inner relation, using HLL).

I don't think so.

Firstly, I don't know what you mean by "slower to degrade" as the false 
positive rate of a bloom filter does not depend on the cardinality of 
the column used to build the bloom filter, but rather on the "load 
factor" of the bloom filter (i.e. number of items added to the filter 
compared to the initial capacity). Not to mention that you still have to 
estimate the cardinality of the columns, which is tricky (and one of the 
main issues of the current patch, IMHO).
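
For reference, the usual textbook approximation of the false positive 
rate (just a sketch, not code from the patch) only depends on how many 
items were added (n), the number of bits (m) and the number of hash 
functions (k):

    #include <math.h>

    /*
     * Textbook estimate of a bloom filter's false positive rate:
     * p ~= (1 - e^(-k*n/m))^k.  The column's cardinality only enters
     * through nitems (how many distinct values actually got added),
     * i.e. through the load factor.
     */
    static double
    bloom_false_positive_rate(double nitems, double nbits, int nhashes)
    {
        return pow(1.0 - exp(-nhashes * nitems / nbits), nhashes);
    }

With roughly 10 bits per added item and 7 hash functions this comes out 
at about 1%, no matter what the underlying column values look like.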

Secondly, by splitting the "composite" bloom filter into per-column 
filters you make it impossible to track correlation between the columns.

For example let's say we have two attributes (a,b), 'a' having 1000 
distinct values while 'b' only has 10. But let's assume that there's a 
correlation between 'a' and 'b' so that (mod(a,10) = b). If you build 
the bloom filter on the columns separately, it's going to be entirely 
useless. Sure, this is a rather trivial (and perhaps naive) example, but 
it shows how easy it is to break it.

I'd also like to point out that the vast majority of joins are on a 
single column, so perhaps we should try to solve those first, and then 
improve the multi-column case if possible.

>
> I notice that all these example queries involve less-than operators
> (inner relation tuples are thereby prevented from being entered into
> the hash table). It might not be a terrible idea to hash abbreviated
> keys (or even truncated abbreviated keys) for the bloom filter in
> certain cases, in particular when there is a correlation in the inner
> relation attributes (equi-joined attribute, and attribute that is
> other part of qual). There might not be a correlation, of course, but
> with some care hashing abbreviated keys may have virtually no
> additional overhead, making it worth doing despite the general
> uncertainty about any correlation existing. If you're going to accept
> that the bloom filter can't be very large, which I think you must,
> then hashing an entire value may be no better than hashing an
> abbreviated key (those 8 bytes tend to have a lot of entropy). This
> point is rather hand-wavy, though.

I have to admit that I have zero idea how to do that. Also, there's 
nothing really special about the '<' operator - the reason I used it 
is that it makes it very simple to filter an arbitrary fraction of the 
table (i.e. to control how many tuples of the outer relation have a 
matching tuple in the hash table).

I'm not saying we can't apply the abbreviated keys somehow, but at this 
point it seems rather like a hammer looking for a nail.

>
> I suspect that BLOOM_ERROR_RATE isn't the most useful constraint
> here. Is the degree to which it helps in sympathetic cases at all
> commensurate with the degree to which it hurts in unsympathetic
> cases? In your "low selectivity" cases (the sympathetic cases), there
> are relatively few values represented in the main hash table, and
> therefore in the bloom filter, and so a smaller bloom filter results
> automatically anyway. But in the bad cases, the BLOOM_ERROR_RATE
> constraint just wastes memory bandwidth for no gain, I think. If the
> number of elements is so large that a reasonable fixed sized bloom
> filter has a poor false positive rate anyway, we may have already
> lost.

I agree that having a fixed error rate is somewhat annoying. What I 
imagined would be something like this (see the sketch below):

   (1) estimate the number of distinct values going into the bloom
       filter
   (2) see what error rate would make the bloom filter "small enough"
       (so that it fits into L3)
   (3) see if the error rate makes the bloom filter efficient (some
       simple costing, or perhaps a hard-coded threshold)

This however requires knowledge of the L3 cache size for (2), or rather 
how much of it is available for the process (which is tricky to get). 
And then (3) requires some costing function, which seems tricky too.
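
To make that a bit more concrete, a rough sketch of steps (1)-(3) - the 
L3 budget, the error threshold and the helper name are all made up, and 
the ndistinct estimate from (1) is simply taken as an input:

    #include <math.h>
    #include <stdbool.h>

    /* both knobs are made-up numbers, just to illustrate the flow */
    #define BLOOM_L3_BUDGET     (4 * 1024 * 1024)   /* assumed per-process share */
    #define BLOOM_MAX_ERROR     0.25                /* filter useless above this */

    static bool
    bloom_filter_worth_it(double ndistinct, int *nbits, int *nhashes)
    {
        /* (2) cap the filter size so that it hopefully stays in L3 */
        double      bits = fmin(ndistinct * 10.0,   /* ~1% error if uncapped */
                                (double) BLOOM_L3_BUDGET * 8.0);
        int         k = (int) fmax(1.0, round(log(2.0) * bits / ndistinct));
        double      error = pow(1.0 - exp(-k * ndistinct / bits), k);

        /* (3) bail out if the capped filter would be too inaccurate */
        if (error > BLOOM_MAX_ERROR)
            return false;

        *nbits = (int) bits;
        *nhashes = k;
        return true;
    }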

>
> My sense is that we should try to make the bloom filter as cheap as
> possible more so than trying to make sure it's useful before much
> work has been done.

Well, yeah. Fail fast. But how?

> Is it okay that you don't treat MCV/skew values as special? Actually,
> looking at this code, I think I notice something:
>
> @@ -238,6 +238,15 @@ ExecHashJoin(HashJoinState *node)
>                                                                   hashvalue);
>                  node->hj_CurTuple = NULL;
>
> +               /* If still in the first batch, we check the bloom filter. */
> +               if ((hashtable->curbatch == 0) &&
> +                   (! ExecHashBloomCheckValue(hashtable, hashvalue)))
> +               {
> +                       /* no matches; check for possible outer-join fill */
> +                       node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
> +                       continue;
> +               }
>
> Haven't we already established by now that the value of
> "node->hj_CurSkewBucketNo" may not be INVALID_SKEW_BUCKET_NO, and so
> the value certainly was found in the inner relation scan? Why bother
> doing anything with the bloom filter in that common case? (Note that
> I'm not suggesting that we don't establish "node->hj_CurSkewBucketNo"
> first).

Hmmmm, maybe. The bloom filter should contain all values, including 
those from skew buckets. So what the code could / should do is check the 
bloom filter first, and only do ExecHashGetSkewBucket() if we still 
think the tuple may be in the hash table. So something like this:

    node->hj_CurHashValue = hashvalue;
    node->hj_CurTuple = NULL;

    /* If still in the first batch, we check the bloom filter. */
    if ((hashtable->curbatch == 0) &&
        (! ExecHashBloomCheckValue(hashtable, hashvalue)))
    {
        /* no matches; check for possible outer-join fill */
        node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
        continue;
    }

    ExecHashGetBucketAndBatch(hashtable, hashvalue,
                              &node->hj_CurBucketNo, &batchno);
    node->hj_CurSkewBucketNo = ExecHashGetSkewBucket(hashtable, hashvalue);

Not sure how expensive those two functions (ExecHashGetBucketAndBatch 
and ExecHashGetSkewBucket) are, but this seems like a good idea. Haven't 
tried that, though, so maybe HJ_FILL_OUTER_TUPLE needs that info or 
something like that.

Makes sense?

>
> Also, are you aware of this?
>
> http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf
>
> It talks about bloom filters for hash joins in PostgreSQL
> specifically. Interestingly, they talk about specific TPC-H queries.

Interesting. The way that paper uses bloom filters is very different 
from what I do in the patch. They build the bloom filters and then 
propagate them into the scan nodes to eliminate the tuples early.

The problem is they'd be facing mostly the same problems with sizing the 
bloom filter that I'm facing in the patch. The only way they managed to 
get around them is that they used a dataset of fixed size (1GB, which is 
a bit small), and 1kB or 8kB bloom filters.

I suspect they'd get much worse results with datasets of "reasonable" 
size (say, 100GB or more) because the fixed-size bloom filters would 
stop being effective.

>
> BTW, I think code like this needs better comments:

Yeah.

>
> +static BloomFilter
> +BloomFilterInit(double nrows, double error)
> +{
> +   /* perhaps we should round nbits to multiples of 8 ? */
> +   int nbits = ceil((nrows * log(error)) / log(1.0 / (pow(2.0, log(2.0)))));
> +   int nhashes = round(log(2.0) * nbits / nrows);
> +
> +   BloomFilter filter = palloc0(offsetof(BloomFilterData, data) +
> ((nbits + 7) / 8));
> +
> +   filter->nbits = nbits;
> +   filter->nhashes = nhashes;
>
> Have you experimentally verified that you get the expected probability
> of false positives in practice?

I have seen that when the bloom filter is properly sized, and the data 
set is not somehow skewed, the bloom filter has about the right false 
positive rate.
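
FWIW such a check is easy to repeat outside the executor - here is a 
small standalone toy (not the patch's code, but using sizing math 
equivalent to the quoted BloomFilterInit) that adds a million keys and 
then probes with keys that were never added:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <math.h>

    /* cheap 32-bit mixer (the MurmurHash3 finalizer) used as hash function */
    static uint32_t
    mix32(uint32_t x)
    {
        x ^= x >> 16;
        x *= 0x85ebca6b;
        x ^= x >> 13;
        x *= 0xc2b2ae35;
        x ^= x >> 16;
        return x;
    }

    int
    main(void)
    {
        double      nrows = 1000000;
        double      error = 0.01;

        /* sizing equivalent to the formulas in the quoted BloomFilterInit */
        uint32_t    nbits = (uint32_t) ceil(-nrows * log(error) /
                                            (log(2.0) * log(2.0)));
        int         nhashes = (int) round(log(2.0) * nbits / nrows);
        uint8_t    *data = calloc((nbits + 7) / 8, 1);
        uint32_t    key;
        int         i,
                    probes = 1000000,
                    false_hits = 0;

        /* add keys 0 .. nrows-1 */
        for (key = 0; key < (uint32_t) nrows; key++)
        {
            uint32_t    h1 = mix32(key);
            uint32_t    h2 = mix32(key ^ 0xdeadbeef) | 1;

            for (i = 0; i < nhashes; i++)
            {
                uint32_t    bit = (h1 + (uint32_t) i * h2) % nbits;

                data[bit / 8] |= (uint8_t) (1 << (bit % 8));
            }
        }

        /* probe with keys that were never added, count false positives */
        for (key = (uint32_t) nrows; key < (uint32_t) nrows + probes; key++)
        {
            uint32_t    h1 = mix32(key);
            uint32_t    h2 = mix32(key ^ 0xdeadbeef) | 1;
            int         hit = 1;

            for (i = 0; i < nhashes && hit; i++)
            {
                uint32_t    bit = (h1 + (uint32_t) i * h2) % nbits;

                hit = (data[bit / 8] >> (bit % 8)) & 1;
            }
            false_hits += hit;
        }

        printf("expected %.2f%%, observed %.2f%% false positives\n",
               100.0 * error, 100.0 * false_hits / probes);
        free(data);
        return 0;
    }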

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: bloom filter in Hash Joins with batches

From
Tomas Vondra
Date:
Hi,

On 01/10/2016 05:11 AM, Peter Geoghegan wrote:
> On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Which means the "dim.r" column has 100 different values (0-99) with uniform
>> distribution. So e.g. "WHERE r < 15" matches 15%.
>
> I think that the use of a uniform distribution to demonstrate this
> patch is a bad idea, unless you want to have a conversation about the
> worst case.
>
> Look at the use cases for bloom filters in general. They're almost
> all some variation on the same theme: checking a bloom filter
> inexpensively, to avoid an expensive cache miss (of some fashion).

Right.

> The Google Chrome web browser uses a bloom filter for malicious URLs.

FWIW I don't think Chrome is using a bloom filter for this purpose 
anymore. Chromium certainly does not (it's using a PrefixSet instead).

> It usually avoids consulting Google's servers about whether or not
> any given URL that is visited is malicious by using the bloom filter.
> This is effective in large part because the top 100 websites ranked
> by popularity have a huge falloff in popularity as you go down the
> list. It looks like a Zipfian distribution, and so I imagine they get
> pretty far with a very small bloom filter, even though in theory the
> chances of any given valid URL resulting in a cache miss are very
> high.

I'm not familiar with how Chrome used bloom filters, but I'd expect them 
to be very careful about false positives (and also false negatives, as 
the bloom filter can't contain all malicious URLs).

My assumption is that they've been able to make that work because they 
do have detailed stats about how frequently people visit those URLs, and 
use that when building the bloom filter. But we don't have such 
information in hashjoin.

>
> Generally, uniform distributions are rare in a non-canonical list of
> things, like a hash join outer relation's attribute.

Well, I'm not claiming that testing a uniform distribution is enough, but 
surely it's one of the cases we should handle just fine.

The problem with non-uniform cases is that it really depends on the 
outer side of the join.

For example let's say the hash table contains 1000 values, and the bloom 
filter has a 1% false positive rate. But let's assume that the outer side 
has a value that triggers a false positive, and that this value makes up 
99% of the outer table. Suddenly, you have a 99% false positive rate, 
rendering the bloom filter pointless.

I don't think this is fixable while creating the bloom filter. All we 
can do is watch the bloom filter lookups and disable the bloom filter 
once the false positive rate reaches some threshold.
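
Something along these lines, perhaps - just a sketch, the struct name, 
the sample size and the 5% threshold are all made up:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Hypothetical run-time guard: the caller bumps 'lookups' for every
     * probe that consults the filter, and 'false_hits' whenever the
     * filter said "maybe" but the hash table lookup then found no match.
     * Once the observed false positive rate crosses a threshold, stop
     * consulting the filter.
     */
    typedef struct BloomFilterStats
    {
        bool        enabled;
        uint64_t    lookups;
        uint64_t    false_hits;
    } BloomFilterStats;

    static bool
    bloom_filter_still_useful(BloomFilterStats *stats)
    {
        /* wait for a reasonable sample before judging the filter */
        if (stats->enabled && stats->lookups > 10000 &&
            stats->false_hits > 0.05 * (double) stats->lookups)
            stats->enabled = false;

        return stats->enabled;
    }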

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: bloom filter in Hash Joins with batches

From
David Rowley
Date:
On 11 January 2016 at 09:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

On 01/10/2016 04:03 AM, Peter Geoghegan wrote:
On Sat, Jan 9, 2016 at 4:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
Also, are you aware of this?

http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf

It talks about bloom filters for hash joins in PostgreSQL
specifically. Interestingly, they talk about specific TPC-H queries.

Interesting. The way that paper uses bloom filters is very different 
from what I do in the patch. They build the bloom filters and then 
propagate them into the scan nodes to eliminate the tuples early.


That does sound interesting, but unless I'm somehow mistaken, I guess to 
do that you'd have to abandon the more efficient hashing of the hash 
value that you're doing in the current patch, and hash the complete 
values in the scan node, then hash them again if they make it into the 
hash join node. That does not sound like it would be a win when hashing 
longer varlena values.
 
--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: bloom filter in Hash Joins with batches

From
Alvaro Herrera
Date:
I'm closing this as returned-with-feedback; AFAICS even the last version
submitted is still in the research stage.  Please resubmit once you make
further progress.

Thanks,

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services