Re: Hash Joins vs. Bloom Filters / take 2 - Mailing list pgsql-hackers
| From | Peter Geoghegan |
|---|---|
| Subject | Re: Hash Joins vs. Bloom Filters / take 2 |
| Date | |
| Msg-id | CAH2-Wzn9xgoo4Xu624wKmsE4B20e77Ew0BW_d9tUor4OFr69JQ@mail.gmail.com |
| In response to | Hash Joins vs. Bloom Filters / take 2 (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
| Responses | Re: Hash Joins vs. Bloom Filters / take 2; Re: Hash Joins vs. Bloom Filters / take 2 |
| List | pgsql-hackers |
On Tue, Feb 20, 2018 at 1:23 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> In 2015/2016 I've been exploring if we could improve hash joins by
> leveraging bloom filters [1], and I was reminded about this idea in a
> thread about amcheck [2]. I also see that bloom filters were briefly
> mentioned in the thread about parallel hash [3].
>
> So I've decided to revive the old patch, rebase it to current master,
> and see if we can resolve the issues that killed it in 2016.

Cool.

> The issues are essentially about predicting when the bloom filter can
> improve anything, which is more a question of the hash join
> selectivity than the bloom filter parameters.

Absolutely. A Bloom filter could be considered an opportunistic thing. The false positive rate that you estimate is going to be based on the planner's rowcount estimate for the inner side of the join, which is liable to be wrong anyway. It's important to be resilient to misestimations, which Bloom filters are good at.

> This is not the same as "false positive rate", which is a feature of
> the bloom filter itself, and can be measured by simply counting bits
> set in the filter. That is important too, of course, but simple to
> determine.

Perhaps I'm being pedantic, but it's not exactly true that you can measure the false positive rate by counting the bits. I do agree with what I think you meant here, though, which is that the precise false positive rate is not necessarily that important, while the selectivity of the join is very important. In general, hash joins work best when the join qual is highly selective. This is not the same thing as having a small hash table, of course.

> After thinking about this a bit more, I think this is pretty much what
> we do during join cardinality estimation - we estimate how many of the
> rows in "fact" have a matching row in "dim". I do think we can reuse
> that to decide whether to use a bloom filter or not.
>
> Of course, the decision may need to be more elaborate (and perhaps
> formulated as part of costing), but the filter selectivity is the
> crucial piece. The attached patch does not do that yet, though.

I suspect that it could make sense to use a Bloom filter to summarize the entire inner side of the join all at once, even when there are multiple batches. I also suspect that this is particularly beneficial with parallel hash joins, where IPC/synchronization overhead can be a big problem.

> 4) Initially, the patch was aimed at hashjoins with batches, on the
> premise that it can save some of the serialize/deserialize costs. But
> as mentioned in the discussion, bloom filters can be beneficial even
> in the single-batch case, when three conditions are met:
>
> a) the join is selective (i.e. some rows don't have a match in the
> hash table)
>
> b) hash table > CPU cache
>
> c) bloom filter < CPU cache

Seems plausible. CPU cache efficiency is also affected by how many hash functions you end up using -- use too many, and it slows things down.

> We don't have a good way to determine the size of the CPU cache, but I
> think we can use some crude arbitrary limits. E.g. require that the
> hash table is larger than 8MB and the bloom filter is smaller than
> 4MB, or something like that.

FWIW, the logical way to size the Bloom filter is based on the number of (distinct) values, not based on the projected size of the hash table. The lossy nature of Bloom filters makes the average size of the original, hashed elements irrelevant to a sizing calculation that targets a certain final false positive rate.
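To make that sizing arithmetic concrete, here is a minimal standalone sketch of the standard Bloom filter sizing formulas -- this is illustrative only, not code from the patch, and the n and p values are just the example numbers I use below:

```c
/*
 * Standalone sketch of the textbook Bloom filter sizing formulas, for
 * a desired number of distinct elements n and target false positive
 * rate p:
 *
 *   bits            m = -n * ln(p) / (ln 2)^2
 *   hash functions  k = (m / n) * ln(2)
 *
 * Note that the size of the hashed elements never appears -- only
 * their count does.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		n = 40000000.0;	/* distinct values on the inner side */
	double		p = 0.10;		/* target false positive rate */

	double		m = ceil(-n * log(p) / pow(log(2.0), 2.0));
	int			k = (int) round((m / n) * log(2.0));

	printf("bits: %.0f (%.2f MB), hash functions: %d\n",
		   m, m / 8.0 / (1024.0 * 1024.0), k);
	/* prints roughly: bits: 191701168 (22.85 MB), hash functions: 3 */
	return 0;
}
```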
Making Bloom filter size any fixed fraction of hash table size seems wrong to me for that reason alone. You should try to exploit the fact that a Bloom filter can summarize a large set reasonably well with a very compact, simple representation.

A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but for cases where Bloom probes will save a lot of work, it probably doesn't make all that much difference -- the hash join is still much faster. If you're willing to accept a 10% false positive rate, then you can summarize a set of 40 million distinct values with only 22.85MB of memory and 3 hash functions. I think that the smallest possible hash table of 40 million elements would require far more memory than 22.85MB -- certainly closer to 20x more than to 8x more. Even that is pretty conservative. I bet there are plenty of hash joins where the ratio could be 30x or more. Not to mention cases with multiple batches.
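For a rough sense of that gap, a back-of-envelope sketch -- the 12 bytes per entry here is an assumed round number for key plus minimal header/pointer overhead, not a measurement of the actual hash table implementation:

```c
/*
 * Back-of-envelope comparison: a hash table entry can't realistically
 * cost less than its key plus some per-entry overhead, while the Bloom
 * filter above costs m/n bits (~4.8 bits at p = 0.10) per element
 * regardless of element size.
 */
#include <stdio.h>

int
main(void)
{
	double		n = 40000000.0;	/* elements, as above */
	double		bloom_mb = 22.85;	/* from the sizing sketch above */
	double		entry_bytes = 12.0;	/* assumed minimal per-entry cost */

	double		table_mb = n * entry_bytes / (1024.0 * 1024.0);

	printf("hash table: ~%.0f MB, bloom filter: %.2f MB, ratio: ~%.0fx\n",
		   table_mb, bloom_mb, table_mb / bloom_mb);
	/* with these assumptions: ~458 MB vs. 22.85 MB, ratio ~20x */
	return 0;
}
```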
--
Peter Geoghegan