Thread: WIP: bloom filter in Hash Joins with batches
Hi,

while working on the Hash Join improvements, I've been repeatedly running into the idea of a bloom filter - various papers on hash joins mention bloom filters as a way to optimize access to the hash table by doing fewer lookups, etc.

Sadly, I've been unable to actually see any real benefit of using a bloom filter, which I believe is mostly due to NTUP_PER_BUCKET=1, which makes the lookups much more efficient (so the room for bloom filter improvements is narrow).

The one case where a bloom filter might still help is when the bloom filter fits into L3 cache (a few MBs) while the hash table (or more accurately the buckets) does not. Then there's a chance that the bloom filter (which needs to do multiple lookups) might help.

But I think there's another case where a bloom filter might be way more useful in Hash Join - when we do batching. What we do currently is that we simply

1) build the batches for the hash table (inner relation)

2) read the outer relation (usually the larger one), and split it
   into batches just like the hash table

3) while doing (2) we join the first batch, and write the remaining
   batches to disk (temporary files)

4) we read the batches one by one (for both tables) and do the join

Now, imagine that only some of the rows in the outer table actually match a row in the hash table. Currently, we do write those rows into the temporary file, but with a bloom filter on the whole hash table (all the batches at once) we can skip that for some types of joins.

For an inner join we can immediately discard the outer row, for a left join we can immediately output the row. In both cases we can completely eliminate the overhead of writing the tuple to the temporary file and then reading it again.

The attached patch is a PoC of this approach - I'm pretty sure it's not perfectly correct (e.g. I only tried it with inner join), but it's good enough for demonstrating the benefits. It's rather incomplete (see the end of this e-mail), and I'm mostly soliciting some early feedback at this point.

The numbers presented here are for a test case like this:

CREATE TABLE dim (id INT, dval TEXT);
CREATE TABLE fact (id INT, fval TEXT);

INSERT INTO dim SELECT i, md5(i::text)
                  FROM generate_series(1,10000000) s(i);

-- repeat 10x
INSERT INTO fact SELECT * FROM dim;

and a query like this

SELECT COUNT(fval) FROM fact JOIN dim USING (id) WHERE dval < 'a';

with different values in the WHERE condition to select a fraction of the inner 'dim' table - this directly affects what portion of the 'fact' table has a matching row, and also the size of the hash table (and the number of batches).

Now, some numbers from a machine with 8GB of RAM (the 'fact' table has ~6.5GB of data, so there's actually quite a bit of memory pressure, forcing the temp files to disk etc.).

With work_mem=16MB, it looks like this:

batches  filter  select.    bloom   master   bloom/master
-----------------------------------------------------------
      4       1    6.25%    23871    48631         49.09%
      8       2   12.50%    25752    56692         45.42%
      8       3   18.75%    31273    57455         54.43%
     16       4   25.01%    37430    62325         60.06%
     16       5   31.25%    39005    61143         63.79%
     16       6   37.50%    46157    63533         72.65%
     16       7   43.75%    53500    65483         81.70%
     32       8   49.99%    53952    65730         82.08%
     32       9   56.23%    55187    67521         81.73%
     32       a   62.49%    64454    69448         92.81%
     32       b   68.73%    66937    71692         93.37%
     32       c   74.97%    73323    72060        101.75%
     32       d   81.23%    76703    73513        104.34%
     32       e   87.48%    81970    74890        109.45%
     32       f   93.74%    86102    76257        112.91%

The 'batches' means how many batches were used for the join, 'filter' is the value used in the WHERE condition, selectivity is the fraction of the 'dim' table that matches the condition (and also the 'fact'). Bloom and master are timings of the query in milliseconds, and bloom/master is a comparison of the runtimes - so for example 49% means the hash join with bloom filter was running ~2x as fast.

Admittedly, work_mem=16MB is quite low, but that's just a way to force batching. What really matters is the number of batches and the selectivity (how many tuples we can eliminate using the bloom filter).

For work_mem=64MB it looks like this:

batches  filter  select.    bloom   master   bloom/master
-----------------------------------------------------------
      1       1    6.25%    24846    23854        104.16%
      2       2   12.50%    24369    45672         53.36%
      2       3   18.75%    30432    47454         64.13%
      4       4   25.01%    36175    59741         60.55%
      4       5   31.25%    43103    62631         68.82%
      4       6   37.50%    48179    64079         75.19%

So initially it's a bit slower (it's not doing any batching in this case, but the code is a bit silly and while not building the bloom filter it still does some checks). But once we start batching, it gets 2x as fast again, and then slowly degrades as the selectivity increases.

Attached is a spreadsheet with results for various work_mem values, and also with a smaller data set (just 30M rows in the fact table), which easily fits into memory. Yet it shows similar gains, shaving off ~40% in the best case, suggesting that this is not just thanks to reduction of I/O when forcing the temp files to disk.

As I mentioned, the patch is incomplete in several ways:

1) It does not count the bloom filter (which may be quite big) into
   work_mem properly.

2) It probably does not work for outer joins at this point.

3) Currently the bloom filter is used whenever we do batching, but it
   should really be driven by selectivity too - it'd be good to (a)
   estimate the fraction of 'fact' tuples having a match in the hash
   table, and not to do bloom if it's over ~60% or so. Also, maybe
   the code should count the matches at runtime, and disable the
   bloom filter if we reach some threshold.

But overall, this seems like a nice optimization opportunity for hash joins on large data sets, where batching is necessary.

Ideas?

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
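To make the batching idea above concrete, here is a minimal standalone C sketch of the mechanism: build a filter over the hash values of the inner relation (all batches), then consult it while splitting the outer relation so that non-matching rows are discarded instead of being written to a batch file. This is purely illustrative - the filter layout, the toy mixing function and the names bloom_add()/bloom_test() are invented for the example, not taken from the patch.

    #include <stdint.h>
    #include <stdio.h>

    #define NBITS   (1 << 20)       /* 1 Mbit filter, ~128 kB */
    #define NHASHES 4

    static unsigned char filter[NBITS / 8];

    /* derive the k-th probe position from a single 32-bit hash value */
    static uint32_t
    probe(uint32_t hash, int k)
    {
        /* cheap double hashing: combine the hash with a rotated copy */
        uint32_t h2 = (hash >> 17) | (hash << 15);

        return (hash + (uint32_t) k * (h2 | 1)) % NBITS;
    }

    static void
    bloom_add(uint32_t hash)
    {
        for (int k = 0; k < NHASHES; k++)
        {
            uint32_t bit = probe(hash, k);

            filter[bit / 8] |= (1 << (bit % 8));
        }
    }

    static int
    bloom_test(uint32_t hash)
    {
        for (int k = 0; k < NHASHES; k++)
        {
            uint32_t bit = probe(hash, k);

            if (!(filter[bit / 8] & (1 << (bit % 8))))
                return 0;       /* definitely not in the hash table */
        }
        return 1;               /* possibly in the hash table */
    }

    int
    main(void)
    {
        int kept = 0, dropped = 0;

        /* "inner relation": add the hash values of 100k keys */
        for (uint32_t i = 0; i < 100000; i++)
            bloom_add(i * 2654435761u);

        /* "outer relation": only rows passing the filter would be batched */
        for (uint32_t i = 0; i < 1000000; i++)
        {
            if (bloom_test(i * 2654435761u))
                kept++;         /* write to batch file / join */
            else
                dropped++;      /* inner join: discard immediately */
        }
        printf("kept=%d dropped=%d\n", kept, dropped);
        return 0;
    }

The point is only that a cheap, possibly-false-positive membership test over the inner side's hash values lets the outer scan drop most non-matching rows before they are ever serialized to a temp file.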
On Tue, Dec 15, 2015 at 11:30 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Attached is a spreadsheet with results for various work_mem values, and also with a smaller data set (just 30M rows in the fact table), which easily fits into memory. Yet it shows similar gains, shaving off ~40% in the best case, suggesting that this is not just thanks to reduction of I/O when forcing the temp files to disk.
A neat idea! Have you possibly tried to also collect statistics about actual false-positive rates and filter allocation sizes at every one of the collected data points?
--
Alex
On 15 December 2015 at 22:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
--
3) Currently the bloom filter is used whenever we do batching, but it
should really be driven by selectivity too - it'd be good to (a)
estimate the fraction of 'fact' tuples having a match in the hash
table, and not to do bloom if it's over ~60% or so. Also, maybe
the code should count the matches at runtime, and disable the
bloom filter if we reach some threshold.
Cool results.
It seems a good idea to build the bloom filter always, then discard it if it would be ineffective.
My understanding is that the bloom filter would be ineffective in any of these cases
* Hash table is too small
* Bloom filter too large
* Bloom selectivity > 50% - perhaps that can be applied dynamically, so stop using it if it becomes ineffective
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 12/17/2015 10:50 AM, Shulgin, Oleksandr wrote:
> On Tue, Dec 15, 2015 at 11:30 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>> Attached is a spreadsheet with results for various work_mem
>> values, and also with a smaller data set (just 30M rows in the fact
>> table), which easily fits into memory. Yet it shows similar gains,
>> shaving off ~40% in the best case, suggesting that this is not just
>> thanks to reduction of I/O when forcing the temp files to disk.
>
> A neat idea! Have you possibly tried to also collect statistics
> about actual false-positive rates and filter allocation sizes at
> every one of the collected data points?

The patch counts and prints the total number of lookups, and the number of eliminated rows. The bloom filter is currently sized for a 5% false positive rate, and the numbers I've seen match that.

I think ultimately we'll need to measure the false positive rate, so that we can use it to dynamically disable the bloom filter if it gets inefficient. Also maybe put some of that into EXPLAIN ANALYZE.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/17/2015 11:44 AM, Simon Riggs wrote:
>
> My understanding is that the bloom filter would be ineffective in any of
> these cases
> * Hash table is too small

Yes, although it depends what you mean by "too small".

Essentially if we can do with a single batch, then it's cheaper to do a single lookup in the hash table instead of multiple lookups in the bloom filter. The bloom filter might still win if it fits into L3 cache, but that seems rather unlikely.

> * Bloom filter too large

Too large with respect to what?

One obvious problem is that the bloom filter is built for all batches at once, i.e. for all tuples, so it may be so big it won't fit into work_mem (or takes a significant part of it). Currently it's not accounted for, but that'll need to change.

> * Bloom selectivity > 50% - perhaps that can be applied dynamically,
>   so stop using it if it becomes ineffective

Yes. I think doing some preliminary selectivity estimation should not be difficult - that's pretty much what calc_joinrel_size_estimate() already does.

Doing that dynamically is also possible, but quite tricky. Imagine for example the outer relation is sorted - in that case we may get long sequences of the same value (hash), and all of them will either have a match in the inner relation, or not have a match. That may easily skew the counters used for disabling the bloom filter dynamically.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas,

have you seen
http://www.postgresql.org/message-id/4B4DD67F.9010506@sigaev.ru

I have very limited internet connection (no graphics), so I may miss something
On Wed, Dec 16, 2015 at 4:15 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,
while working on the Hash Join improvements, I've been repeatedly running into the idea of bloom filter - various papers on hash joins mention bloom filters as a way to optimize access to the hash table by doing fewer lookups, etc.
Sadly, I've been unable to actually see any real benefit of using a bloom filter, which I believe is mostly due to NTUP_PER_BUCKET=1, which makes the lookups much more efficient (so the room for bloom filter improvements is narrow).
The one case where bloom filter might still help, and that's when the bloom filter fits into L3 cache (a few MBs) while the hash table (or more accurately the buckets) do not. Then there's a chance that the bloom filter (which needs to do multiple lookups) might help.
But I think there's another case where bloom filter might be way more useful in Hash Join - when we do batching. What we do currently is that we simply
1) build the batches for the hash table (inner relation)
2) read the outer relation (usually the larger one), and split it
into batches just like the hash table
3) while doing (2) we join the first batch, and write the remaining
batches to disk (temporary files)
4) we read the batches one by one (for both tables) and do the join
Now, imagine that only some of the rows in the outer table actually match a row in the hash table. Currently, we do write those rows into the temporary file, but with a bloom filter on the whole hash table (all the batches at once) we can skip that for some types of joins.
For inner join we can immediately discard the outer row, for left join we can immediately output the row. In both cases we can completely eliminate the overhead with writing the tuple to the temporary file and then reading it again.
The attached patch is a PoC of this approach - I'm pretty sure it's not perfectly correct (e.g. I only tried it with inner join), but it's good enough for demonstrating the benefits. It's rather incomplete (see the end of this e-mail), and I'm mostly soliciting some early feedback at this point.
The numbers presented here are for a test case like this:
CREATE TABLE dim (id INT, dval TEXT);
CREATE TABLE fact (id INT, fval TEXT);
INSERT INTO dim SELECT i, md5(i::text)
FROM generate_series(1,10000000) s(i);
-- repeat 10x
INSERT INTO fact SELECT * FROM dim;
and a query like this
SELECT COUNT(fval) FROM fact JOIN dim USING (id) WHERE dval < 'a';
with different values in the WHERE condition to select a fraction of the inner 'dim' table - this directly affects what portion of the 'fact' table has a matching row, and also the size of the hash table (and number of batches).
Now, some numbers from a machine with 8GB of RAM (the 'fact' table has ~6.5GB of data, so there's actually quite a bit of memory pressure, forcing the temp files to disk etc.).
With work_mem=16MB, it looks like this:
batches filter select. bloom master bloom/master
-----------------------------------------------------------
4 1 6.25% 23871 48631 49.09%
8 2 12.50% 25752 56692 45.42%
8 3 18.75% 31273 57455 54.43%
16 4 25.01% 37430 62325 60.06%
16 5 31.25% 39005 61143 63.79%
16 6 37.50% 46157 63533 72.65%
16 7 43.75% 53500 65483 81.70%
32 8 49.99% 53952 65730 82.08%
32 9 56.23% 55187 67521 81.73%
32 a 62.49% 64454 69448 92.81%
32 b 68.73% 66937 71692 93.37%
32 c 74.97% 73323 72060 101.75%
32 d 81.23% 76703 73513 104.34%
32 e 87.48% 81970 74890 109.45%
32 f 93.74% 86102 76257 112.91%
The 'batches' means how many batches were used for the join, 'filter' is the value used in the WHERE condition, selectivity is the fraction of the 'dim' table that matches the condition (and also the 'fact'). Bloom and master are timings of the query in milliseconds, and bloom/master is a comparison of the runtimes - so for example 49% means the hash join with bloom filter was running ~2x as fast.
Admittedly, work_mem=16MB is quite low, but that's just a way to force batching. What really matters is the number of batches and selectivity (how many tuples we can eliminate using the bloom filter).
For work_mem=64MB it looks like this:
batches filter select. bloom master bloom/master
-----------------------------------------------------------
1 1 6.25% 24846 23854 104.16%
2 2 12.50% 24369 45672 53.36%
2 3 18.75% 30432 47454 64.13%
4 4 25.01% 36175 59741 60.55%
4 5 31.25% 43103 62631 68.82%
4 6 37.50% 48179 64079 75.19%
So initially it's a bit slower (it's not doing any batching in this case, but the code is a bit silly and while not building the bloom filter it still does some checks). But once we start batching, it gets 2x as fast again, and then slowly degrades as the selectivity increases.
Attached is a spreadsheet with results for various work_mem values, and also with a smaller data set (just 30M rows in the fact table), which easily fits into memory. Yet it shows similar gains, shaving off ~40% in the best case, suggesting that this is not just thanks to reduction of I/O when forcing the temp files to disk.
As I mentioned, the patch is incomplete in several ways:
1) It does not count the bloom filter (which may be quite big) into
work_mem properly.
2) It probably does not work for outer joins at this point.
3) Currently the bloom filter is used whenever we do batching, but it
should really be driven by selectivity too - it'd be good to (a)
estimate the fraction of 'fact' tuples having a match in the hash
table, and not to do bloom if it's over ~60% or so. Also, maybe
the code should count the matches at runtime, and disable the
bloom filter if we reach some threshold.
But overall, this seems like a nice optimization opportunity for hash joins on large data sets, where batching is necessary.
Ideas?
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 12/20/2015 05:46 AM, Oleg Bartunov wrote:
> Tomas,
>
> have you seen
> http://www.postgresql.org/message-id/4B4DD67F.9010506@sigaev.ru
>
> I have very limited internet connection (no graphics), so I may miss
> something

I haven't seen that, but I don't really see how that's related - your post is about indexes, mine is about building temporary bloom filters when executing hash joins.

FWIW, I think bloom filters should be easy to add to BRIN indexes, as another type of 'summary'. That should address most of the missing pieces in your implementation (e.g. WAL).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello, Tomas.

Great idea! Did you consider caching the bloom filter, or at least part(s) of it, somehow? I think this way we could gain some more TPS. This is of course assuming that creating the bloom filter is really a bottleneck here, which would be nice to investigate first.

Best regards,
Aleksander
On 17 December 2015 at 16:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
--
On 12/17/2015 11:44 AM, Simon Riggs wrote:
My understanding is that the bloom filter would be ineffective in any of
these cases
* Hash table is too small
Yes, although it depends what you mean by "too small".
Essentially if we can do with a single batch, then it's cheaper to do a single lookup in the hash table instead of multiple lookups in the bloom filter. The bloom filter might still win if it fits into L3 cache, but that seems rather unlikely.
* Bloom filter too large
Too large with respect to what?
One obvious problem is that the bloom filter is built for all batches at once, i.e. for all tuples, so it may be so big won't fit into work_mem (or takes a significant part of it). Currently it's not accounted for, but that'll need to change.
The benefit seems to be related to caching, or at least to the fact that memory speed is critical. If the hash table is too small, or the bloom filter too large, then there would be no benefit from performing the action (Lookup Bloom then maybe Lookup Hash) compared with just doing (Lookup Hash).
So the objective must be to get a Bloom Filter that is small enough that it lives in a higher/faster level of cache than the main Hash table. Or possibly that we separate that into a two stage process so that the first level can be applied by a GPU and then later checked against hash outside of a GPU.
I think you also need to consider whether we use a hash bloom filter or just simply apply an additional range predicate. The latter idea is similar to my earlier thoughts here http://www.postgresql.org/message-id/CA+U5nMLYf2cgbq+YUw-ArLBTcPrqanBf5QiFEC-PBRJCFzOngg@mail.gmail.com
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/24/2015 02:51 PM, Simon Riggs wrote:
> On 17 December 2015 at 16:00, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 12/17/2015 11:44 AM, Simon Riggs wrote:
>>> My understanding is that the bloom filter would be ineffective in any
>>> of these cases
>>> * Hash table is too small
>>
>> Yes, although it depends what you mean by "too small".
>>
>> Essentially if we can do with a single batch, then it's cheaper to do a
>> single lookup in the hash table instead of multiple lookups in the
>> bloom filter. The bloom filter might still win if it fits into L3
>> cache, but that seems rather unlikely.
>>
>>> * Bloom filter too large
>>
>> Too large with respect to what?
>>
>> One obvious problem is that the bloom filter is built for all batches
>> at once, i.e. for all tuples, so it may be so big it won't fit into
>> work_mem (or takes a significant part of it). Currently it's not
>> accounted for, but that'll need to change.
>
> The benefit seems to be related to caching, or at least to the fact
> that memory speed is critical. If the hash table is too small, or the
> bloom filter too large, then there would be no benefit from performing
> the action (Lookup Bloom then maybe Lookup Hash) compared with just
> doing (Lookup Hash).
>
> So the objective must be to get a Bloom Filter that is small enough
> that it lives in a higher/faster level of cache than the main Hash
> table. Or possibly that we separate that into a two stage process so
> that the first level can be applied by a GPU and then later checked
> against hash outside of a GPU.

I don't think that's quite true. Had the primary goal been to replace expensive hashtable lookups with cheap bloom filter checks, then sure - the size of the bloom filter would be crucial (namely getting the whole bloom filter into L3, which is an order of magnitude faster compared to RAM).

But that's not what the patch does - it aims to reduce the amount of data that needs to be serialized/deserialized when batching is necessary, trading those costs for the bloom filter overhead. The size of the bloom filter may still matter, but it's way less important because the (de)serialization overhead is much more expensive than the hash lookup.

FWIW I've been trying to use the bloom filters in the "traditional" way (doing "cheap" bloom checks before hashtable lookups), but I've been unable to get any measurable benefit. Either I was doing it wrong somehow, or perhaps NTUP_PER_BUCKET=1 made the hashtable too fast (it essentially means we need a single lookup to see if the entry exists, while the bloom filter needs several lookups, depending on the desired false positive ratio).

So I concluded that the attempt to use the bloom filter that way is futile, and that the batching case seems like the only scenario where the bloom filters may still be useful.

There's a number of additional problems with sizing the bloom filter to fit into L3:

(1) We have no idea how much L3 there is, although this might be fixed
    by either adding a GUC, reading it from /proc/cpuinfo, or just using
    some reasonable default (say, most Xeon CPUs today have ~20 MB).

(2) The L3 is not per-process, but shared by all processes, so we don't
    really have all the L3, but perhaps just L3/N where N is the number
    of backends.

(3) The size of the bloom filter depends on the number of items it
    should handle and the false positive rate. I've used 5% in the
    patch, but maybe it'd make sense to increase this in order to get a
    smaller filter?

Also, the bloom filter does not compete with the whole hash table, but "just" with the "buckets", as that's sufficient to determine if there's a tuple for a key or not. Let's do some math for a hash table with 100k rows (so a fairly small hash table):

* nbuckets = 131072 (2^17), so 1MB
* bloom filter (5% false positive rate [1]): 76kB, k=4 (see [1])

So both the buckets and the bloom filter fit into L3, and the bloom filter is unlikely to speed things up. The difference between the buckets and the bloom filter is about 10x, and both sizes scale about linearly (with 1M items it'll be ~10MB vs. 760kB), so there clearly is an area where the bloom filter fits into L3 while the buckets don't. Sadly I haven't been able to exploit that, but I'll try again.

In any case, none of this can help unless the bloom filter really eliminates many lookups.

[1] http://hur.st/bloomfilter?n=100000&p=0.05

> I think you also need to consider whether we use a hash bloom filter or
> just simply apply an additional range predicate. The latter idea is
> similar to my earlier thoughts here
> http://www.postgresql.org/message-id/CA+U5nMLYf2cgbq+YUw-ArLBTcPrqanBf5QiFEC-PBRJCFzOngg@mail.gmail.com

Interesting, but I don't really see how this is related to bloom filters?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
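As a cross-check of the sizes quoted above, the standard Bloom filter formulas m = -n * ln(p) / (ln 2)^2 and k = (m/n) * ln 2 can be evaluated directly. A standalone snippet (not code from the patch) reproducing the ~76kB / k=4 figures for n=100,000 and p=0.05:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double n = 100000;      /* distinct keys in the hash table */
        double p = 0.05;        /* target false positive rate */

        double m = ceil(-n * log(p) / (log(2) * log(2)));   /* bits */
        int    k = (int) round(m / n * log(2));             /* hash functions */

        /* for these inputs: m is roughly 623,000 bits (about 76 kB), k = 4 */
        printf("m = %.0f bits (~%.0f kB), k = %d\n", m, m / 8 / 1024, k);
        return 0;
    }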
On 18 December 2015 at 04:34, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
I think ultimately we'll need to measure the false positive rate, so that we can use it to dynamically disable the bloom filter if it gets inefficient. Also maybe put some of that into EXPLAIN ANALYZE.
I'm not so convinced that will be a good idea. What if the filter does not help much to start with, we disable it because of that, then we get some different probe values later in the scan which the bloom filter would have helped to eliminate earlier.
Maybe it would be better to, once the filter is built, simply count the number of 1 bits and only use the filter if there's less than <threshold> 1 bits compared to the size of the filter in bits. There's functionality in bms_num_members() to do this, and there's also __builtin_popcount() in newer version of GCC, which we could have some wrapper around, perhaps.
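A sketch of the bit-counting check described here might look like the following; this is a standalone illustration, and the helper name and the 0.5 threshold are assumptions rather than anything taken from the patch or from bms_num_members():

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /*
     * Decide whether a freshly built filter is worth keeping, based on the
     * fraction of bits that are set (the expected false positive rate grows
     * as the filter fills up).
     */
    static bool
    bloom_worth_keeping(const uint64_t *words, size_t nwords, double max_fill)
    {
        uint64_t nset = 0;

        for (size_t i = 0; i < nwords; i++)
            nset += (uint64_t) __builtin_popcountll(words[i]);

        /* e.g. max_fill = 0.5: drop the filter once half the bits are 1 */
        return (double) nset / (double) (nwords * 64) < max_fill;
    }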
On 12/28/2015 03:15 AM, David Rowley wrote:
> On 18 December 2015 at 04:34, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>> I think ultimately we'll need to measure the false positive rate, so
>> that we can use it to dynamically disable the bloom filter if it
>> gets inefficient. Also maybe put some of that into EXPLAIN ANALYZE.
>
> I'm not so convinced that will be a good idea. What if the filter does
> not help much to start with, we disable it because of that, then we get
> some different probe values later in the scan which the bloom filter
> would have helped to eliminate earlier.

Yes, that's true. This might happen quite easily, for example when the outer table is sorted - in that case what we'll see are (possibly long) sequences of the same value, which might bias the decision quite easily.

I don't know how to fix this perfectly (if at all possible). But maybe we don't really need to do that. The overall goal is to improve the overall performance, and this check (disabling the bloom filter) is meant to limit the loss when the filter returns "true" for most values (making it somewhat pointless).

The costs however are asymmetric - once we build the hash table (and build the filter), the cost of dropping the bloom filter is constant, as we simply lose the time we spent building it. But the cost of continuing to use an inefficient filter is O(N), where N is the number of lookups (i.e. rows of the outer table). So it's probably better to drop the filter a bit too early and sacrifice the small amount of time we spent building it, so that we have a chance of not being much slower than the current implementation.

That is not to say we can't come up with better solutions - for example we can count (or estimate) the number of distinct hash values we've seen, and only make the decision once we see enough of them. For example we could use HyperLogLog to do that, or simply count the number of times a value changed (hash value not equal to the previous value). That should be better than simply waiting for X lookups, but I'm sure it's still possible to come up with counter-examples.

> Maybe it would be better to, once the filter is built, simply count the
> number of 1 bits and only use the filter if there's less than
> <threshold> 1 bits compared to the size of the filter in bits. There's
> functionality in bms_num_members() to do this, and there's
> also __builtin_popcount() in newer version of GCC, which we could have
> some wrapper around, perhaps.

I don't really see how this is relevant to the previous point. The number of 1s in the bitmap can tell you the false positive rate for the bloom filter, not what fraction of lookups will be answered with "true".

So while this needs to be watched, so that we stop using the bloom filter when it gets too full (e.g. due to under-estimate), it's completely unrelated to the previous issue.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
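For concreteness, the "count how many times the hash value changed" idea from the message above could look roughly like this; it is a standalone sketch, and all names and thresholds are invented for illustration rather than taken from the patch:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct BloomProbeStats
    {
        uint32_t prev_hash;     /* hash value of the previous probe */
        bool     have_prev;     /* have we seen any probe yet? */
        uint64_t nchanges;      /* probes whose hash differed from the last one */
        uint64_t npassed;       /* of those, how many the filter answered "maybe" */
    } BloomProbeStats;

    /*
     * Track one probe; returns true once the filter looks ineffective and
     * should be dropped.  Runs of identical hash values (e.g. sorted outer
     * input) are counted only once, so they cannot bias the decision.
     */
    static bool
    bloom_track_probe(BloomProbeStats *st, uint32_t hash, bool filter_passed)
    {
        if (!st->have_prev || hash != st->prev_hash)
        {
            st->nchanges++;
            if (filter_passed)
                st->npassed++;
        }
        st->prev_hash = hash;
        st->have_prev = true;

        /* only decide after seeing enough distinct-looking values */
        if (st->nchanges >= 10000)
            return (double) st->npassed / (double) st->nchanges > 0.75;

        return false;
    }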
On 28 December 2015 at 23:23, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 12/28/2015 03:15 AM, David Rowley wrote:
Maybe it would be better to, once the filter is built, simply count the
number of 1 bits and only use the filter if there's less than
<threshold> 1 bits compared to the size of the filter in bits. There's
functionality in bms_num_members() to do this, and there's
also __builtin_popcount() in newer version of GCC, which we could have
some wrapper around, perhaps.
I don't really see how this is relevant to the previous point. The number of 1s in the bitmap can tell you the false positive rate for the bloom filter, not what fraction of lookups will be answered with "true".
So while this needs to be watched, so that we stop using the bloom filter when it gets too full (e.g. due to under-estimate), it's completely unrelated to the previous issue.
Why is it not related? this has got me confused. If a bloom filter has all of the bits set to 1, then it will never filter any Tuples right?
If so, then a filter with all 1 bits set should be thrown away, as it'll never help us, and the filter should generally become more worthwhile as it contains a higher ratio of 0 bits vs 1 bits. Of course we don't have a count of how many Tuples matched each bit, so this is based on the assumption that each bit matches an equal number of Tuples. Are you saying this is not an assumption that we should make?
On 12/28/2015 11:38 AM, David Rowley wrote:
> On 28 December 2015 at 23:23, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>> On 12/28/2015 03:15 AM, David Rowley wrote:
>>
>>> Maybe it would be better to, once the filter is built, simply count the
>>> number of 1 bits and only use the filter if there's less than
>>> <threshold> 1 bits compared to the size of the filter in bits. There's
>>> functionality in bms_num_members() to do this, and there's
>>> also __builtin_popcount() in newer version of GCC, which we could have
>>> some wrapper around, perhaps.
>>
>> I don't really see how this is relevant to the previous point. The
>> number of 1s in the bitmap can tell you the false positive rate for
>> the bloom filter, not what fraction of lookups will be answered with
>> "true".
>>
>> So while this needs to be watched, so that we stop using the bloom
>> filter when it gets too full (e.g. due to under-estimate), it's
>> completely unrelated to the previous issue.
>
> Why is it not related? this has got me confused. If a bloom filter has
> all of the bits set to 1, then it will never filter any Tuples right?

Because the false positive rate can be computed without having to look at the lookups. So it's inherently independent of the ordering of the outer relation, and so on.

> If so, then a filter with all 1 bits set should be thrown away, as
> it'll never help us, and the filter should generally become more
> worthwhile as it contains a higher ratio of 0 bits vs 1 bits. Of
> course we don't have a count of how many Tuples matched each bit, so
> this is based on the assumption that each bit matches an equal number
> of Tuples. Are you saying this is not an assumption that we should
> make?

Sure we should check that. All I'm saying is it has nothing to do with the first problem described in the first part of the e-mail.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 28 December 2015 at 23:44, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 12/28/2015 11:38 AM, David Rowley wrote:
If so, then a filter with all 1 bits set should be thrown away, as
it'll never help us, and the filter should generally become more
worthwhile as it contains a higher ratio of 0 bits vs 1 bits. Of
course we don't have a count of how many Tuples matched each bit, so
this is based on the assumption that each bit matches an equal number
of Tuples. Are you saying this is not an assumption that we should
make?
Sure we should check that. All I'm saying is it has nothing to do with the first problem described in the first part of the e-mail.
Okay. I was merely suggesting this method as an alternative to tracking and checking the usefulness of the filter during the hash probe. I assumed that tracking and checking the usefulness during the hash probe won't be free, and that it may be better if we can estimate or determine the expected usefulness of the filter before the probe stage, and throw it away before we waste cycles using it.
On 12/28/2015 11:52 AM, David Rowley wrote:
> On 28 December 2015 at 23:44, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>> On 12/28/2015 11:38 AM, David Rowley wrote:
>>
>>> If so, then a filter with all 1 bits set should be thrown away, as
>>> it'll never help us, and the filter should generally become more
>>> worthwhile as it contains a higher ratio of 0 bits vs 1 bits. Of
>>> course we don't have a count of how many Tuples matched each bit, so
>>> this is based on the assumption that each bit matches an equal number
>>> of Tuples. Are you saying this is not an assumption that we should
>>> make?
>>
>> Sure we should check that. All I'm saying is it has nothing to do
>> with the first problem described in the first part of the e-mail.
>
> Okay. I was merely suggesting this method as an alternative to tracking
> and checking the usefulness of the filter during the hash probe. I
> assumed that tracking and checking the usefulness during the hash probe
> won't be free, and that it may be better if we can estimate or
> determine the expected usefulness of the filter before the probe stage,
> and throw it away before we waste cycles using it.

Consider this example:

CREATE TABLE t (id INT);
INSERT INTO t SELECT i FROM generate_series(1,1000000) s(i);
ANALYZE;

SELECT * FROM t AS t1 JOIN t AS t2 ON (t1.id = t2.id)
 WHERE t1.id < 10000 AND t2.id < 10000;

This will be executed like this:

                               QUERY PLAN
-----------------------------------------------------------------------
 Hash Join  (cost=17046.26..34008.58 rows=94 width=8)
   Hash Cond: (t1.id = t2.id)
   ->  Seq Scan on t t1  (cost=0.00..16925.00 rows=9701 width=4)
         Filter: (id < 10000)
   ->  Hash  (cost=16925.00..16925.00 rows=9701 width=4)
         ->  Seq Scan on t t2  (cost=0.00..16925.00 rows=9701 width=4)
               Filter: (id < 10000)
(7 rows)

But of course the problem is that the two relations are (trivially) correlated, which means that in reality it works like this:

                        QUERY PLAN
---------------------------------------------------------
 Hash Join (actual rows=9999 loops=1)
   Hash Cond: (t1.id = t2.id)
   ->  Seq Scan on t t1 (actual rows=9999 loops=1)
         Filter: (id < 10000)
         Rows Removed by Filter: 990001
   ->  Hash (actual rows=9999 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 480kB
         ->  Seq Scan on t t2 (actual rows=9999 loops=1)
               Filter: (id < 10000)
               Rows Removed by Filter: 990001
 Planning time: 0.316 ms
 Execution time: 176.283 ms
(12 rows)

So while we have very good estimates on the scan nodes, the final estimate is off - we expect the bloom filter to eliminate ~99% of rows, while in reality 100% of the rows match (due to the correlation). And that's even if the bloom filter is "perfect" in the sense that it has a very low false positive probability etc.

This example illustrates that such cases can't really be resolved before actually doing the lookups. Which does not make the checks you propose pointless, but they simply address different cases (and I indeed planned to implement them).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

attached is v2 of the patch, with a number of improvements:

0) This relies on the other hashjoin patches (delayed build of buckets
   and batching), as it allows sizing the bloom filter.

1) enable_hashjoin_bloom GUC

   This is mostly meant for debugging and testing, not for committing.

2) Outer joins should be working fine now. That is, the results should
   be correct and faster, as the outer rows without matches should not
   be batched at all.

3) The bloom filter is now built for all hash joins, not just when
   batching is happening. I've been a bit skeptical about the
   non-batched cases, but it seems that I can get a sizable speedup
   (~20-30%, depending on the selectivity of the join).

4) The code is refactored quite a bit, adding BloomFilterData instead
   of just sprinkling the fields on HashJoinState or HashJoinTableData.

5) To size the bloom filter, we now use a HyperLogLog counter, which we
   now have in core thanks to the sorting improvements done by Peter
   Geoghegan. This allows making the bloom filter much smaller when
   possible. The patch also extends the HyperLogLog API a bit (which
   I'll submit to the CF independently).

There's a bunch of comments in the code, mostly with ideas about more possible improvements.

The main piece missing in the patch (IMHO) is optimizer code making decisions on whether to enable bloom filters for the hash join, based on cardinality estimates. And also the executor code disabling the bloom filter if it turns out to be inefficient. I don't think that's a major issue at this point, though, and I think it'll be easier to do based on testing the current patch.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Hi,

attached are results for some basic performance evaluation of the patch (and also the scripts used for the evaluation). I've used a simple join of two tables ("fact" and "dim"), with three different dataset sizes.

CREATE TABLE dim (id INT, r INT, val TEXT);
CREATE TABLE fact (id INT, val TEXT);

-- 1M rows into "dim"
INSERT INTO dim SELECT i, mod(i,100), md5(i::text)
                  FROM generate_series(1,1000000) s(i);

-- 10M rows into "fact"
INSERT INTO fact SELECT mod(i,1000000)+1, md5((mod(i,1000000)+1)::text)
                   FROM generate_series(1,10000000) s(i);

Which means the "dim.r" column has 100 different values (0-99) with uniform distribution. So e.g. "WHERE r < 15" matches 15%.

There are three dataset sizes:

1) small: dim 1M rows (73MB), fact 10M rows (650MB)

2) medium: dim 10M rows (730MB), fact 100M rows (6.5GB)

3) large: dim 5M rows (365MB), fact 250M rows (16GB)

The machine has 8GB of RAM, so "small" easily fits into RAM, "medium" is just at the border, and "large" is clearly over (especially when batching).

For each dataset size there are two queries - one with an inner and one with an outer join, with a filter on the smaller one ("dim") determining the "selectivity".

-- inner join
SELECT COUNT(dim.val), COUNT(fact.val)
  FROM fact JOIN dim ON (fact.id = dim.id)
 WHERE r < $1

-- outer join
SELECT COUNT(dim.val), COUNT(fact.val)
  FROM fact LEFT JOIN (SELECT * FROM dim WHERE r < $1) dim
    ON (fact.id = dim.id)

Those queries were executed with various work_mem sizes (to either use no batching or a different number of batches), and also with different filter values (to filter to 10%, 20%, 30%, ..., 100% of data).

After collecting data both on master and with all the patches applied, I've computed the speedup factor as

    (time with patches) / (time on master)

and plotted that as pivot tables with a heat map. Values < 1.0 (red) mean "bloom filter made the query faster", while values > 1.0 (blue) mean it slowed the query down.

Let's quickly skim through the results for each dataset size. The attached charts only show results for the "inner" queries, but the results for the "outer" queries are pretty much exactly the same.

1) small (hash-joins-bloom-1-10.png)
------------------------------------

There's a clear speedup as long as the hash join requires batching and the selectivity is below 50-60%. In the best case (10% selectivity, small work_mem and thus multiple batches) we save up to ~30% of time.

Once we get to non-batching mode, the bloom filter is ineffective no matter the selectivity. I assume this is because for small selectivities (e.g. 10%) it's simply cheaper to check the hash table, which is small and likely fits into L3 on the CPU (which is ~6MB on the i5-2500 CPU in the system).

2) medium (hash-joins-bloom-10-100.png)
---------------------------------------

The results are similar to the "small" results, except that the speedup is somewhat larger (up to 40%), and there's no sudden jump when the queries stop batching (work_mem=1GB is enough to run the query in a single batch even with 100% selectivity).

The "break-even" selectivity is again somewhere around 50% (a bit lower in non-batching mode).

3) large (hash-joins-bloom-5-250.png)
-------------------------------------

Pretty much the same as "medium".

So, this seems to bring a reasonable speedup, as long as the selectivity is below 50% and the data set is sufficiently large. It's however clear that the patch won't work without some sort of selectivity estimation - either at planning time or at execution time (or both). I've speculated about how to do that, but the current patches don't implement any of that yet and any ideas are welcome.

Another open question is sizing the bloom filter in the multi-batch case, which turns out to be quite tricky. What the patch does right now is estimate the number of distinct values to store in the filter based on the first batch (the other hashjoin patch in this CF makes this possible). That is, there's a HyperLogLog counter that gives us a (very accurate) ndistinct estimate for the first batch, and then we do (nbatch * ndistinct) to estimate the number of distinct values in the whole data set.

That estimate is of course rather naive, and often produces a number that is too high - for example we might have already seen all the distinct values in the first batch. So the result is a bloom filter much larger than necessary. Not sure how to fix this :-(

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
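To illustrate the sizing problem from the message above, and one possible (assumed, not implemented) way to bound it, a small standalone sketch:

    #include <stdio.h>

    /*
     * Estimate the number of distinct values to size the filter for, given
     * an (accurate) ndistinct estimate for the first batch.  The naive
     * scaling below is what the message above describes; capping it by the
     * planner's ndistinct estimate for the join key is merely one
     * hypothetical way to limit the overestimate.
     */
    static double
    estimate_filter_items(double ndistinct_batch0, int nbatch,
                          double planner_ndistinct)
    {
        double est = ndistinct_batch0 * nbatch;

        if (est > planner_ndistinct)
            est = planner_ndistinct;

        return est;
    }

    int
    main(void)
    {
        /* e.g. 50k distinct hashes in batch 0, 32 batches, 1M distinct keys */
        printf("%.0f\n", estimate_filter_items(50000, 32, 1000000));
        return 0;
    }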
On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> So, this seems to bring reasonable speedup, as long as the selectivity is
> below 50%, and the data set is sufficiently large.

What about semijoins? Apparently they can use bloom filters particularly effectively. Have you considered them as a special case?

Also, have you considered Hash join conditions with multiple attributes as a special case? I'm thinking of cases like this:

regression=# set enable_mergejoin = off;
SET
regression=# explain analyze select * from tenk1 o join tenk2 t on
o.twenty = t.twenty and t.hundred = o.hundred;
                                 QUERY PLAN
──────────────────────────────────────────────────────────────────────
 Hash Join  (cost=595.00..4103.00 rows=50000 width=488) (actual
time=12.086..1026.194 rows=1000000 loops=1)
   Hash Cond: ((o.twenty = t.twenty) AND (o.hundred = t.hundred))
   ->  Seq Scan on tenk1 o  (cost=0.00..458.00 rows=10000 width=244)
(actual time=0.017..4.212 rows=10000 loops=1)
   ->  Hash  (cost=445.00..445.00 rows=10000 width=244) (actual
time=12.023..12.023 rows=10000 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 2824kB
         ->  Seq Scan on tenk2 t  (cost=0.00..445.00 rows=10000
width=244) (actual time=0.006..3.453 rows=10000 loops=1)
 Planning time: 0.567 ms
 Execution time: 1116.094 ms
(8 rows)

(Note that while the optimizer has a slight preference for a merge join in this case, the plan I show here is a bit faster on my machine).

--
Peter Geoghegan
On Sat, Jan 9, 2016 at 4:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Also, have you considered Hash join conditions with multiple
> attributes as a special case? I'm thinking of cases like this:

Sorry, accidentally fat-fingered my enter key before I was finished drafting that mail. That example isn't useful, because there is no early/cheap elimination of tuples while scanning the outer relation.

My point about bloom filters on only one attribute (or perhaps multiple bloom filters on multiple attributes) is that you might be able to make the bloom filter more effective, particularly relative to the amount of memory available, by only building it for one attribute where that distinction (one attribute vs multiple) happens to exist.

Simon said: "So the objective must be to get a Bloom Filter that is small enough that it lives in a higher/faster level of cache than the main Hash table". I agree that that's really important here, although I did note that Tomas wasn't so sure, emphasizing the importance of avoiding serialization and deserialization overhead -- and certainly, that's why the patch helps some cases a lot. But Simon's point applies more to the worst case than the best or average cases, and evidently we have plenty of worrying about the worst case left to do here.

So, one attribute could have relatively low cardinality, and thus fewer distinct items in the bloom filter's set, making it far slower to degrade (i.e. have increased probability of false positives) as compared to a composite of two or more attributes, and yet perhaps not significantly less useful in terms of its ability to eliminate (de)serialization overhead. Or, with two bloom filters (one per attribute), one bloom filter could degrade far faster than the other (due to having more distinct values), often very unpredictably, and yet it wouldn't matter because we'd discard the bad one (maybe really early, during the scan of the inner relation, using HLL).

I notice that all these example queries involve less-than operators (inner relation tuples are thereby prevented from being entered into the hash table). It might not be a terrible idea to hash abbreviated keys (or even truncated abbreviated keys) for the bloom filter in certain cases, in particular when there is a correlation in the inner relation attributes (equi-joined attribute, and attribute that is other part of qual). There might not be a correlation, of course, but with some care hashing abbreviated keys may have virtually no additional overhead, making it worth doing despite the general uncertainty about any correlation existing. If you're going to accept that the bloom filter can't be very large, which I think you must, then hashing an entire value may be no better than hashing an abbreviated key (those 8 bytes tend to have a lot of entropy). This point is rather hand-wavy, though.

I suspect that BLOOM_ERROR_RATE isn't the most useful constraint here. Is the degree to which it helps in sympathetic cases at all commensurate with the degree to which it hurts in unsympathetic cases? In your "low selectivity" cases (the sympathetic cases), there are relatively few values represented in the main hash table, and therefore in the bloom filter, and so a smaller bloom filter results automatically anyway. But in the bad cases, the BLOOM_ERROR_RATE constraint just wastes memory bandwidth for no gain, I think. If the number of elements is so large that a reasonable fixed sized bloom filter has a poor false positive rate anyway, we may have already lost.

My sense is that we should try to make the bloom filter as cheap as possible more so than trying to make sure it's useful before much work has been done.

Is it okay that you don't treat MCV/skew values as special? Actually, looking at this code, I think I notice something:

@@ -238,6 +238,15 @@ ExecHashJoin(HashJoinState *node)
                                                          hashvalue);
                 node->hj_CurTuple = NULL;

+                /* If still in the first batch, we check the bloom filter. */
+                if ((hashtable->curbatch == 0) &&
+                    (! ExecHashBloomCheckValue(hashtable, hashvalue)))
+                {
+                    /* no matches; check for possible outer-join fill */
+                    node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
+                    continue;
+                }

Haven't we already established by now that the value of "node->hj_CurSkewBucketNo" may not be INVALID_SKEW_BUCKET_NO, and so the value certainly was found in the inner relation scan? Why bother doing anything with the bloom filter in that common case? (Note that I'm not suggesting that we don't establish "node->hj_CurSkewBucketNo" first).

Also, are you aware of this?

http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf

It talks about bloom filters for hash joins in PostgreSQL specifically. Interestingly, they talk about specific TPC-H queries.

BTW, I think code like this needs better comments:

+static BloomFilter
+BloomFilterInit(double nrows, double error)
+{
+    /* perhaps we should round nbits to multiples of 8 ? */
+    int nbits = ceil((nrows * log(error)) / log(1.0 / (pow(2.0, log(2.0)))));
+    int nhashes = round(log(2.0) * nbits / nrows);
+
+    BloomFilter filter = palloc0(offsetof(BloomFilterData, data) +
+                                 ((nbits + 7) / 8));
+
+    filter->nbits = nbits;
+    filter->nhashes = nhashes;

Have you experimentally verified that you get the expected probability of false positives in practice?

--
Peter Geoghegan
On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Which means the "dim.r" column has 100 different values (0-99) with uniform
> distribution. So e.g. "WHERE r < 15" matches 15%.

I think that the use of a uniform distribution to demonstrate this patch is a bad idea, unless you want to have a conversation about the worst case.

Look at the use cases for bloom filters in general. They're almost all some variation on the same theme: checking a bloom filter inexpensively, to avoid an expensive cache miss (of some fashion). The Google Chrome web browser uses a bloom filter for malicious URLs. It usually avoids consulting Google's servers about whether or not any given URL that is visited is malicious by using the bloom filter. This is effective in large part because the top 100 websites ranked by popularity have a huge falloff in popularity as you go down the list. It looks like a Zipfian distribution, and so I imagine they get pretty far with a very small bloom filter, even though in theory the chances of any given valid URL resulting in a cache miss is very high.

Generally, uniform distributions are rare in a non-canonical list of things, like a hash join outer relation's attribute.

--
Peter Geoghegan
Hi,

On 01/10/2016 01:08 AM, Peter Geoghegan wrote:
> On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> So, this seems to bring reasonable speedup, as long as the selectivity is
>> below 50%, and the data set is sufficiently large.
>
> What about semijoins? Apparently they can use bloom filters
> particularly effectively. Have you considered them as a special case?

You mean to handle them in a special way in the code, or just to benchmark semijoins (and not just regular joins)?

> Also, have you considered Hash join conditions with multiple
> attributes as a special case? I'm thinking of cases like this:
>
> regression=# set enable_mergejoin = off;
> SET
> regression=# explain analyze select * from tenk1 o join tenk2 t on
> o.twenty = t.twenty and t.hundred = o.hundred;
>                                  QUERY PLAN
> ──────────────────────────────────────────────────────────────────────
>  Hash Join  (cost=595.00..4103.00 rows=50000 width=488) (actual
> time=12.086..1026.194 rows=1000000 loops=1)
>    Hash Cond: ((o.twenty = t.twenty) AND (o.hundred = t.hundred))
>    ->  Seq Scan on tenk1 o  (cost=0.00..458.00 rows=10000 width=244)
> (actual time=0.017..4.212 rows=10000 loops=1)
>    ->  Hash  (cost=445.00..445.00 rows=10000 width=244) (actual
> time=12.023..12.023 rows=10000 loops=1)
>          Buckets: 16384  Batches: 1  Memory Usage: 2824kB
>          ->  Seq Scan on tenk2 t  (cost=0.00..445.00 rows=10000
> width=244) (actual time=0.006..3.453 rows=10000 loops=1)
>  Planning time: 0.567 ms
>  Execution time: 1116.094 ms
> (8 rows)
>
> (Note that while the optimizer has a slight preference for a merge
> join in this case, the plan I show here is a bit faster on my
> machine).

The patch I posted actually does not build the bloom filter on the values directly, but on the hashvalue we already use. So it handles hashjoins with an arbitrary number of attributes just fine.

Or perhaps you're thinking about some special optimization that might improve such cases (I can't think of any)?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 01/10/2016 04:03 AM, Peter Geoghegan wrote: > On Sat, Jan 9, 2016 at 4:08 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Also, have you considered Hash join conditions with multiple >> attributes as a special case? I'm thinking of cases like this: > > Sorry, accidentally fat-fingered my enter key before I was finished > drafting that mail. That example isn't useful, because there is no > early/cheap elimination of tuples while scanning the outer relation. > > My point about bloom filters on only one attribute (or perhaps > multiple bloom filters on multiple attributes) is that you might be > able to make the bloom filter more effective, particularly relative > to the amount of memory available, by only building it for one > attribute where that distinction (one attribute vs multiple) happens > to exist. I'm not sure what you mean. The patch builds bloom filter on hashvalue, not the attribute values directly. I find this very efficient as we don't have to hash the original values again and instead work with just a short 32-bit hashvalue. I really doubt doing some magic by only building the bloom filter on one of the attributes is useful - it's way more complicated, requires estimating the ndistinct for each attribute, and also somehow estimate the impact on the accuracy of the bloom filter. That seems rather complex and I don't have a good idea how to do that. > > Simon said: "So the objective must be to get a Bloom Filter that is > small enough that it lives in a higher/faster level of cache than the > main Hash table". I agree that that's really important here, although > I did note that Tomas wasn't so sure, emphasizing the importance of > avoiding serialization and deserialization overhead -- and certainly, > that's why the patch helps some cases a lot. I think the impact of the bloom filter size really depends on the size of the hash join. I see three cases, as indicated in the benchmarks I posted (a) single batch (1M:10M) - fitting into CPU cache really important (otherwise almost no improvement) - impossible to speed up for very small hash tables (that fit into L3 cache) (b) multiple batches w. medium data set (10M:100M) - important, but not as much as for (a), as we still eliminate the (de)serialization overhead (c) large data set (5M:250M) - not important, it's mostly about eliminating (de)serialization overhead and/or I/O overhead > But Simon's point applies more to the worst case than the best or> average cases, and evidently we have plenty of worryingabout the> worst case left to do here. So what do you consider to be the worst case? > So, one attribute could have relatively low cardinality, and thus > fewer distinct items in the bloom filter's set, making it far slower > to degrade (i.e. have increased probability of false positives) as > compared to a composite of two or more attributes, and yet perhaps > not significantly less useful in terms of its ability to eliminate > (de)serialization overhead. Or, with two bloom filters (one per > attribute), one bloom filter could degrade far faster than the other > (due to having more distinct values), often very unpredictably, and > yet it wouldn't matter because we'd discard the bad one (maybe > really early, during the scan of the inner relation, using HLL). I don't think so. Firstly, I don't know what you mean by "slower to degrade" as the false positive rate of a bloom filter does not depend on the cardinality of the column used to build the bloom filter, but rather on the "load factor" of the bloom filter (i.e. 
number of items added to the filter compared to the initial capacity). Not to mention that you still have to estimate the cardinality of the columns, which is tricky (and one of the main issues of the current patch, IMHO). Secondly, by splitting the "composite" bloom filter into per-column filters you make it impossible to track correlation between the columns. For example let's say we have two attributes (a,b), 'a' having 1000 distinct values while 'b' only has 10. But let's assume that there's a correlation between 'a' and 'b' so that (mod(a,10) = b). If you build the bloom filter on the columns separately, it's going to be entirely useless. Sure, this is a rather trivial (and perhaps naive) example, but it shows how easy it's to break it. I'd also like to point out that vast majority of joins is on a single column, so perhaps we should try to solve those first, and then perhaps improve the multi-column case if possible. > > I notice that all these example queries involve less-than operators > (inner relation tuples are thereby prevented from being entered into > the hash table). It might not be a terrible idea to hash abbreviated > keys (or even truncated abbreviated keys) for the bloom filter in > certain cases, in particular when there is a correlation in the inner > relation attributes (equi-joined attribute, and attribute that is > other part of qual). There might not be a correlation, of course, but > with some care hashing abbreviated keys may have virtually no > additional overhead, making it worth doing despite the general > uncertainty about any correlation existing. If you're going to accept > that the bloom filter can't be very large, which I think you must, > then hashing an entire value may be no better than hashing an > abbreviated key (those 8 bytes tend to have a lot of entropy). This > point is rather hand-wavy, though. I have to admit that I have zero idea on how to do that. Also, there's nothing really special about the '<' operator - the reason why I used it is that it makes it very simple to filter arbitrary fraction of the table (i.e. specify how many tuples of the outer relation have a matching tuple in the hash table). I'm not saying we can't apply the abbreviated keys somehow, but at this point it seems rather like a hammer looking for a nail. > > I suspect that BLOOM_ERROR_RATE isn't the most useful constraint > here. Is the degree to which it helps in sympathetic cases at all > commensurate with the degree to which it hurts in unsympathetic > cases? In your "low selectivity" cases (the sympathetic cases), there > are relatively few values represented in the main hash table, and > therefore in the bloom filter, and so a smaller bloom filter results > automatically anyway. But in the bad cases, the BLOOM_ERROR_RATE > constraint just wastes memory bandwidth for no gain, I think. If the > number of elements is so large that a reasonable fixed sized bloom > filter has a poor false positive rate anyway, we may have already > lost. I agree that having a fixed error rate is somewhat annoying. What I imagined would be something like this: (1) estimate the number of distinct values the bloom filter (2) see what error rate would make the bloom filter "small enough" (so that it fits into L3) (3) see if the error rate makes the bloom filter efficient (some simple costing, or perhaps hard-coded threshold) This however requires knowledge of the L3 cache size for (2), or rather how much of it is available for the process (which is tricky to get). 
> My sense is that we should try to make the bloom filter as cheap as
> possible more so than trying to make sure it's useful before much
> work has been done.

Well, yeah. Fail fast. But how?

> Is it okay that you don't treat MCV/skew values as special? Actually,
> looking at this code, I think I notice something:
>
> @@ -238,6 +238,15 @@ ExecHashJoin(HashJoinState *node)
>                                   hashvalue);
>               node->hj_CurTuple = NULL;
>
> +             /* If still in the first batch, we check the bloom filter. */
> +             if ((hashtable->curbatch == 0) &&
> +                 (! ExecHashBloomCheckValue(hashtable, hashvalue)))
> +             {
> +                 /* no matches; check for possible outer-join fill */
> +                 node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
> +                 continue;
> +             }
>
> Haven't we already established by now that the value of
> "node->hj_CurSkewBucketNo" may not be INVALID_SKEW_BUCKET_NO, and so
> the value certainly was found in the inner relation scan? Why bother
> doing anything with the bloom filter in that common case? (Note that
> I'm not suggesting that we don't establish "node->hj_CurSkewBucketNo"
> first).

Hmmmm, maybe. The bloom filter should contain all values, including those from skew buckets. So what the code could / should do is check the bloom filter first, and only do ExecHashGetSkewBucket() if we still think the tuple may be in the hash table. So something like this:

    node->hj_CurHashValue = hashvalue;
    node->hj_CurTuple = NULL;

    /* If still in the first batch, we check the bloom filter. */
    if ((hashtable->curbatch == 0) &&
        (! ExecHashBloomCheckValue(hashtable, hashvalue)))
    {
        /* no matches; check for possible outer-join fill */
        node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
        continue;
    }

    ExecHashGetBucketAndBatch(hashtable, hashvalue,
                              &node->hj_CurBucketNo, &batchno);
    node->hj_CurSkewBucketNo = ExecHashGetSkewBucket(hashtable, hashvalue);

Not sure how expensive those two functions (ExecHashGetBucketAndBatch and ExecHashGetSkewBucket) are, but this seems like a good idea. Haven't tried it, though, so maybe HJ_FILL_OUTER_TUPLE needs that info or something like that. Makes sense?

> Also, are you aware of this?
>
> http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf
>
> It talks about bloom filters for hash joins in PostgreSQL
> specifically. Interestingly, they talk about specific TPC-H queries.

Interesting. The way that paper uses bloom filters is very different from what I do in the patch. They build the bloom filters and then propagate them into the scan nodes to eliminate the tuples early.

The problem is that they'd be facing mostly the same problems with sizing the bloom filter as I'm facing in the patch. The only way they managed to get around them is that they used a data set of fixed size (1GB, which is a bit small) and 1kB or 8kB bloom filters. I suspect they'd get much worse results with data sets of "reasonable" size (say, 100GB or more), because the fixed-size bloom filters would stop being effective.

> BTW, I think code like this needs better comments:

Yeah.

> +static BloomFilter
> +BloomFilterInit(double nrows, double error)
> +{
> +    /* perhaps we should round nbits to multiples of 8 ? */
> +    int nbits = ceil((nrows * log(error)) / log(1.0 / (pow(2.0, log(2.0)))));
> +    int nhashes = round(log(2.0) * nbits / nrows);
> +
> +    BloomFilter filter = palloc0(offsetof(BloomFilterData, data) +
> +                                 ((nbits + 7) / 8));
> +
> +    filter->nbits = nbits;
> +    filter->nhashes = nhashes;
>
> Have you experimentally verified that you get the expected probability
> of false positives in practice?

I have seen that when the bloom filter is properly sized and the data set is not somehow skewed, the bloom filters have about the right false positive rate.
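For the record, the two expressions in BloomFilterInit are just the standard sizing formulas written with natural logarithms - the odd-looking denominator comes from 2^{ln 2} = e^{(ln 2)^2}. In the usual notation, with n = nrows, p = error, m = nbits and k = nhashes:

    m = \left\lceil \frac{-\,n \ln p}{(\ln 2)^2} \right\rceil,
    \qquad
    k = \mathrm{round}\!\left(\frac{m}{n}\,\ln 2\right),
    \qquad
    p \approx \left(1 - e^{-kn/m}\right)^{k}

The last expression is the expected false positive rate after inserting n items into a filter of m bits with k hash functions (assuming the hash functions behave independently), which is what a properly sized filter should roughly reproduce in practice.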
regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 01/10/2016 05:11 AM, Peter Geoghegan wrote:
> On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Which means the "dim.r" column has 100 different values (0-99) with uniform
>> distribution. So e.g. "WHERE r < 15" matches 15%.
>
> I think that the use of a uniform distribution to demonstrate this
> patch is a bad idea, unless you want to have a conversation about the
> worst case.
>
> Look at the use cases for bloom filters in general. They're almost
> all some variation on the same theme: checking a bloom filter
> inexpensively, to avoid an expensive cache miss (of some fashion).

Right.

> The Google Chrome web browser uses a bloom filter for malicious URLs.

FWIW I don't think Chrome is using a bloom filter for this purpose anymore. Chromium certainly does not (it's using a PrefixSet instead).

> It usually avoids consulting Google's servers about whether or not
> any given URL that is visited is malicious by using the bloom filter.
> This is effective in large part because the top 100 websites ranked
> by popularity have a huge falloff in popularity as you go down the
> list. It looks like a Zipfian distribution, and so I imagine they get
> pretty far with a very small bloom filter, even though in theory the
> chances of any given valid URL resulting in a cache miss is very
> high.

I'm not familiar with how Chrome used bloom filters, but I'd expect them to be very careful about false positives (and also false negatives, as the bloom filter can't contain all malicious URLs). My assumption is that they've been able to make that work because they do have detailed stats about how frequently people visit those URLs, and use that when building the bloom filter. But we don't have such information in a hash join.

> Generally, uniform distributions are rare in a non-canonical list of
> things, like a hash join outer relation's attribute.

Well, I'm not claiming that testing a uniform distribution is enough, but surely it's one of the cases we should handle just fine.

The problem with non-uniform cases is that it really depends on the outer side of the join. For example, let's say the hash table contains 1000 values and the bloom filter has a 1% false positive rate. But let's assume the outer side contains a value that triggers a false positive, and that this value makes up 99% of the outer table. Suddenly you have a 99% false positive rate, rendering the bloom filter pointless.

I don't think this is fixable while creating the bloom filter. All we can do is watch the bloom filter lookups and disable the bloom filter once the false positive rate reaches some threshold.
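That check could be quite simple - for example by tracking what fraction of lookups the filter actually eliminates, which is a cheap proxy for the false positive rate. Something along these lines, completely untested and not part of the patch - the counters, the enabled flag and the thresholds are made up, only ExecHashBloomCheckValue() exists in the current code:

/* don't decide on too few lookups */
#define BLOOM_MIN_LOOKUPS       10000

/* required fraction of outer tuples eliminated by the filter */
#define BLOOM_MIN_REJECT_RATE   0.25

static bool
ExecHashBloomCheckAdaptive(HashJoinTable hashtable, uint32 hashvalue)
{
    /* nlookups, nrejected and enabled are assumed fields, not in the patch */
    BloomFilter filter = hashtable->bloomFilter;

    /* filter already judged useless - pretend everything may match */
    if (!filter->enabled)
        return true;

    filter->nlookups++;

    /* the filter says "definitely not in the hash table" */
    if (!ExecHashBloomCheckValue(hashtable, hashvalue))
    {
        filter->nrejected++;
        return false;
    }

    /*
     * After enough lookups, see what fraction of outer tuples the filter
     * actually eliminates. If it's too low, stop consulting it - the memory
     * is already spent, but we save the per-tuple lookups.
     */
    if (filter->nlookups >= BLOOM_MIN_LOOKUPS &&
        (double) filter->nrejected / filter->nlookups < BLOOM_MIN_REJECT_RATE)
        filter->enabled = false;

    return true;
}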
regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On 11 January 2016 at 09:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> On 01/10/2016 04:03 AM, Peter Geoghegan wrote:
>> On Sat, Jan 9, 2016 at 4:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>
>> Also, are you aware of this?
>>
>> http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf
>>
>> It talks about bloom filters for hash joins in PostgreSQL
>> specifically. Interestingly, they talk about specific TPC-H queries.
>
> Interesting. The way that paper uses bloom filters is very different from what I do in the patch. They build the bloom filters and then propagate them into the scan nodes to eliminate the tuples early.
That does sound interesting, but unless I'm somehow mistaken, I guess to do that you'd have to abandon the more efficient hashing of the hash value that you're doing in the current patch, and instead hash the complete value in the scan node, then hash it again if it makes it into the hash join node. That does not sound like it would be a win when hashing longer varlena values.
I'm closing this as returned-with-feedback; AFAICS even the last version submitted is still in the research stage. Please resubmit once you make further progress.

Thanks,

--
Álvaro Herrera                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services