Hash Joins vs. Bloom Filters / take 2
From:    Tomas Vondra
Subject: Hash Joins vs. Bloom Filters / take 2
Msg-id:  c902844d-837f-5f63-ced3-9f7fd222f175@2ndquadrant.com
List:    pgsql-hackers
Hi,

In 2015/2016 I've been exploring whether we could improve hash joins by leveraging bloom filters [1], and I was reminded of this idea in a thread about amcheck [2]. I also see that bloom filters were briefly mentioned in the thread about parallel hash [3].

So I've decided to revive the old patch, rebase it to current master, and see if we can resolve the issues that killed it in 2016. The issues are essentially about predicting when the bloom filter can improve anything, which is more a question of the hash join selectivity than of the bloom filter parameters.

Consider a join on tables with a FK on the join condition, that is, something like this:

    CREATE TABLE dim (id serial primary key, ...);
    CREATE TABLE fact (dim_id int references dim(id), ...);

Then if you do a join

    SELECT * FROM fact JOIN dim ON (id = dim_id);

the bloom filter can't really help, because all the dim_id values are guaranteed to be in the hash table. So it can only really help with queries like this:

    SELECT * FROM fact JOIN dim ON (id = dim_id) WHERE (dim.x = 1000);

where the additional conditions on dim make the hash table incomplete in some sense, i.e. the bloom filter will return false for some lookups, saving quite a bit of expensive work (a lookup in a large hash table, or spilling a tuple to disk).

This is not the same as the "false positive rate", which is a feature of the bloom filter itself and can be measured by simply counting the bits set in the filter. That is important too, of course, but simple to determine. The bloom filter "selectivity" is more a feature of the join, i.e. what fraction of the values looked up in the bloom filter are expected to be rejected. If many are rejected, the bloom filter helps. If few, it is a waste of time.
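For illustration, this is how the false positive rate follows from the number of bits set - a minimal C sketch, assuming a filter with nbits_total bits and nhashes hash functions (the function and parameter names are mine, not from the patch):

    #include <math.h>
    #include <stdint.h>

    static double
    bloom_false_positive_rate(uint64_t nbits_set, uint64_t nbits_total,
                              int nhashes)
    {
        /* fraction of bits currently set in the filter (fill factor) */
        double  fill = (double) nbits_set / (double) nbits_total;

        /* a missing value passes only if all k probed bits are set */
        return pow(fill, (double) nhashes);
    }

The selectivity, on the other hand, cannot be computed from the filter alone - it depends on which values the outer side actually probes for.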
After thinking about this a bit more, I think this is pretty much what we do during join cardinality estimation - we estimate how many of the rows in "fact" have a matching row in "dim". I do think we can reuse that to decide whether to use a bloom filter or not. Of course, the decision may need to be more elaborate (and perhaps formulated as part of costing), but the filter selectivity is the crucial piece. The attached patch does not do that yet, though.

The attached patch:

1) Only works in the non-parallel case for now. Fixing this should not be a big deal, once the costing/planning gets addressed.

2) Adds details about the bloom filter to EXPLAIN ANALYZE output, with various important metrics (size of the filter, number of hash functions, lookups/matches, fill factor).

3) Removes the HLL counter. I came to the conclusion that using HLL to size the bloom filter makes very little sense, because there are about three cases:

   a) no batching (hash table fits into work_mem)

   We can easily walk the hash table and compute a "good enough" ndistinct estimate by counting the occupied buckets (or just using the number of tuples in the hash table, assuming the values are distinct). At this point, the patch does not build the bloom filter in single-batch cases, unless forced to do so by setting force_hashjoin_bloom=true. More about this later.

   b) batching from the beginning

   The HLL is useless, because we need to initialize the bloom filter before actually seeing any values. Instead, the patch simply uses the expected number of rows (assuming those are distinct).

   c) starting in single-batch mode, switching to batches later

   We could use HLL to estimate the number of distinct values in the initial batch (when starting to batch), but it's unclear how that would be useful. This case means our estimates are off, and we don't really know how many batches there will be. We could use some arbitrary multiple, I guess. What I decided to do instead in the patch is to size the bloom filter for 1/8 of work_mem. That seems like a viable compromise, as it's unlikely to increase the number of batches.

4) Initially, the patch was aimed at hash joins with batches, on the premise that it can save some of the serialize/deserialize costs. But as mentioned in the discussion, bloom filters can be beneficial even in the single-batch case, when three conditions are met:

   a) the join is selective (i.e. some rows don't have a match in the hash table)

   b) hash table > CPU cache

   c) bloom filter < CPU cache

We don't have a good way to determine the size of the CPU cache, but I think we can use some crude arbitrary limits, e.g. require that the hash table is larger than 8MB and the bloom filter is smaller than 4MB, or something like that - see the sketch below.
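A minimal sketch of what such a decision rule might look like (all the names are hypothetical, and the 0.75 selectivity cutoff is just as arbitrary as the size limits):

    #include <stdbool.h>
    #include <stddef.h>

    /* crude stand-ins for "exceeds CPU cache" / "fits in CPU cache" */
    #define HASHTABLE_MIN_BYTES   (8 * 1024 * 1024)   /* 8MB */
    #define BLOOM_MAX_BYTES       (4 * 1024 * 1024)   /* 4MB */

    static bool
    use_bloom_filter(size_t hashtable_bytes, size_t bloom_bytes,
                     double match_fraction)
    {
        /* (a) join must be selective (many probes find no match) */
        if (match_fraction > 0.75)
            return false;

        /* (b) hash table too large to stay in the CPU cache */
        if (hashtable_bytes < HASHTABLE_MIN_BYTES)
            return false;

        /* (c) bloom filter small enough to stay cache-resident */
        if (bloom_bytes > BLOOM_MAX_BYTES)
            return false;

        return true;
    }

where match_fraction would come from the join cardinality estimate discussed above.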
Opinions?

regards

[1] https://www.postgresql.org/message-id/5670946E.8070705%402ndquadrant.com
[2] https://www.postgresql.org/message-id/9b9fd273-18e7-2b07-7aa1-4b00ab59b8d1%402ndquadrant.com
[3] https://www.postgresql.org/message-id/9b9fd273-18e7-2b07-7aa1-4b00ab59b8d1%402ndquadrant.com

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services