Hash Joins vs. Bloom Filters / take 2 - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Hash Joins vs. Bloom Filters / take 2
Msg-id: c902844d-837f-5f63-ced3-9f7fd222f175@2ndquadrant.com

Hi,

In 2015/2016 I explored whether we could improve hash joins by
leveraging bloom filters [1], and I was reminded of this idea in a
thread about amcheck [2]. I also see that bloom filters were briefly
mentioned in the thread about parallel hash [3].

So I've decided to revive the old patch, rebase it to current master,
and see if we can resolve the issues that killed it in 2016.

The issues essentially boil down to predicting when the bloom filter
can actually help, which is more a question of the hash join
selectivity than of the bloom filter parameters.

Consider a join on tables with FK on the join condition. That is,
something like this:

    CREATE TABLE dim (id serial primary key, ...);
    CREATE TABLE fact (dim_id int references dim(id), ...);

Then if you do a join

    SELECT * FROM fact JOIN dim ON (id = dim_id);

the bloom filter can't really help, because all the dim_id values are
guaranteed to be in the hash table. So it can only really help with
queries like this:

    SELECT * FROM fact JOIN dim ON (id = dim_id) WHERE (dim.x = 1000);

where the additional conditions on dim make the hash table incomplete
in some sense, so for some of the probe values the bloom filter will
return false, saving quite a bit of expensive work (a lookup in a large
hash table, or spilling the tuple to disk).

This is not the same as the "false positive rate", which is a property
of the bloom filter itself and can be estimated simply by counting the
bits set in the filter. That is important too, of course, but it is
simple to determine.
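
For example, with k independent hash functions the expected false
positive rate is roughly (fraction of bits set)^k, so it can be
estimated like this (just a sketch; the function name and signature are
mine, not from the patch):

    #include <math.h>
    #include <stdint.h>

    /*
     * Rough estimate of the false positive rate, based only on the
     * observed fill factor: a lookup of a non-member is a false
     * positive iff all k probed bits happen to be set, which has
     * probability ~ (nbits_set / nbits)^k.
     */
    static double
    bloom_false_positive_rate(uint64_t nbits_set, uint64_t nbits,
                              int nhashes)
    {
        double fill = (double) nbits_set / (double) nbits;

        return pow(fill, nhashes);
    }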

The bloom filter "selectivity" is more a feature of the join, i.e. what
fraction of the values looked up in the bloom filter are expected to be
rejected. If many are rejected, the bloom filter helps. If few, the
bloom filter is a waste of time.

After thinking about this a bit more, I think this is pretty much what
we do during join cardinality estimation - we estimate how many of the
rows in "fact" have a matching row in "dim". I do think we can reuse
that to decide if to use a bloom filter or not.

Of course, the decision may need to be more elaborate (and perhaps
formulated as part of costing), but the filter selectivity is the
crucial piece. The attached patch does not do that yet, though.
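
Just to illustrate the kind of decision I mean, a trivial version might
look like the sketch below (the names and the 0.75 cutoff are made up,
and proper costing should eventually replace this):

    #include <stdbool.h>

    /*
     * Build the bloom filter only when the join is expected to reject
     * a sizable fraction of the outer (probe) rows.  The fraction of
     * matching outer rows would come from the join cardinality
     * estimate; the 0.75 threshold is an arbitrary placeholder.
     */
    static bool
    bloom_filter_worthwhile(double outer_rows, double matched_outer_rows)
    {
        double match_fraction = matched_outer_rows / outer_rows;

        return match_fraction < 0.75;
    }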

The attached patch:

1) Only works in the non-parallel case for now. Fixing this should not
be a big deal, once the costing/planning gets addressed.


2) Adds details about the bloom filter to EXPLAIN ANALYZE output, with
various important metrics (size of the filter, number of hash functions,
lookups/matches, fill factor).


3) Removes the HLL counter. I came to the conclusion that using HLL to
size the bloom filter makes very little sense, because there are
essentially three cases:

a) no batching (hash table fits into work_mem)

We can easily walk the hash table and compute a "good enough" ndistinct
estimate by counting the occupied buckets (or just use the number of
tuples in the hash table, assuming the values are distinct).

At this point, the patch does not build the bloom filter in single-batch
cases, unless forced to do that by setting force_hashjoin_bloom=true.
More about this later.


b) batching from the beginning

The HLL is useless, because we need to initialize the bloom filter
before actually seeing any values. Instead, the patch simply uses the
expected number of rows (assuming those are distinct).
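
The sizing itself can then follow the usual textbook formulas - for n
expected (distinct) values and a target false positive rate p, use
m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions.
Roughly (a sketch, not the patch code):

    #include <math.h>
    #include <stdint.h>

    static void
    bloom_size_filter(double ndistinct, double fpr,
                      uint64_t *nbits, int *nhashes)
    {
        /* optimal number of bits and hash functions for the target fpr */
        double  m = -ndistinct * log(fpr) / (M_LN2 * M_LN2);
        int     k = (int) rint((m / ndistinct) * M_LN2);

        *nbits = (uint64_t) ceil(m);
        *nhashes = (k < 1) ? 1 : k;
    }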


c) starting in single-batch mode, switching to batches later

We could use HLL to estimate the number of distinct values in the
initial batch (when starting to batch), but it's unclear how that would
be useful. This case means our estimates are off, and we don't really
know how many batches there will be. We could use some arbitrary
multiple, I guess.

What the patch does instead is size the bloom filter for 1/8 of
work_mem. That seems like a viable compromise, as it's unlikely to
increase the number of batches.
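
In that case the filter size is driven by the memory cap rather than by
an ndistinct estimate, roughly as in this sketch (names are made up;
work_mem is in kB as usual):

    #include <math.h>
    #include <stdint.h>

    /*
     * Cap the filter at 1/8 of work_mem and pick the number of hash
     * functions from the number of values seen so far.
     */
    static void
    bloom_size_from_work_mem(int work_mem, double nvalues_so_far,
                             uint64_t *nbits, int *nhashes)
    {
        uint64_t    nbytes = (uint64_t) work_mem * 1024 / 8;
        int         k;

        if (nvalues_so_far < 1)
            nvalues_so_far = 1;

        *nbits = nbytes * 8;
        k = (int) rint(((double) *nbits / nvalues_so_far) * M_LN2);
        *nhashes = (k < 1) ? 1 : k;
    }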


4) Initially, the patch was aimed at hash joins with batches, on the
premise that it could save some of the serialize/deserialize costs. But
as mentioned in the discussion, bloom filters can be beneficial even in
the single-batch case, when three conditions are met:

a) the join is selective (i.e. some rows don't have a match in the hash
table)

b) hash table > CPU cache

c) bloom filter < CPU cache

We don't have a good way to determine the size of the CPU cache, but I
think we can use some crude arbitrary limits, e.g. require that the hash
table is larger than 8MB and the bloom filter is smaller than 4MB, or
something like that.
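
Expressed as a condition, that crude check would be something like this
(a sketch, with the arbitrary thresholds from above):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Crude stand-in for "hash table no longer fits into the CPU cache,
     * but the bloom filter does": both limits are arbitrary.
     */
    static bool
    bloom_filter_useful_single_batch(uint64_t hashtable_bytes,
                                     uint64_t filter_bytes)
    {
        return hashtable_bytes > 8 * 1024 * 1024 &&
               filter_bytes < 4 * 1024 * 1024;
    }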

Opinions?


regards


[1] https://www.postgresql.org/message-id/5670946E.8070705%402ndquadrant.com

[2] https://www.postgresql.org/message-id/9b9fd273-18e7-2b07-7aa1-4b00ab59b8d1%402ndquadrant.com

[3] https://www.postgresql.org/message-id/9b9fd273-18e7-2b07-7aa1-4b00ab59b8d1%402ndquadrant.com

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
