Home > mailing lists

Re: WIP: bloom filter in Hash Joins with batches - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: WIP: bloom filter in Hash Joins with batches
Date	January 10, 2016 21:11:55
Msg-id	5692C90E.9000300@2ndquadrant.com Whole thread Raw
In response to	Re: WIP: bloom filter in Hash Joins with batches (Peter Geoghegan <pg@heroku.com>)
List	pgsql-hackers

Tree view

Hi,

On 01/10/2016 05:11 AM, Peter Geoghegan wrote:
> On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Which means the "dim.r" column has 100 different values (0-99) with uniform
>> distribution. So e.g. "WHERE r < 15" matches 15%.
>
> I think that the use of a uniform distribution to demonstrate this
> patch is a bad idea, unless you want to have a conversation about the
> worst case.
>
> Look at the use cases for bloom filters in general. They're almost
> all some variation on the same theme: checking a bloom filter
> inexpensively, to avoid an expensive cache miss (of some fashion).

Right.

> The Google Chrome web browser uses a bloom filter for malicious URLs.

FWIW I don't think Chrome is using bloom filter for this purpose 
anymore. Chromium certainly does not (it's using PrefixSet instead).

> It usually avoids consulting Google's servers about whether or not
> any given URL that is visited is malicious by using the bloom filter.
> This is effective in large part because the top 100 websites ranked
> by popularity have a huge falloff in popularity as you go down the
> list. It looks like a Zipfian distribution, and so I imagine they get
> pretty far with a very small bloom filter, even though in theory the
> chances of any given valid URL resulting in a cache miss is very
> high.

I'm not familiar with how Chrome used bloom filters, but I'd expect them 
to be very careful about false positives (and also false negatives, as 
the bloom filter can't contain all malicious URLs).

My assumptions is that they've been able to make that work because they 
do have detailed stats about how frequently people visit those URLs, and 
use that when building the bloom filter. But we don't have such 
information in hashjoin.

>
> Generally, uniform distributions are rare in a non-canonical list of
> things, like a hash join outer relation's attribute.

Well, I'm not claiming testing uniform distribution is enough, but 
surely it's one of the cases we should handle just fine.

The problem with non-uniform cases is that it really depends on the 
outer side of the join.

For example let's say the hash table contains 1000 values, and the bloom 
filter has 1% false positive rate. But let's assume that the outer side 
has a value that triggers the false positive rate, and that it's 
actually 99% of the outer table. Suddenly, you have 99% false positive 
rate, rendering the bloom filter pointless.

I don't think this is fixable while creating the bloom filter. All we 
can do is watch the bloom filter lookups and disable the bloom filter 
once the false positive rate reaches some threshold.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-hackers by date:

From: Simon Riggs
Date: 10 January 2016, 20:50:49
Subject: Re: [COMMITTERS] pgsql: Avoid pin scan for replay of XLOG_BTREE_VACUUM

From: David Rowley
Date: 10 January 2016, 21:38:32
Subject: Re: WIP: bloom filter in Hash Joins with batches

Re: WIP: bloom filter in Hash Joins with batches - Mailing list pgsql-hackers

Previous

Next