Re: Hash Joins vs. Bloom Filters / take 2 - Mailing list pgsql-hackers
From: Jim Finnerty
Subject: Re: Hash Joins vs. Bloom Filters / take 2
Date:
Msg-id: 1541433842302-0.post@n3.nabble.com
In response to: Re: Hash Joins vs. Bloom Filters / take 2 (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Hash Joins vs. Bloom Filters / take 2
List: pgsql-hackers
On Thu, Nov 1, 2018 at 5:07 PM Thomas Munro <[hidden email]> wrote:
> Would you compute the hash for the outer tuples in the scan, and then
> again in the Hash Join when probing, or would you want to (somehow)
> attach the hash to emitted tuples for later reuse by the higher node?

I'm still considering your question (it sounds reasonable, but it will add complexity), but I want to mention a couple of things:

One benefit of pushing a filter down to the Scan operation is the ability to apply its selectivity before certain other expensive operations. Some of these benefits are storage-engine specific: a column store, for example, can greatly reduce row-materialization costs by applying the runtime join filter before rows are materialized, and a row store that supports batch-mode operators benefits from the pushdown differently than a row store that does not. So we should probably think about pushing runtime filters down to the Scan operator in the broader context of an API to a pluggable storage engine [1] that may accept one or more runtime filters. I have a lot of catching up to do on the pluggable storage thread before I can comment on that, but maybe others can offer their opinions.

The runtime join filter might be represented in different ways depending on the data type and distinct cardinality of the join key(s) of the inner relation. As Thomas Munro hinted, there are optimizations for the integer-key case; in particular, if the range of the integer key is sufficiently small, you can avoid hashing entirely and do the membership test in a bit array using the key itself. In that case the membership test is exact, and the hash doesn't need to be computed or passed up to the join operator. (A sketch of this appears at the end of this message.)

re: "What really seems finicky to me about this whole project is the costing. In the best case it's a huge win; in the worst case it's a significant loss"

An example of a case where you'd get pure overhead is a hash join from the FK side to the PK side of an RI constraint, with no local predicate on the PK side. In that case every row would pass (assuming the RI constraint is properly enforced), so the filter wouldn't reject any rows from the FK side. For this specific case we could anticipate that creating a runtime join filter wouldn't be helpful, but a robust way to handle the problem in general is to collect runtime statistics and have the operator shut itself off when it isn't filtering enough rows to justify its overhead. For example, maintain counts of rows processed and rows filtered, and once the number of rows processed is above some minimum threshold and the observed pass rate is above some threshold, disable the filter either permanently or temporarily. (A second sketch at the end of this message shows this.)

Accurate estimation of the benefit of applying the join selectivity during the Scan is going to be storage-engine dependent, but as long as the operation can turn itself off when it isn't filtering enough rows to justify the per-row overhead on the outer side, the risk of the approach can be made very small.

/Jim

[1] https://www.postgresql-archive.org/Pluggable-storage-tc5916322.html

-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL

--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
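P.S. To make the small-integer-range case concrete, here is a minimal sketch in C. This is not PostgreSQL code: the ExactBitFilter name and API are made up for illustration, and allocation/error handling are pared to the bare minimum.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Exact membership filter for integer join keys known to lie in
 * [min_key, max_key].  One bit per possible key value: no hashing,
 * and no false positives, unlike a Bloom filter.
 */
typedef struct ExactBitFilter
{
    int64_t  min_key;
    int64_t  max_key;
    uint8_t *bits;
} ExactBitFilter;

static ExactBitFilter *
ebf_create(int64_t min_key, int64_t max_key)
{
    ExactBitFilter *f = malloc(sizeof(ExactBitFilter));
    size_t nbits = (size_t) (max_key - min_key + 1);

    f->min_key = min_key;
    f->max_key = max_key;
    f->bits = calloc((nbits + 7) / 8, 1);   /* all bits start at 0 */
    return f;
}

/* Build side: record each inner key seen while loading the hash table. */
static void
ebf_add(ExactBitFilter *f, int64_t key)
{
    size_t bit = (size_t) (key - f->min_key);

    f->bits[bit / 8] |= (uint8_t) (1u << (bit % 8));
}

/* Probe side (in the Scan): false means the row cannot possibly join. */
static bool
ebf_maybe_match(const ExactBitFilter *f, int64_t key)
{
    size_t bit;

    if (key < f->min_key || key > f->max_key)
        return false;
    bit = (size_t) (key - f->min_key);
    return (f->bits[bit / 8] & (1u << (bit % 8))) != 0;
}

The build side would call ebf_add() for each inner key while the hash table is built, and the Scan on the outer side would call ebf_maybe_match() to skip rows early. Because every possible key value has its own bit, a false answer is exact, so no hash ever needs to flow up to the join operator.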
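And a sketch of the self-disabling counters described above, again hypothetical code: the threshold constants are placeholders that would presumably become GUCs or cost-model inputs.

#include <stdbool.h>
#include <stdint.h>

typedef struct FilterStats
{
    uint64_t rows_processed;
    uint64_t rows_filtered;    /* rows rejected by the filter */
    bool     enabled;
} FilterStats;

#define FILTER_MIN_SAMPLE   10000  /* don't judge on too small a sample */
#define FILTER_MAX_PASS_PCT 90     /* shut off if >90% of rows pass */

static inline void
filter_observe(FilterStats *fs, bool row_was_filtered)
{
    fs->rows_processed++;
    if (row_was_filtered)
        fs->rows_filtered++;

    /*
     * Once the sample is large enough, disable the filter if it is
     * letting almost everything through (e.g. the FK -> PK join with
     * no local predicate on the PK side).
     */
    if (fs->enabled && fs->rows_processed >= FILTER_MIN_SAMPLE)
    {
        uint64_t passed = fs->rows_processed - fs->rows_filtered;

        if (passed * 100 >= fs->rows_processed * FILTER_MAX_PASS_PCT)
            fs->enabled = false;   /* could instead re-sample later */
    }
}

filter_observe() would be called once per outer row; once the sample is large enough and nearly all rows are passing (the FK -> PK case above), the filter turns itself off and the scan stops paying the per-row overhead.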