Re: Bloom filter Pushdown Optimization for Merge Join - Mailing list pgsql-hackers

From Zhihong Yu
Subject Re: Bloom filter Pushdown Optimization for Merge Join
Date
Msg-id CALNJ-vRHUtztCGpjdvas9KyOndoOqwH1ZM8K9GoEGKJAHaW8fQ@mail.gmail.com
Whole thread Raw
In response to Re: Bloom filter Pushdown Optimization for Merge Join  (Zhihong Yu <zyu@yugabyte.com>)
Responses Re: Bloom filter Pushdown Optimization for Merge Join
List pgsql-hackers


On Sat, Oct 1, 2022 at 12:45 AM Zhihong Yu <zyu@yugabyte.com> wrote:


On Fri, Sep 30, 2022 at 9:20 PM Zhihong Yu <zyu@yugabyte.com> wrote:


On Fri, Sep 30, 2022 at 8:40 PM Zhihong Yu <zyu@yugabyte.com> wrote:


On Fri, Sep 30, 2022 at 3:44 PM Zheng Li <zhengli10@gmail.com> wrote:
Hello,

A bloom filter provides early filtering of rows that cannot be joined
before they would reach the join operator, the optimization is also
called a semi join filter (SJF) pushdown. Such a filter can be created
when one child of the join operator must materialize its derived table
before the other child is evaluated.

For example, a bloom filter can be created using the the join keys for
the build side/inner side of a hash join or the outer side of a merge
join, the bloom filter can then be used to pre-filter rows on the
other side of the join operator during the scan of the base relation.
The thread about “Hash Joins vs. Bloom Filters / take 2” [1] is good
discussion on using such optimization for hash join without going into
the pushdown of the filter where its performance gain could be further
increased.

We worked on prototyping bloom filter pushdown for both hash join and
merge join. Attached is a patch set for bloom filter pushdown for
merge join. We also plan to send the patch for hash join once we have
it rebased.

Here is a summary of the patch set:
1. Bloom Filter Pushdown optimizes Merge Join by filtering rows early
during the table scan instead of later on.
        -The bloom filter is pushed down along the execution tree to
the target SeqScan nodes.
        -Experiments show that this optimization can speed up Merge
Join by up to 36%.

2. The planner makes the decision to use the bloom filter based on the
estimated filtering rate and the expected performance gain.
        -The planner accomplishes this by estimating four numbers per
variable - the total number of rows of the relation, the number of
distinct values for a given variable, and the minimum and maximum
value of the variable (when applicable). Using these numbers, the
planner estimates a filtering rate of a potential filter.
        -Because actually creating and implementing the filter adds
more operations, there is a minimum threshold of filtering where the
filter would actually be useful. Based on testing, we query to see if
the estimated filtering rate is higher than 35%, and that informs our
decision to use a filter or not.

3. If using a bloom filter, the planner also adjusts the expected cost
of Merge Join based on expected performance gain.

4. Capability to build the bloom filter in parallel in case of
parallel SeqScan. This is done efficiently by populating a local bloom
filter for each parallel worker and then taking a bitwise OR over all
the local bloom filters to form a shared bloom filter at the end of
the parallel SeqScan.

5. The optimization is GUC controlled, with settings of
enable_mergejoin_semijoin_filter and force_mergejoin_semijoin_filter.

We found in experiments that there is a significant improvement
when using the bloom filter during Merge Join. One experiment involved
joining two large tables while varying the theoretical filtering rate
(TFR) between the two tables, the TFR is defined as the percentage
that the two datasets are disjoint. Both tables in the merge join were
the same size. We tested changing the TFR to see the change in
filtering optimization.

For example, let’s imagine t0 has 10 million rows, which contain the
numbers 1 through 10 million randomly shuffled. Also, t1 has the
numbers 4 million through 14 million randomly shuffled. Then the TFR
for a join of these two tables is 40%, since 40% of the tables are
disjoint from the other table (1 through 4 million for t0, 10 million
through 14 million for t4).

Here is the performance test result joining two tables:
TFR: theoretical filtering rate
EFR: estimated filtering rate
AFR: actual filtering rate
HJ: hash join
MJ Default: default merge join
MJ Filter: merge join with bloom filter optimization enabled
MJ Filter Forced: merge join with bloom filter optimization forced

TFR   EFR   AFR   HJ   MJ Default   MJ Filter   MJ Filter Forced
-------------------------------------------------------------------------------------
10     33.46   7.41    6529   22638        21949            23160
20     37.27  14.85   6483   22290        21928            21930
30     41.32   22.25  6395   22374        20718            20794
40     45.67   29.7    6272   21969        19449            19410
50     50.41   37.1    6210   21412        18222            18224
60     55.64   44.51  6052   21108        17060            17018
70     61.59   51.98  5947   21020        15682            15737
80     68.64   59.36  5761   20812        14411            14437
90     77.83   66.86  5701   20585        13171            13200
Table. Execution Time (ms) vs Filtering Rate (%) for Joining Two
Tables of 10M Rows.

Attached you can find figures of the same performance test and a SQL script
to reproduce the performance test.

The first thing to notice is that Hash Join generally is the most
efficient join strategy. This is because Hash Join is better at
dealing with small tables, and our size of 10 million is still small
enough where Hash Join outperforms the other join strategies. Future
experiments can investigate using much larger tables.

However, comparing just within the different Merge Join variants, we
see that using the bloom filter greatly improves performance.
Intuitively, all of these execution times follow linear paths.
Comparing forced filtering versus default, we can see that the default
Merge Join outperforms Merge Join with filtering at low filter rates,
but after about 20% TFR, the Merge Join with filtering outperforms
default Merge Join. This makes intuitive sense, as there are some
fixed costs associated with building and checking with the bloom
filter. In the worst case, at only 10% TFR, the bloom filter makes
Merge Join less than 5% slower. However, in the best case, at 90% TFR,
the bloom filter improves Merge Join by 36%.

Based on the results of the above experiments, we came up with a
linear equation for the performance ratio for using the filter
pushdown from the actual filtering rate. Based on the numbers
presented in the figure, this is the equation:

T_filter / T_no_filter = 1 / (0.83 * estimated filtering rate + 0.863)

For example, this means that with an estimated filtering rate of 0.4,
the execution time of merge join is estimated to be improved by 16.3%.
Note that the estimated filtering rate is used in the equation, not
the theoretical filtering rate or the actual filtering rate because it
is what we have during planning. In practice the estimated filtering
rate isn’t usually accurate. In fact, the estimated filtering rate can
differ from the theoretical filtering rate by as much as 17% in our
experiments. One way to mitigate the power loss of bloom filter caused
by inaccurate estimated filtering rate is to adaptively turn it off at
execution time, this is yet to be implemented.

Here is a list of tasks we plan to work on in order to improve this patch:
1. More regression testing to guarantee correctness.
2. More performance testing involving larger tables and complicated query plans.
3. Improve the cost model.
4. Explore runtime tuning such as making the bloom filter checking adaptive.
5. Currently, only the best single join key is used for building the
Bloom filter. However, if there are several keys and we know that
their distributions are somewhat disjoint, we could leverage this fact
and use multiple keys for the bloom filter.
6. Currently, Bloom filter pushdown is only implemented for SeqScan
nodes. However, it would be possible to allow push down to other types
of scan nodes.
7. Explore if the Bloom filter could be pushed down through a foreign
scan when the foreign server is capable of handling it – which could
be made true for postgres_fdw.
8. Better explain command on the usage of bloom filters.

This patch set is prepared by Marcus Ma, Lyu Pan and myself. Feedback
is appreciated.

With Regards,
Zheng Li
Amazon RDS/Aurora for PostgreSQL

[1] https://www.postgresql.org/message-id/flat/c902844d-837f-5f63-ced3-9f7fd222f175%402ndquadrant.com

Hi,
In the header of patch 1:

In this prototype, the cost model is based on an assumption that there is a linear relationship between the performance gain from using a semijoin filter and the estimated filtering rate:
% improvement to Merge Join cost = 0.83 * estimated filtering rate - 0.137.

How were the coefficients (0.83 and 0.137) determined ?
I guess they were based on the results of running certain workload.

Cheers
Hi,
For patch 1:

+bool       enable_mergejoin_semijoin_filter;
+bool       force_mergejoin_semijoin_filter; 

How would (enable_mergejoin_semijoin_filter = off, force_mergejoin_semijoin_filter = on) be interpreted ?
Have you considered using one GUC which has three values: off, enabled, forced ?

+       mergeclauses_for_sjf = get_actual_clauses(path->path_mergeclauses);
+       mergeclauses_for_sjf = get_switched_clauses(path->path_mergeclauses,
+                                                   path->jpath.outerjoinpath->parent->relids);

mergeclauses_for_sjf is assigned twice and I don't see mergeclauses_for_sjf being reference in the call to get_switched_clauses().
Is this intentional ?

+           /* want at least 1000 rows_filtered to avoid any nasty edge cases */
+           if (force_mergejoin_semijoin_filter || (filteringRate >= 0.35 && rows_filtered > 1000))

The above condition is narrower compared to the enclosing condition.
Since there is no else block for the second if block, please merge the two if statements.

+   int         best_filter_clause;

Normally I would think `clause` is represented by List*. But best_filter_clause is an int. Please use another variable name so that there is less chance of confusion.

For evaluate_semijoin_filtering_rate():

+   double      best_sj_selectivity = 1.01;

How was 1.01 determined ?

+   debug_sj1("SJPD:  start evaluate_semijoin_filtering_rate");

There are debug statements in the methods.
It would be better to remove them in the next patch set.

Cheers
Hi,
Still patch 1.

+       if (!outer_arg_md->is_or_maps_to_base_column
+           && !inner_arg_md->is_or_maps_to_constant)
+       {
+           debug_sj2("SJPD:        outer equijoin arg does not map %s",
+                     "to a base column nor a constant; semijoin is not valid");

Looks like there is a typo:  inner_arg_md->is_or_maps_to_constant should be outer_arg_md->is_or_maps_to_constant

+       if (outer_arg_md->est_col_width > MAX_SEMIJOIN_SINGLE_KEY_WIDTH)
+       {
+           debug_sj2("SJPD:        outer equijoin column's width %s",
+                     "was excessive; condition rejected");

How is the value of MAX_SEMIJOIN_SINGLE_KEY_WIDTH determined ?

For verify_valid_pushdown():

+   Assert(path);
+   Assert(target_var_no > 0);
+
+   if (path == NULL)
+   {
+       return false;

I don't understand the first assertion. Does it mean path would always be non-NULL ? Then the if statement should be dropped.

+               if (path->parent->relid == target_var_no)
+               {
+                   /*
+                    * Found source of target var! We know that the pushdown
+                    * is valid now.
+                    */
+                   return true;
+               }
+               return false;

The above can be simplified as: return path->parent->relid == target_var_no;

+ *     True if the given con_exprs, ref_exprs and operators will exactlty

Typo: exactlty -> exactly

+   if (!bms_equal(all_vars, matched_vars))
+       return false;
+   return true;

The above can be simplified as: return bms_equal(all_vars, matched_vars);

Cheers
Hi,
Still in patch 1 :-)

+   if (best_path->use_semijoinfilter)
+   {
+       if (best_path->best_mergeclause != -1) 

Since there is no else block, the two conditions can be combined.

+           ListCell   *clause_cell = list_nth_cell(mergeclauses, best_path->best_mergeclause);

As shown in the above code, best_mergeclause is the position of the best merge clause in mergeclauses.
I think best_mergeclause_pos (or similar name) is more appropriate for the fieldname.

For depth_of_semijoin_target():

+ *  Parameters:
+ *  node: plan node to be considered for semijoin push down.

The name of the parameter is pn - please align the comment with code.

For T_SubqueryScan case in depth_of_semijoin_target():

+               Assert(rte->subquery->targetList);
...
+               if (rel && rel->subroot
+                   && rte && rte->subquery && rte->subquery->targetList)

It seems the condition can be simplified since rte->subquery->targetList has passed the assertion.

For is_table_scan_node_source_of_relids_or_var(), the else block can be simplified to returning scan_node_varno == target_var->varno directly.

For get_appendrel_occluded_references():

+ *  Given a virtual column from an Union ALL subquery,
+ *  return the expression it immediately occludes that satisfy

Since the index is returned from the func, it would be better to clarify the comment by saying `return the last index of expression ...`

+   /* Subquery without append and partitioned tables */

append and partitioned tables -> append or partitioned tables

More reviews for subsequent patches to follow.

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: disfavoring unparameterized nested loops
Next
From: Zhihong Yu
Date:
Subject: Re: Bloom filter Pushdown Optimization for Merge Join