Re: Bloom filter Pushdown Optimization for Merge Join - Mailing list pgsql-hackers
From | Zhihong Yu |
---|---|
Subject | Re: Bloom filter Pushdown Optimization for Merge Join |
Date | |
Msg-id | CALNJ-vQ-t-2t0=P6P=0PgfJmN+p1U7jp8VMyXfuA_iieRzrpGQ@mail.gmail.com Whole thread Raw |
In response to | Re: Bloom filter Pushdown Optimization for Merge Join (Zhihong Yu <zyu@yugabyte.com>) |
List | pgsql-hackers |
On Sun, Oct 2, 2022 at 6:40 AM Zhihong Yu <zyu@yugabyte.com> wrote:
On Sat, Oct 1, 2022 at 12:45 AM Zhihong Yu <zyu@yugabyte.com> wrote:On Fri, Sep 30, 2022 at 9:20 PM Zhihong Yu <zyu@yugabyte.com> wrote:On Fri, Sep 30, 2022 at 8:40 PM Zhihong Yu <zyu@yugabyte.com> wrote:On Fri, Sep 30, 2022 at 3:44 PM Zheng Li <zhengli10@gmail.com> wrote:Hello,
A bloom filter provides early filtering of rows that cannot be joined
before they would reach the join operator, the optimization is also
called a semi join filter (SJF) pushdown. Such a filter can be created
when one child of the join operator must materialize its derived table
before the other child is evaluated.
For example, a bloom filter can be created using the the join keys for
the build side/inner side of a hash join or the outer side of a merge
join, the bloom filter can then be used to pre-filter rows on the
other side of the join operator during the scan of the base relation.
The thread about “Hash Joins vs. Bloom Filters / take 2” [1] is good
discussion on using such optimization for hash join without going into
the pushdown of the filter where its performance gain could be further
increased.
We worked on prototyping bloom filter pushdown for both hash join and
merge join. Attached is a patch set for bloom filter pushdown for
merge join. We also plan to send the patch for hash join once we have
it rebased.
Here is a summary of the patch set:
1. Bloom Filter Pushdown optimizes Merge Join by filtering rows early
during the table scan instead of later on.
-The bloom filter is pushed down along the execution tree to
the target SeqScan nodes.
-Experiments show that this optimization can speed up Merge
Join by up to 36%.
2. The planner makes the decision to use the bloom filter based on the
estimated filtering rate and the expected performance gain.
-The planner accomplishes this by estimating four numbers per
variable - the total number of rows of the relation, the number of
distinct values for a given variable, and the minimum and maximum
value of the variable (when applicable). Using these numbers, the
planner estimates a filtering rate of a potential filter.
-Because actually creating and implementing the filter adds
more operations, there is a minimum threshold of filtering where the
filter would actually be useful. Based on testing, we query to see if
the estimated filtering rate is higher than 35%, and that informs our
decision to use a filter or not.
3. If using a bloom filter, the planner also adjusts the expected cost
of Merge Join based on expected performance gain.
4. Capability to build the bloom filter in parallel in case of
parallel SeqScan. This is done efficiently by populating a local bloom
filter for each parallel worker and then taking a bitwise OR over all
the local bloom filters to form a shared bloom filter at the end of
the parallel SeqScan.
5. The optimization is GUC controlled, with settings of
enable_mergejoin_semijoin_filter and force_mergejoin_semijoin_filter.
We found in experiments that there is a significant improvement
when using the bloom filter during Merge Join. One experiment involved
joining two large tables while varying the theoretical filtering rate
(TFR) between the two tables, the TFR is defined as the percentage
that the two datasets are disjoint. Both tables in the merge join were
the same size. We tested changing the TFR to see the change in
filtering optimization.
For example, let’s imagine t0 has 10 million rows, which contain the
numbers 1 through 10 million randomly shuffled. Also, t1 has the
numbers 4 million through 14 million randomly shuffled. Then the TFR
for a join of these two tables is 40%, since 40% of the tables are
disjoint from the other table (1 through 4 million for t0, 10 million
through 14 million for t4).
Here is the performance test result joining two tables:
TFR: theoretical filtering rate
EFR: estimated filtering rate
AFR: actual filtering rate
HJ: hash join
MJ Default: default merge join
MJ Filter: merge join with bloom filter optimization enabled
MJ Filter Forced: merge join with bloom filter optimization forced
TFR EFR AFR HJ MJ Default MJ Filter MJ Filter Forced
-------------------------------------------------------------------------------------
10 33.46 7.41 6529 22638 21949 23160
20 37.27 14.85 6483 22290 21928 21930
30 41.32 22.25 6395 22374 20718 20794
40 45.67 29.7 6272 21969 19449 19410
50 50.41 37.1 6210 21412 18222 18224
60 55.64 44.51 6052 21108 17060 17018
70 61.59 51.98 5947 21020 15682 15737
80 68.64 59.36 5761 20812 14411 14437
90 77.83 66.86 5701 20585 13171 13200
Table. Execution Time (ms) vs Filtering Rate (%) for Joining Two
Tables of 10M Rows.
Attached you can find figures of the same performance test and a SQL script
to reproduce the performance test.
The first thing to notice is that Hash Join generally is the most
efficient join strategy. This is because Hash Join is better at
dealing with small tables, and our size of 10 million is still small
enough where Hash Join outperforms the other join strategies. Future
experiments can investigate using much larger tables.
However, comparing just within the different Merge Join variants, we
see that using the bloom filter greatly improves performance.
Intuitively, all of these execution times follow linear paths.
Comparing forced filtering versus default, we can see that the default
Merge Join outperforms Merge Join with filtering at low filter rates,
but after about 20% TFR, the Merge Join with filtering outperforms
default Merge Join. This makes intuitive sense, as there are some
fixed costs associated with building and checking with the bloom
filter. In the worst case, at only 10% TFR, the bloom filter makes
Merge Join less than 5% slower. However, in the best case, at 90% TFR,
the bloom filter improves Merge Join by 36%.
Based on the results of the above experiments, we came up with a
linear equation for the performance ratio for using the filter
pushdown from the actual filtering rate. Based on the numbers
presented in the figure, this is the equation:
T_filter / T_no_filter = 1 / (0.83 * estimated filtering rate + 0.863)
For example, this means that with an estimated filtering rate of 0.4,
the execution time of merge join is estimated to be improved by 16.3%.
Note that the estimated filtering rate is used in the equation, not
the theoretical filtering rate or the actual filtering rate because it
is what we have during planning. In practice the estimated filtering
rate isn’t usually accurate. In fact, the estimated filtering rate can
differ from the theoretical filtering rate by as much as 17% in our
experiments. One way to mitigate the power loss of bloom filter caused
by inaccurate estimated filtering rate is to adaptively turn it off at
execution time, this is yet to be implemented.
Here is a list of tasks we plan to work on in order to improve this patch:
1. More regression testing to guarantee correctness.
2. More performance testing involving larger tables and complicated query plans.
3. Improve the cost model.
4. Explore runtime tuning such as making the bloom filter checking adaptive.
5. Currently, only the best single join key is used for building the
Bloom filter. However, if there are several keys and we know that
their distributions are somewhat disjoint, we could leverage this fact
and use multiple keys for the bloom filter.
6. Currently, Bloom filter pushdown is only implemented for SeqScan
nodes. However, it would be possible to allow push down to other types
of scan nodes.
7. Explore if the Bloom filter could be pushed down through a foreign
scan when the foreign server is capable of handling it – which could
be made true for postgres_fdw.
8. Better explain command on the usage of bloom filters.
This patch set is prepared by Marcus Ma, Lyu Pan and myself. Feedback
is appreciated.
With Regards,
Zheng Li
Amazon RDS/Aurora for PostgreSQL
[1] https://www.postgresql.org/message-id/flat/c902844d-837f-5f63-ced3-9f7fd222f175%402ndquadrant.comHi,In the header of patch 1:In this prototype, the cost model is based on an assumption that there is a linear relationship between the performance gain from using a semijoin filter and the estimated filtering rate:% improvement to Merge Join cost = 0.83 * estimated filtering rate - 0.137.How were the coefficients (0.83 and 0.137) determined ?I guess they were based on the results of running certain workload.CheersHi,For patch 1:+bool enable_mergejoin_semijoin_filter;+bool force_mergejoin_semijoin_filter;How would (enable_mergejoin_semijoin_filter = off, force_mergejoin_semijoin_filter = on) be interpreted ?Have you considered using one GUC which has three values: off, enabled, forced ?+ mergeclauses_for_sjf = get_actual_clauses(path->path_mergeclauses);
+ mergeclauses_for_sjf = get_switched_clauses(path->path_mergeclauses,
+ path->jpath.outerjoinpath->parent->relids);mergeclauses_for_sjf is assigned twice and I don't see mergeclauses_for_sjf being reference in the call to get_switched_clauses().Is this intentional ?+ /* want at least 1000 rows_filtered to avoid any nasty edge cases */
+ if (force_mergejoin_semijoin_filter || (filteringRate >= 0.35 && rows_filtered > 1000))The above condition is narrower compared to the enclosing condition.Since there is no else block for the second if block, please merge the two if statements.+ int best_filter_clause;Normally I would think `clause` is represented by List*. But best_filter_clause is an int. Please use another variable name so that there is less chance of confusion.For evaluate_semijoin_filtering_rate():+ double best_sj_selectivity = 1.01;How was 1.01 determined ?+ debug_sj1("SJPD: start evaluate_semijoin_filtering_rate");There are debug statements in the methods.It would be better to remove them in the next patch set.CheersHi,Still patch 1.+ if (!outer_arg_md->is_or_maps_to_base_column
+ && !inner_arg_md->is_or_maps_to_constant)
+ {
+ debug_sj2("SJPD: outer equijoin arg does not map %s",+ "to a base column nor a constant; semijoin is not valid");Looks like there is a typo: inner_arg_md->is_or_maps_to_constant should be outer_arg_md->is_or_maps_to_constant+ if (outer_arg_md->est_col_width > MAX_SEMIJOIN_SINGLE_KEY_WIDTH)
+ {
+ debug_sj2("SJPD: outer equijoin column's width %s",
+ "was excessive; condition rejected");How is the value of MAX_SEMIJOIN_SINGLE_KEY_WIDTH determined ?For verify_valid_pushdown():+ Assert(path);
+ Assert(target_var_no > 0);
+
+ if (path == NULL)
+ {
+ return false;I don't understand the first assertion. Does it mean path would always be non-NULL ? Then the if statement should be dropped.+ if (path->parent->relid == target_var_no)
+ {
+ /*
+ * Found source of target var! We know that the pushdown
+ * is valid now.
+ */
+ return true;
+ }
+ return false;The above can be simplified as: return path->parent->relid == target_var_no;+ * True if the given con_exprs, ref_exprs and operators will exactltyTypo: exactlty -> exactly+ if (!bms_equal(all_vars, matched_vars))
+ return false;
+ return true;The above can be simplified as: return bms_equal(all_vars, matched_vars);CheersHi,Still in patch 1 :-)+ if (best_path->use_semijoinfilter)
+ {+ if (best_path->best_mergeclause != -1)Since there is no else block, the two conditions can be combined.+ ListCell *clause_cell = list_nth_cell(mergeclauses, best_path->best_mergeclause);As shown in the above code, best_mergeclause is the position of the best merge clause in mergeclauses.I think best_mergeclause_pos (or similar name) is more appropriate for the fieldname.For depth_of_semijoin_target():+ * Parameters:
+ * node: plan node to be considered for semijoin push down.The name of the parameter is pn - please align the comment with code.For T_SubqueryScan case in depth_of_semijoin_target():+ Assert(rte->subquery->targetList);...+ if (rel && rel->subroot
+ && rte && rte->subquery && rte->subquery->targetList)It seems the condition can be simplified since rte->subquery->targetList has passed the assertion.For is_table_scan_node_source_of_relids_or_var(), the else block can be simplified to returning scan_node_varno == target_var->varno directly.For get_appendrel_occluded_references():+ * Given a virtual column from an Union ALL subquery,
+ * return the expression it immediately occludes that satisfySince the index is returned from the func, it would be better to clarify the comment by saying `return the last index of expression ...`+ /* Subquery without append and partitioned tables */append and partitioned tables -> append or partitioned tablesMore reviews for subsequent patches to follow.
Hi,
For 0002-Support-semijoin-filter-in-the-executor-for-non-para.patch ,
+ !((SeqScanState *) node)->applySemiJoinFilter)
I am confused by the last two clauses in the condition. If !IsA(node, SeqScanState) is true, why does the last clause cast node to SeqScanState * ?
I think you forgot to put the last two clauses in a pair of parentheses.
+ /* slot did not pass SemiJoinFilter, so skipping it. */
skipping it -> skip it
+ /* double row estimate to reduce error rate for Bloom filter */
+ *nodeRows = Max(*nodeRows, scan->ss.ps.plan->plan_rows * 2);
+ *nodeRows = Max(*nodeRows, scan->ss.ps.plan->plan_rows * 2);
Probably add more comment above about why the row count is doubled and how the error rate is reduced.
+SemiJoinFilterExamineSlot(List *semiJoinFilters, TupleTableSlot *slot, Oid tableId)
SemiJoinFilterExamineSlot -> ExamineSlotUsingSemiJoinFilter
Cheers
pgsql-hackers by date: