Re: Bloom filter Pushdown Optimization for Merge Join - Mailing list pgsql-hackers

From Zhihong Yu
Subject Re: Bloom filter Pushdown Optimization for Merge Join
Date
Msg-id CALNJ-vS-6ckvKdBd0NP4aG01F-zRxVCuJfWOWEkWgxKni_DRUA@mail.gmail.com
In response to Re: Bloom filter Pushdown Optimization for Merge Join  (Zhihong Yu <zyu@yugabyte.com>)
Responses Re: Bloom filter Pushdown Optimization for Merge Join
List pgsql-hackers


On Wed, Oct 12, 2022 at 4:35 PM Zhihong Yu <zyu@yugabyte.com> wrote:


On Wed, Oct 12, 2022 at 3:36 PM Lyu Pan <lyu.steve.pan@gmail.com> wrote:
Hello Zhihong Yu & Tomas Vondra,

Thank you so much for your review and feedback!

We made some updates based on previous feedback and attached the new
patch set. Due to time constraints, we didn't get to resolve all the
comments, and we'll continue to improve this patch.

> In this prototype, the cost model is based on an assumption that there is
> a linear relationship between the performance gain from using a semijoin
> filter and the estimated filtering rate:
> % improvement to Merge Join cost = 0.83 * estimated filtering rate - 0.137.
>
> How were the coefficients (0.83 and 0.137) determined ?
> I guess they were based on the results of running certain workload.

Right, the coefficients (0.83 and 0.137) were determined based on some
preliminary testing. The current costing model is pretty naive, and
we'll work on a more robust costing model in future work.
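
A minimal sketch of how such a linear model could be applied as a discount to
the merge join cost, using the coefficients quoted above (the function name is
hypothetical and this is not the patch's actual code):

/*
 * Hypothetical helper: turn the estimated filtering rate into a cost
 * discount using the naive linear model discussed above, clamped so a
 * low filtering rate never produces a negative "improvement".
 */
static Cost
semijoin_filter_cost_discount(Cost mergejoin_run_cost, double filtering_rate)
{
    double      improvement = 0.83 * filtering_rate - 0.137;

    if (improvement < 0.0)
        improvement = 0.0;

    return mergejoin_run_cost * improvement;
}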


> I agree, in principle, although I think the current logic / formula is a
> bit too crude and fitted to the simple data used in the test. I think
> this needs to be formulated as a regular costing issue, considering
> stuff like cost of the hash functions, and so on.
>
> I think this needs to do two things:
>
> 1) estimate the cost of building the bloom filter - This shall depend on
> the number of rows in the inner relation, number/cost of the hash
> functions (which may be higher for some data types), etc.
>
> 2) estimate improvement for the probing branch - Essentially, we need to
> estimate how much we save by filtering some of the rows, but this also
> needs to include the cost of probing the bloom filter.
>
> This will probably require some improvements to the lib/bloomfilter, in
> order to estimate the false positive rate - this may matter a lot for
> large data sets and small work_mem values. The bloomfilter library
> simply reduces the size of the bloom filter, which increases the false
> positive rate. At some point it'll start reducing the benefit.
>

These suggestions make a lot of sense. The current costing model is
definitely not good enough, and we plan to work on a more robust
costing model as we continue to improve the patch.
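
A rough sketch of the two-part estimate described above, assuming hypothetical
helper names and per-tuple cost parameters (this is not the patch's code):

/*
 * 1) cost of building the filter on the inner side, plus
 * 2) the net effect on the probing (outer) side, accounting for the
 *    false positive rate.
 */
static Cost
semijoin_filter_net_benefit(double inner_rows, double outer_rows,
                            double filtering_rate, double false_positive_rate,
                            Cost hash_cost_per_tuple, Cost probe_cost_per_tuple,
                            Cost saved_cost_per_filtered_tuple)
{
    /* building: hash every inner tuple once */
    Cost        build_cost = inner_rows * hash_cost_per_tuple;

    /* probing: every outer tuple pays a probe ... */
    Cost        probe_cost = outer_rows * probe_cost_per_tuple;

    /* ... but only truly filtered tuples (not false positives) save work */
    double      filtered = outer_rows * filtering_rate * (1.0 - false_positive_rate);
    Cost        savings = filtered * saved_cost_per_filtered_tuple;

    return savings - (build_cost + probe_cost);
}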


> OK. Could also build the bloom filter in shared memory?
>

We considered this approach but did not prefer it, because if all
worker processes share the same bloom filter in shared memory, we
would need to frequently lock and unlock the bloom filter to avoid
race conditions. So we decided to have each worker process create its
own bloom filter.
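
For reference, a minimal sketch of the per-worker approach, where each worker
builds and probes a private filter so no locking is needed. Only the
bloom_create/bloom_add_element/bloom_lacks_element calls are the existing
lib/bloomfilter API; the function names and the by-value Datum hashing are
assumptions:

#include "postgres.h"
#include "lib/bloomfilter.h"

/*
 * Each worker creates its own filter, sized from the inner-side row
 * estimate and a work_mem budget (in KB), so no shared state is needed.
 */
static bloom_filter *
build_private_semijoin_filter(int64 est_inner_rows, int work_mem_kb, uint64 seed)
{
    return bloom_create(est_inner_rows, work_mem_kb, seed);
}

/*
 * Add one join-key value while scanning the inner side.  Hashing the raw
 * Datum only works for by-value types; by-reference types would need to
 * hash the pointed-to data instead.
 */
static void
filter_add_key(bloom_filter *filter, Datum key)
{
    bloom_add_element(filter, (unsigned char *) &key, sizeof(Datum));
}

/* Probe on the outer side; true means the key definitely has no match. */
static bool
filter_definitely_absent(bloom_filter *filter, Datum key)
{
    return bloom_lacks_element(filter, (unsigned char *) &key, sizeof(Datum));
}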


> IMHO we shouldn't make too many conclusions from these examples. Yes, it
> shows merge join can be improved, but only in cases where a hash join works
> better, so we wouldn't use merge join anyway.
>
> I think we should try constructing examples where either merge join wins
> already (and gets further improved by the bloom filter), or would lose
> to hash join and the bloom filter improves it enough to win.
>
> AFAICS that requires a join of two large tables - large enough that hash
> join would need to be batched, or pre-sorted inputs (which eliminates
> the explicit Sort, which is the main cost in most cases).
>
> The current patch only works with sequential scans, which eliminates the
> second (pre-sorted) option. So let's try the first one - can we invent
> an example with a join of two large tables where a merge join would win?
>
> Can we find such an example in existing benchmarks like TPC-H/TPC-DS?
>

Agreed. The current examples are only intended to show that using
bloom filters in merge join could improve merge join performance in
some cases. We are working on more examples in which a merge join with
a bloom filter outperforms a hash join, which should be more
persuasive.


> The bloom filter is built by the first seqscan (on t0), and then used by
> the second seqscan (on t1). But this only works because we always run
> the t0 scan to completion (because we're feeding it into Sort) before we
> start scanning t1.
>
> But when the scan on t1 switches to an index scan, it's over - we'd be
> building the filter without being able to probe it, and when we finish
> building it we no longer need it. So this seems pretty futile.
>
> It might still improve plans like
>
>    ->  Merge Join
>          Merge Cond: (t0.c1 = t1.c1)
>          SemiJoin Filter Created Based on: (t0.c1 = t1.c1)
>          SemiJoin Estimated Filtering Rate: 1.0000
>         ->  Sort
>                Sort Key: t0.c1
>                ->  Seq Scan on t0
>          ->  Index Scan on t1
>
> But I don't know how common/likely that actually is. I'd expect to have
> an index on both sides, but perhaps I'm wrong.
>
> This is why hashjoin seems like a more natural fit for the bloom filter,
> BTW, because there we have a guarantee the inner relation is processed
> first (so we know the bloom filter is fine and can be probed).
>

Great observation. The bloom filter only works if the first SeqScan
always runs to completion before the second SeqScan starts.
One possible way to avoid a futile bloom filter might be to enforce
that the bloom filter is only used when both the outer and inner plans
of the MergeJoin are Sort nodes, which guarantees the filter is ready
to use after one side of the join has been processed, but this may be
too restrictive.
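
A sketch of what that restriction could look like, with hypothetical naming
(the real check would live wherever the planner decides to attach the filter):

#include "postgres.h"
#include "nodes/plannodes.h"

/*
 * Only consider a semijoin filter when both mergejoin inputs are Sort
 * nodes, so the filter built under one Sort is guaranteed to be complete
 * before the other side starts probing.
 */
static bool
semijoin_filter_is_safe(Plan *outer_plan, Plan *inner_plan)
{
    return IsA(outer_plan, Sort) && IsA(inner_plan, Sort);
}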


> I don't know what improvements you have in mind exactly, but I think
> it'd be good to show which node is building/using a bloom filter, and
> then also some basic stats (size, number of hash functions, FPR, number
> of probes, ...). This may require improvements to lib/bloomfilter, which
> currently does not expose some of the details.
>

Along with the new patch set, we have added information showing
which node is building/using a bloom filter (as well as the
corresponding expressions), plus some basic bloom filter stats. We'll
add more related information (e.g. FPR) as we modify the
lib/bloomfilter implementation in future work.
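
One way an FPR estimate could be derived for EXPLAIN, as a sketch:
bloom_prop_bits_set() already exists in lib/bloomfilter, but the number of
hash functions is currently private to the library, so exposing it (or a
helper like the one below) would be part of the lib/bloomfilter changes
mentioned above.

#include <math.h>

#include "postgres.h"
#include "lib/bloomfilter.h"

/*
 * Usual approximation: FPR ~= (fraction of bits set) ^ (number of hash
 * functions).  k_hashes would have to be exposed by lib/bloomfilter.
 */
static double
estimate_bloom_fpr(bloom_filter *filter, int k_hashes)
{
    return pow(bloom_prop_bits_set(filter), k_hashes);
}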


Thanks again for the great reviews; they are really useful! More
feedback is always welcome and appreciated!

Regards,
Lyu Pan
Amazon RDS/Aurora for PostgreSQL

Hi,
For v1-0001-Support-semijoin-filter-in-the-planner-optimizer.patch :

+
+       /* want at least 1000 rows_filtered to avoid any nasty edge cases */
+       if (force_mergejoin_semijoin_filter ||
+           (filtering_rate >= 0.35 && rows_filtered > 1000 && best_filter_clause >= 0))

Currently rows_filtered is compared with a constant; should the constant be made configurable ?
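
If it were made configurable, a sketch might look like this (the variable /
GUC name is hypothetical; it would be registered like other planner tuning
parameters):

int         semijoin_filter_min_rows = 1000;    /* hypothetical GUC variable */

    /* want at least semijoin_filter_min_rows rows_filtered to avoid any
     * nasty edge cases */
    if (force_mergejoin_semijoin_filter ||
        (filtering_rate >= 0.35 &&
         rows_filtered > semijoin_filter_min_rows &&
         best_filter_clause >= 0))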

+            * improvement of 19.5%. This equation also concludes thata a 17%

Typo: thata

+   inner_md_array = palloc(sizeof(SemiJoinFilterExprMetadata) * num_md);
+   if (!outer_md_array || !inner_md_array)
+   {
+       return 0;               /* a stack array allocation failed  */

Should the allocated array be freed before returning ?
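
For example, something along these lines (note that palloc() reports an error
rather than returning NULL, so the NULL checks themselves may be unnecessary):

    if (!outer_md_array || !inner_md_array)
    {
        if (outer_md_array)
            pfree(outer_md_array);
        if (inner_md_array)
            pfree(inner_md_array);
        return 0;
    }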

For verify_valid_pushdown(), the parameters in the comment don't match the actual parameters. Please update the comment.

For is_fk_pk(), since the outer_key_list is fixed across the iterations, I think all_vars can be computed outside of expressions_match_foreign_key().
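
As a sketch, the invariant part could be hoisted into something like the
following before the loop over foreign keys (the use of pull_var_clause()
here is an assumption about how all_vars is built; the flags would depend on
what the key expressions can contain):

#include "postgres.h"
#include "nodes/pg_list.h"
#include "optimizer/optimizer.h"

/*
 * Collect the Vars of the fixed outer_key_list once, outside the
 * per-foreign-key loop, and pass the result into each
 * expressions_match_foreign_key() call.
 */
static List *
collect_outer_key_vars(List *outer_key_list)
{
    List       *all_vars = NIL;
    ListCell   *lc;

    foreach(lc, outer_key_list)
        all_vars = list_concat(all_vars,
                               pull_var_clause((Node *) lfirst(lc), 0));

    return all_vars;
}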

Cheers

Continuing with v1-0001-Support-semijoin-filter-in-the-planner-optimizer.patch 

For get_switched_clauses(), the returned List contains all the clauses, yet the name suggests that only switched clauses are returned.
Please rename the function to something like adjust_XX or rearrange_YY so that it is less likely to cause confusion.

For depth_of_semijoin_target(), idx is only used inside the `if (num_exprs == get_appendrel_occluded_references(` block; you can move its declaration inside that if block.

+   outside_subq_rte = root->simple_rte_array[outside_subq_var->varno];
+
+   /* System Vars have varattno < 0, don't bother */
+   if (outside_subq_var->varattno <= 0)
+       return 0;

Since the check on outside_subq_var->varattno doesn't depend on outside_subq_rte, you can move the assignment to outside_subq_rte after the if check.
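
I.e., something like:

    /* System Vars have varattno < 0, don't bother */
    if (outside_subq_var->varattno <= 0)
        return 0;

    outside_subq_rte = root->simple_rte_array[outside_subq_var->varno];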

Cheers
