Thread: MergeJoin beats HashJoin in the case of multiple hash clauses

MergeJoin beats HashJoin in the case of multiple hash clauses

From

Andrey Lepikhov

Date:

15 June 2023, 08:30:10

Hi, all.

Some of my clients use JOIN's with three - four clauses. Quite 
frequently, I see complaints on unreasonable switch of JOIN algorithm to 
Merge Join instead of Hash Join. Quick research have shown one weak 
place - estimation of an average bucket size in final_cost_hashjoin (see 
q2.sql in attachment) with very conservative strategy.
Unlike estimation of groups, here we use smallest ndistinct value across 
all buckets instead of multiplying them (or trying to make multivariate 
analysis).
It works fine for the case of one clause. But if we have many clauses, 
and if each has high value of ndistinct, we will overestimate average 
size of a bucket and, as a result, prefer to use Merge Join. As the 
example in attachment shows, it leads to worse plan than possible, 
sometimes drastically worse.
I assume, this is done with fear of functional dependencies between hash 
clause components. But as for me, here we should go the same way, as 
estimation of groups.
The attached patch shows a sketch of the solution.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Alena Rybakina

Date:

28 June 2023, 13:53:06

Hi!

On 15.06.2023 11:30, Andrey Lepikhov wrote:

Hi, all.

Some of my clients use JOIN's with three - four clauses. Quite frequently, I see complaints on unreasonable switch of JOIN algorithm to Merge Join instead of Hash Join. Quick research have shown one weak place - estimation of an average bucket size in final_cost_hashjoin (see q2.sql in attachment) with very conservative strategy.
Unlike estimation of groups, here we use smallest ndistinct value across all buckets instead of multiplying them (or trying to make multivariate analysis).
It works fine for the case of one clause. But if we have many clauses, and if each has high value of ndistinct, we will overestimate average size of a bucket and, as a result, prefer to use Merge Join. As the example in attachment shows, it leads to worse plan than possible, sometimes drastically worse.
I assume, this is done with fear of functional dependencies between hash clause components. But as for me, here we should go the same way, as estimation of groups.
The attached patch shows a sketch of the solution.

This problem is very important.

Honestly, I'm still learning your code and looking for cases on which cases your patch can affect for the worse or for the better. But I have already found something that seemed interesting to me. I have found several other interesting cases where your patch can solve some problem in order to choose a more correct plan, but in focus on memory consumption.
To make it easier to evaluate, I added a hook to your patch that makes it easier to switch to your or the original way of estimating the size of baskets (diff_estimate.diff).

Here are other cases where your fix improves the query plan.

First of all, I changed the way creation of tables are created to look at the behavior of the query plan in terms of planning and execution time:

DROP TABLE IF EXISTS a,b CASCADE;
CREATE TABLE a AS
SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
FROM generate_series(1,1e5) AS gs;
CREATE TABLE b AS
SELECT gs % 90 AS x, gs % 49 AS y, gs %100 AS z, 'abc' || gs AS payload
FROM generate_series(1,1e5) AS gs;
ANALYZE a,b;

SET enable_cost_size = 'on';
EXPLAIN ANALYZE
SELECT * FROM a,b
WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;

SET enable_cost_size = 'off';
EXPLAIN ANALYZE
SELECT * FROM a,b
WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;

                                QUERY PLAN
---------------------------------------------------------------------------
Hash Join (actual time=200.872..200.879 rows=0 loops=1)
   Hash Cond: ((b.x = a.x) AND (b.y = a.y) AND (b.z = a.z))
   -> Seq Scan on b (actual time=0.029..15.946 rows=100000 loops=1)
   -> Hash (actual time=97.645..97.649 rows=100000 loops=1)
         Buckets: 131072 Batches: 1 Memory Usage: 5612kB
         -> Seq Scan on a (actual time=0.024..17.153 rows=100000 loops=1)
Planning Time: 2.910 ms
Execution Time: 201.949 ms
(8 rows)

SET
                                QUERY PLAN
---------------------------------------------------------------------------
Merge Join (actual time=687.415..687.416 rows=0 loops=1)
   Merge Cond: ((b.y = a.y) AND (b.x = a.x) AND (b.z = a.z))
   -> Sort (actual time=462.022..536.716 rows=100000 loops=1)
         Sort Key: b.y, b.x, b.z
         Sort Method: external merge Disk: 3328kB
         -> Seq Scan on b (actual time=0.017..12.326 rows=100000 loops=1)
   -> Sort (actual time=111.295..113.196 rows=16001 loops=1)
         Sort Key: a.y, a.x, a.z
         Sort Method: external sort Disk: 2840kB
         -> Seq Scan on a (actual time=0.020..10.129 rows=100000 loops=1)
Planning Time: 0.752 ms
Execution Time: 688.829 ms
(12 rows)

Secondly, I found another case that is not related to the fact that the planner would prefer to choose merge join rather than hash join, but we have the opportunity to see that the plan has become better due to the consumption of less memory, and also takes less planning time.

Here, with the same query, the planning time was reduced by 5 times, and the number of buckets by 128 times, therefore, memory consumption also decreased:

                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Hash Join (cost=20.50..3157.58 rows=8 width=32) (actual time=95.648..95.651 rows=0 loops=1)
   Hash Cond: ((b.x = (a.x)::numeric) AND (b.y = (a.y)::numeric) AND (b.z = (a.z)::numeric))
   -> Seq Scan on b (cost=0.00..1637.00 rows=100000 width=20) (actual time=0.027..17.980 rows=100000 loops=1)
   -> Hash (cost=10.00..10.00 rows=600 width=12) (actual time=2.046..2.047 rows=600 loops=1)
         Buckets: 1024 Batches: 1 Memory Usage: 34kB
         -> Seq Scan on a (cost=0.00..10.00 rows=600 width=12) (actual time=0.022..0.315 rows=600 loops=1)
Planning Time: 0.631 ms
Execution Time: 95.730 ms
(8 rows)

SET
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Hash Join (cost=3387.00..8621.58 rows=8 width=32) (actual time=102.873..102.877 rows=0 loops=1)
   Hash Cond: (((a.x)::numeric = b.x) AND ((a.y)::numeric = b.y) AND ((a.z)::numeric = b.z))
   -> Seq Scan on a (cost=0.00..10.00 rows=600 width=12) (actual time=0.014..0.131 rows=600 loops=1)
   -> Hash (cost=1637.00..1637.00 rows=100000 width=20) (actual time=101.920..101.921 rows=100000 loops=1)
         Buckets: 131072 Batches: 1 Memory Usage: 6474kB
         -> Seq Scan on b (cost=0.00..1637.00 rows=100000 width=20) (actual time=0.013..16.349 rows=100000 loops=1)
Planning Time: 0.153 ms
Execution Time: 103.518 ms
(8 rows)

I also give an improvement relative to the left external or right connection:

DROP TABLE IF EXISTS a,b CASCADE;

CREATE TABLE a AS
SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
FROM generate_series(1,600) AS gs;
CREATE TABLE b AS
SELECT gs % 90 AS x, gs % 49 AS y, gs %100 AS z, 'abc' || gs AS payload
FROM generate_series(1,1e5) AS gs;
ANALYZE a,b;

SET enable_cost_size = 'on';

EXPLAIN ANALYZE
SELECT * FROM a right join b
on a.x=b.x AND a.y=b.y AND a.z=b.z;

SET enable_cost_size = 'off';
EXPLAIN ANALYZE
SELECT * FROM a right join b
on a.x=b.x AND a.y=b.y AND a.z=b.z;

                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=20.50..3157.58 rows=100000 width=32) (actual time=1.846..102.264 rows=100000 loops=1)
   Hash Cond: ((b.x = (a.x)::numeric) AND (b.y = (a.y)::numeric) AND (b.z = (a.z)::numeric))
   -> Seq Scan on b (cost=0.00..1637.00 rows=100000 width=20) (actual time=0.041..15.328 rows=100000 loops=1)
   -> Hash (cost=10.00..10.00 rows=600 width=12) (actual time=1.780..1.781 rows=600 loops=1)
         Buckets: 1024 Batches: 1 Memory Usage: 34kB
         -> Seq Scan on a (cost=0.00..10.00 rows=600 width=12) (actual time=0.031..0.252 rows=600 loops=1)
Planning Time: 0.492 ms
Execution Time: 107.609 ms
(8 rows)

SET
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Hash Right Join (cost=3387.00..8500.08 rows=100000 width=32) (actual time=80.919..101.613 rows=100000 loops=1)
   Hash Cond: (((a.x)::numeric = b.x) AND ((a.y)::numeric = b.y) AND ((a.z)::numeric = b.z))
   -> Seq Scan on a (cost=0.00..10.00 rows=600 width=12) (actual time=0.017..0.084 rows=600 loops=1)
   -> Hash (cost=1637.00..1637.00 rows=100000 width=20) (actual time=80.122..80.123 rows=100000 loops=1)
         Buckets: 131072 Batches: 1 Memory Usage: 6474kB
         -> Seq Scan on b (cost=0.00..1637.00 rows=100000 width=20) (actual time=0.015..11.819 rows=100000 loops=1)
Planning Time: 0.194 ms
Execution Time: 104.662 ms
(8 rows)

-- 
Regards,
Alena Rybakina
Postgres Professional

Attachment

diff_estimate.diff

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Bruce Momjian

Date:

07 September 2023, 18:08:56

Does anyone else have an opinion on this patch?  It looks promising.

---------------------------------------------------------------------------

On Wed, Jun 28, 2023 at 04:53:06PM +0300, Alena Rybakina wrote:
> Hi!
> 
> On 15.06.2023 11:30, Andrey Lepikhov wrote:
> 
>     Hi, all.
> 
>     Some of my clients use JOIN's with three - four clauses. Quite frequently,
>     I see complaints on unreasonable switch of JOIN algorithm to Merge Join
>     instead of Hash Join. Quick research have shown one weak place - estimation
>     of an average bucket size in final_cost_hashjoin (see q2.sql in attachment)
>     with very conservative strategy.
>     Unlike estimation of groups, here we use smallest ndistinct value across
>     all buckets instead of multiplying them (or trying to make multivariate
>     analysis).
>     It works fine for the case of one clause. But if we have many clauses, and
>     if each has high value of ndistinct, we will overestimate average size of a
>     bucket and, as a result, prefer to use Merge Join. As the example in
>     attachment shows, it leads to worse plan than possible, sometimes
>     drastically worse.
>     I assume, this is done with fear of functional dependencies between hash
>     clause components. But as for me, here we should go the same way, as
>     estimation of groups.
>     The attached patch shows a sketch of the solution.
> 
> 
> This problem is very important.
> 
> Honestly, I'm still learning your code and looking for cases on which cases
> your patch can affect for the worse or for the better. But I have already found
> something that seemed interesting to me. I have found several other interesting
> cases where your patch can solve some problem in order to choose a more correct
> plan, but in focus on memory consumption.
> 
> To make it easier to evaluate, I added a hook to your patch that makes it
> easier to switch to your or the original way of estimating the size of baskets
> (diff_estimate.diff).
> 
> Here are other cases where your fix improves the query plan.
> 
> 
> First of all, I changed the way creation of tables are created to look at the
> behavior of the query plan in terms of planning and execution time:
> 
> DROP TABLE IF EXISTS a,b CASCADE;
> CREATE TABLE a AS
>   SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
>   FROM generate_series(1,1e5) AS gs;
> CREATE TABLE b AS
>   SELECT gs % 90 AS x, gs % 49 AS y, gs %100 AS z, 'abc' || gs AS payload
>   FROM generate_series(1,1e5) AS gs;
> ANALYZE a,b;
> 
> SET enable_cost_size = 'on';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
> 
> SET enable_cost_size = 'off';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
> 
> 
>                                 QUERY PLAN                                 
> ---------------------------------------------------------------------------
>  Hash Join (actual time=200.872..200.879 rows=0 loops=1)
>    Hash Cond: ((b.x = a.x) AND (b.y = a.y) AND (b.z = a.z))
>    ->  Seq Scan on b (actual time=0.029..15.946 rows=100000 loops=1)
>    ->  Hash (actual time=97.645..97.649 rows=100000 loops=1)
>          Buckets: 131072  Batches: 1  Memory Usage: 5612kB
>          ->  Seq Scan on a (actual time=0.024..17.153 rows=100000 loops=1)
>  Planning Time: 2.910 ms
>  Execution Time: 201.949 ms
> (8 rows)
> 
> SET
>                                 QUERY PLAN                                 
> ---------------------------------------------------------------------------
>  Merge Join (actual time=687.415..687.416 rows=0 loops=1)
>    Merge Cond: ((b.y = a.y) AND (b.x = a.x) AND (b.z = a.z))
>    ->  Sort (actual time=462.022..536.716 rows=100000 loops=1)
>          Sort Key: b.y, b.x, b.z
>          Sort Method: external merge  Disk: 3328kB
>          ->  Seq Scan on b (actual time=0.017..12.326 rows=100000 loops=1)
>    ->  Sort (actual time=111.295..113.196 rows=16001 loops=1)
>          Sort Key: a.y, a.x, a.z
>          Sort Method: external sort  Disk: 2840kB
>          ->  Seq Scan on a (actual time=0.020..10.129 rows=100000 loops=1)
>  Planning Time: 0.752 ms
>  Execution Time: 688.829 ms
> (12 rows)
> 
> Secondly, I found another case that is not related to the fact that the planner
> would prefer to choose merge join rather than hash join, but we have the
> opportunity to see that the plan has become better due to the consumption of
> less memory, and also takes less planning time.
> 
> Here, with the same query, the planning time was reduced by 5 times, and the
> number of buckets by 128 times, therefore, memory consumption also decreased:
> 
> DROP TABLE IF EXISTS a,b CASCADE;
> 
> CREATE TABLE a AS
>   SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
>   FROM generate_series(1,600) AS gs;
> CREATE TABLE b AS
>   SELECT gs % 90 AS x, gs % 49 AS y, gs %100 AS z, 'abc' || gs AS payload
>   FROM generate_series(1,1e5) AS gs;
> ANALYZE a,b;
> 
> SET enable_cost_size = 'on';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
> 
> SET enable_cost_size = 'off';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
> 
>                                                    QUERY
> PLAN                                                   
> ----------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=20.50..3157.58 rows=8 width=32) (actual time=95.648..95.651
> rows=0 loops=1)
>    Hash Cond: ((b.x = (a.x)::numeric) AND (b.y = (a.y)::numeric) AND (b.z =
> (a.z)::numeric))
>    ->  Seq Scan on b  (cost=0.00..1637.00 rows=100000 width=20) (actual time=
> 0.027..17.980 rows=100000 loops=1)
>    ->  Hash  (cost=10.00..10.00 rows=600 width=12) (actual time=2.046..2.047
> rows=600 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 34kB
>          ->  Seq Scan on a  (cost=0.00..10.00 rows=600 width=12) (actual time=
> 0.022..0.315 rows=600 loops=1)
>  Planning Time: 0.631 ms
>  Execution Time: 95.730 ms
> (8 rows)
> 
> SET
>                                                       QUERY
> PLAN                                                      
>
----------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=3387.00..8621.58 rows=8 width=32) (actual time=
> 102.873..102.877 rows=0 loops=1)
>    Hash Cond: (((a.x)::numeric = b.x) AND ((a.y)::numeric = b.y) AND
> ((a.z)::numeric = b.z))
>    ->  Seq Scan on a  (cost=0.00..10.00 rows=600 width=12) (actual time=
> 0.014..0.131 rows=600 loops=1)
>    ->  Hash  (cost=1637.00..1637.00 rows=100000 width=20) (actual time=
> 101.920..101.921 rows=100000 loops=1)
>          Buckets: 131072  Batches: 1  Memory Usage: 6474kB
>          ->  Seq Scan on b  (cost=0.00..1637.00 rows=100000 width=20) (actual
> time=0.013..16.349 rows=100000 loops=1)
>  Planning Time: 0.153 ms
>  Execution Time: 103.518 ms
> (8 rows)
> 
> I also give an improvement relative to the left external or right connection:
> 
> DROP TABLE IF EXISTS a,b CASCADE;
> 
> CREATE TABLE a AS
>   SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
>   FROM generate_series(1,600) AS gs;
> CREATE TABLE b AS
>   SELECT gs % 90 AS x, gs % 49 AS y, gs %100 AS z, 'abc' || gs AS payload
>   FROM generate_series(1,1e5) AS gs;
> ANALYZE a,b;
> 
> 
> SET enable_cost_size = 'on';
> 
> EXPLAIN ANALYZE
> SELECT * FROM a right join b
> on a.x=b.x AND a.y=b.y AND a.z=b.z;
> 
> SET enable_cost_size = 'off';
> EXPLAIN ANALYZE
> SELECT * FROM a right join b
> on a.x=b.x AND a.y=b.y AND a.z=b.z;
> 
>                                                    QUERY
> PLAN                                                   
> ----------------------------------------------------------------------------------------------------------------
>  Hash Left Join  (cost=20.50..3157.58 rows=100000 width=32) (actual time=
> 1.846..102.264 rows=100000 loops=1)
>    Hash Cond: ((b.x = (a.x)::numeric) AND (b.y = (a.y)::numeric) AND (b.z =
> (a.z)::numeric))
>    ->  Seq Scan on b  (cost=0.00..1637.00 rows=100000 width=20) (actual time=
> 0.041..15.328 rows=100000 loops=1)
>    ->  Hash  (cost=10.00..10.00 rows=600 width=12) (actual time=1.780..1.781
> rows=600 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 34kB
>          ->  Seq Scan on a  (cost=0.00..10.00 rows=600 width=12) (actual time=
> 0.031..0.252 rows=600 loops=1)
>  Planning Time: 0.492 ms
>  Execution Time: 107.609 ms
> (8 rows)
> 
> SET
>                                                       QUERY
> PLAN                                                      
>
----------------------------------------------------------------------------------------------------------------------
>  Hash Right Join  (cost=3387.00..8500.08 rows=100000 width=32) (actual time=
> 80.919..101.613 rows=100000 loops=1)
>    Hash Cond: (((a.x)::numeric = b.x) AND ((a.y)::numeric = b.y) AND
> ((a.z)::numeric = b.z))
>    ->  Seq Scan on a  (cost=0.00..10.00 rows=600 width=12) (actual time=
> 0.017..0.084 rows=600 loops=1)
>    ->  Hash  (cost=1637.00..1637.00 rows=100000 width=20) (actual time=
> 80.122..80.123 rows=100000 loops=1)
>          Buckets: 131072  Batches: 1  Memory Usage: 6474kB
>          ->  Seq Scan on b  (cost=0.00..1637.00 rows=100000 width=20) (actual
> time=0.015..11.819 rows=100000 loops=1)
>  Planning Time: 0.194 ms
>  Execution Time: 104.662 ms
> (8 rows)
> 
> --
> Regards,
> Alena Rybakina
> Postgres Professional
> 

> diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
> index ef475d95a18..31771dfba46 100644
> --- a/src/backend/optimizer/path/costsize.c
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -153,6 +153,7 @@ bool        enable_parallel_hash = true;
>  bool        enable_partition_pruning = true;
>  bool        enable_presorted_aggregate = true;
>  bool        enable_async_append = true;
> +bool         enable_cost_size = true;
>  
>  typedef struct
>  {
> @@ -4033,11 +4034,22 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
>                  thismcvfreq = restrictinfo->left_mcvfreq;
>              }
>  
> +            if (enable_cost_size)
> +            {
> +                innerbucketsize *= thisbucketsize;
> +                innermcvfreq *= thismcvfreq;
> +            }
> +            else
> +            {
>              if (innerbucketsize > thisbucketsize)
>                  innerbucketsize = thisbucketsize;
>              if (innermcvfreq > thismcvfreq)
>                  innermcvfreq = thismcvfreq;
> +            }
>          }
> +
> +        if (enable_cost_size && innerbucketsize > virtualbuckets)
> +            innerbucketsize = 1.0 / virtualbuckets;
>      }
>  
>      /*
> diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
> index 71e27f8eb05..ded9ba3b7a9 100644
> --- a/src/backend/utils/misc/guc_tables.c
> +++ b/src/backend/utils/misc/guc_tables.c
> @@ -1007,6 +1007,19 @@ struct config_bool ConfigureNamesBool[] =
>          true,
>          NULL, NULL, NULL
>      },
> +    {
> +        {"enable_cost_size", PGC_USERSET, QUERY_TUNING_OTHER,
> +            gettext_noop("set the optimizer coefficient"
> +                         "so that custom or generic plan is selected more often. "
> +                         "by default, the value is set to 1, which means that "
> +                         "the choice of using both depends on the calculated cost"),
> +            NULL,
> +            GUC_EXPLAIN
> +        },
> +        &enable_cost_size,
> +        true,
> +        NULL, NULL, NULL
> +    },
>      {
>          {"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
>              gettext_noop("Enables the planner's use of async append plans."),
> diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
> index 6cf49705d3a..c79ec12e6d5 100644
> --- a/src/include/optimizer/cost.h
> +++ b/src/include/optimizer/cost.h
> @@ -71,6 +71,7 @@ extern PGDLLIMPORT bool enable_partition_pruning;
>  extern PGDLLIMPORT bool enable_presorted_aggregate;
>  extern PGDLLIMPORT bool enable_async_append;
>  extern PGDLLIMPORT int constraint_exclusion;
> +extern PGDLLIMPORT bool enable_cost_size;
>  
>  extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
>                                    double index_pages, PlannerInfo *root);


-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Andy Fan

Date:

11 September 2023, 04:51:03

Hi,

On Thu, Jun 15, 2023 at 4:30 PM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:

Hi, all.

Some of my clients use JOIN's with three - four clauses. Quite
frequently, I see complaints on unreasonable switch of JOIN algorithm to
Merge Join instead of Hash Join. Quick research have shown one weak
place - estimation of an average bucket size in final_cost_hashjoin (see
q2.sql in attachment) with very conservative strategy.
Unlike estimation of groups, here we use smallest ndistinct value across
all buckets instead of multiplying them (or trying to make multivariate
analysis).
It works fine for the case of one clause. But if we have many clauses,
and if each has high value of ndistinct, we will overestimate average
size of a bucket and, as a result, prefer to use Merge Join. As the
example in attachment shows, it leads to worse plan than possible,
sometimes drastically worse.
I assume, this is done with fear of functional dependencies between hash
clause components. But as for me, here we should go the same way, as
estimation of groups.

I can reproduce the visitation you want to improve and verify the patch

can do it expectedly. I think this is a right thing to do.

The attached patch shows a sketch of the solution.

I understand that this is a sketch of the solution, but the below changes still

make me confused.

+ if (innerbucketsize > virtualbuckets)

+ innerbucketsize = 1.0 / virtualbuckets;

innerbucketsize is a fraction of rows in all the rows, so it is between 0.0 and 1.0.

and virtualbuckets is the number of buckets in total (when considered the mutli

batchs), how is it possible for 'innerbucketsize > virtualbuckets' ? Am

I missing something?

Best Regards

Andy Fan

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

"Lepikhov Andrei"

Date:

11 September 2023, 08:04:22


On Mon, Sep 11, 2023, at 11:51 AM, Andy Fan wrote:
> Hi,
>
> On Thu, Jun 15, 2023 at 4:30 PM Andrey Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> Hi, all.
>>
>> Some of my clients use JOIN's with three - four clauses. Quite
>> frequently, I see complaints on unreasonable switch of JOIN algorithm to
>> Merge Join instead of Hash Join. Quick research have shown one weak
>> place - estimation of an average bucket size in final_cost_hashjoin (see
>> q2.sql in attachment) with very conservative strategy.
>> Unlike estimation of groups, here we use smallest ndistinct value across
>> all buckets instead of multiplying them (or trying to make multivariate
>> analysis).
>> It works fine for the case of one clause. But if we have many clauses,
>> and if each has high value of ndistinct, we will overestimate average
>> size of a bucket and, as a result, prefer to use Merge Join. As the
>> example in attachment shows, it leads to worse plan than possible,
>> sometimes drastically worse.
>> I assume, this is done with fear of functional dependencies between hash
>> clause components. But as for me, here we should go the same way, as
>> estimation of groups.
>
> I can reproduce the visitation you want to improve and verify the patch
> can do it expectedly.  I think this is a right thing to do.
>
>> The attached patch shows a sketch of the solution.
>
> I understand that this is a sketch of the solution,  but the  below
> changes still
> make me confused.
>
> + if (innerbucketsize > virtualbuckets)
> +     innerbucketsize = 1.0 / virtualbuckets;
>
> innerbucketsize is a fraction of rows in all the rows, so it is between
> 0.0 and 1.0.
> and virtualbuckets is the number of buckets in total (when considered
> the mutli
> batchs),  how is it possible for 'innerbucketsize > virtualbuckets' ?
> Am
> I missing something?

You are right here. I've made a mistake here. Changed diff is in attachment.

--
Regards,
Andrei Lepikhov

Attachment

fix_bucketsize_v2.diff

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Tomas Vondra

Date:

03 November 2023, 16:43:26

On 9/11/23 10:04, Lepikhov Andrei wrote:
> 
> 
> On Mon, Sep 11, 2023, at 11:51 AM, Andy Fan wrote:
>> Hi, 
>>
>> On Thu, Jun 15, 2023 at 4:30 PM Andrey Lepikhov 
>> <a.lepikhov@postgrespro.ru> wrote:
>>> Hi, all.
>>>
>>> Some of my clients use JOIN's with three - four clauses. Quite 
>>> frequently, I see complaints on unreasonable switch of JOIN algorithm to 
>>> Merge Join instead of Hash Join. Quick research have shown one weak 
>>> place - estimation of an average bucket size in final_cost_hashjoin (see 
>>> q2.sql in attachment) with very conservative strategy.
>>> Unlike estimation of groups, here we use smallest ndistinct value across 
>>> all buckets instead of multiplying them (or trying to make multivariate 
>>> analysis).
>>> It works fine for the case of one clause. But if we have many clauses, 
>>> and if each has high value of ndistinct, we will overestimate average 
>>> size of a bucket and, as a result, prefer to use Merge Join. As the 
>>> example in attachment shows, it leads to worse plan than possible, 
>>> sometimes drastically worse.
>>> I assume, this is done with fear of functional dependencies between hash 
>>> clause components. But as for me, here we should go the same way, as 
>>> estimation of groups.
>>

Yes, this analysis is correct - final_cost_hashjoin assumes the clauses
may be correlated (not necessarily by functional dependencies, just that
the overall ndistinct is not a simple product of per-column ndistincts).

And it even says so in the comment before calculating bucket size:

 * Determine bucketsize fraction and MCV frequency for the inner
 * relation. We use the smallest bucketsize or MCV frequency estimated
 * for any individual hashclause; this is undoubtedly conservative.

I'm sure this may lead to inflated cost for "good" cases (where the
actual bucket size really is a product), which may push the optimizer to
use the less efficient/slower join method.

Unfortunately, AFAICS the patch simply assumes the extreme in the
opposite direction - it assumes each clause splits the bucket for each
distinct value in the column. Which works great when it's true, but
surely it'd have issues when the columns are correlated?

I think this deserves more discussion, i.e. what happens if the
assumptions do not hold? We know what happens for the conservative
approach, but what's the worst thing that would happen for the
optimistic one?

I doubt e can simply switch from the conservative approach to the
optimistic one. Yes, it'll make some queries faster, but for other
queries it likely causes problems and slowdowns.

IMHO the only principled way forward is to get a better ndistinct
estimate (which this implicitly does), perhaps by using extended
statistics. I haven't tried, but I guess it'd need to extract the
clauses for the inner side, and call estimate_num_groups() on it.

This however reminds me we don't use extended statistics for join
clauses at all. Which means that even with accurate extended statistics,
we can still get stuff like this for multiple join clauses:

   Hash Join  (cost=1317.00..2386.00 rows=200 width=24)
              (actual time=85.781..8574.784 rows=8000000 loops=1)

This is unrelated to the issue discussed here, of course, as it won't
affect join method selection for that join. But it certainly will affect
all estimates/costs above that join, which can be pretty disastrous.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Andrei Lepikhov

Date:

08 July 2024, 12:45:15

On 3/11/2023 23:43, Tomas Vondra wrote:
> On 9/11/23 10:04, Lepikhov Andrei wrote:
>   * Determine bucketsize fraction and MCV frequency for the inner
>   * relation. We use the smallest bucketsize or MCV frequency estimated
>   * for any individual hashclause; this is undoubtedly conservative.
> 
> I'm sure this may lead to inflated cost for "good" cases (where the
> actual bucket size really is a product), which may push the optimizer to
> use the less efficient/slower join method.
Yes, It was contradictory idea, though.
> IMHO the only principled way forward is to get a better ndistinct
> estimate (which this implicitly does), perhaps by using extended
> statistics. I haven't tried, but I guess it'd need to extract the
> clauses for the inner side, and call estimate_num_groups() on it.
And I've done it. Sorry for so long response. This patch employs of 
extended statistics for estimation of the HashJoin bucket_size. In 
addition, I describe the idea in more convenient form here [1].
Obviously, it needs the only ndistinct to make a prediction that allows 
to reduce computational cost of this statistic.

[1] 
https://open.substack.com/pub/danolivo/p/why-postgresql-prefers-mergejoin?r=34q1yy&utm_campaign=post&utm_medium=web

-- 
regards, Andrei Lepikhov

Attachment

0001-Use-extended-statistics-for-precise-estimation-of-bu.patch

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Andrei Lepikhov

Date:

03 March, 11:24:40

On 17/2/2025 01:34, Alexander Korotkov wrote:
> Hi, Andrei!
> 
> On Tue, Oct 8, 2024 at 8:00 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
> Thank you for your work on this subject.  I agree with the general
> direction.  While everyone has used conservative estimates for a long
> time, it's better to change them only when we're sure about it.
> However, I'm still not sure I get the conservatism.
> 
> if (innerbucketsize > thisbucketsize)
>      innerbucketsize = thisbucketsize;
> if (innermcvfreq > thismcvfreq)
>     innermcvfreq = thismcvfreq;
> 
> IFAICS, even in the worst case (all columns are totally correlated),
> the overall bucket size should be the smallest bucket size among
> clauses (not the largest).  And the same is true of MCV.  As a mental
> experiment, we can add a new clause to hash join, which is always true
> because columns on both sides have the same value.  In fact, it would
> have almost no influence except for the cost of extracting additional
> columns and the cost of executing additional operators.  But in the
> current model, this additional clause would completely ruin
> thisbucketsize and thismcvfreq, making hash join extremely
> unappealing.  Should we still revise this to calculate minimum instead
> of maximum?
I agree with your point. But I think the code works precisely the way 
you have described.
> 
> I've slightly revised the patch.  I've run pg_indent and renamed
> s/saveList/origin_rinfos/g for better readability.
Thank You!
> 
> Also, the patch badly needs regression test coverage.  We can't
> include costs in expected outputs.  But that could be some plans,
> which previously were reliably merge joins but now become reliable
> hash joins.
I added one test here. Writing more tests on this feature is hard, but 
feature [1] may provide us with additional tools to reveal extended stat 
internals. I also have thought about injection points, but it seems an 
over-complication.

[1] Showing applied extended statistics in explain Part 2

https://www.postgresql.org/message-id/flat/TYYPR01MB82310B308BA8770838F681619E5E2%40TYYPR01MB8231.jpnprd01.prod.outlook.com

-- 
regards, Andrei Lepikhov

Attachment

v3-0001-Use-extended-statistics-for-precise-estimation-of.patch

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Alexander Korotkov

Date:

05 March, 05:43:28

On Mon, Mar 3, 2025 at 10:24 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
> On 17/2/2025 01:34, Alexander Korotkov wrote:
> > Hi, Andrei!
> >
> > On Tue, Oct 8, 2024 at 8:00 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
> > Thank you for your work on this subject.  I agree with the general
> > direction.  While everyone has used conservative estimates for a long
> > time, it's better to change them only when we're sure about it.
> > However, I'm still not sure I get the conservatism.
> >
> > if (innerbucketsize > thisbucketsize)
> >      innerbucketsize = thisbucketsize;
> > if (innermcvfreq > thismcvfreq)
> >     innermcvfreq = thismcvfreq;
> >
> > IFAICS, even in the worst case (all columns are totally correlated),
> > the overall bucket size should be the smallest bucket size among
> > clauses (not the largest).  And the same is true of MCV.  As a mental
> > experiment, we can add a new clause to hash join, which is always true
> > because columns on both sides have the same value.  In fact, it would
> > have almost no influence except for the cost of extracting additional
> > columns and the cost of executing additional operators.  But in the
> > current model, this additional clause would completely ruin
> > thisbucketsize and thismcvfreq, making hash join extremely
> > unappealing.  Should we still revise this to calculate minimum instead
> > of maximum?
> I agree with your point. But I think the code works precisely the way
> you have described.

You're right.  I just messed up with the sides of comparison operator.

------
Regards,
Alexander Korotkov
Supabase

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Alexander Korotkov

Date:

09 March, 15:13:52

On Wed, Mar 5, 2025 at 4:43 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Mon, Mar 3, 2025 at 10:24 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
> > On 17/2/2025 01:34, Alexander Korotkov wrote:
> > > Hi, Andrei!
> > >
> > > On Tue, Oct 8, 2024 at 8:00 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
> > > Thank you for your work on this subject.  I agree with the general
> > > direction.  While everyone has used conservative estimates for a long
> > > time, it's better to change them only when we're sure about it.
> > > However, I'm still not sure I get the conservatism.
> > >
> > > if (innerbucketsize > thisbucketsize)
> > >      innerbucketsize = thisbucketsize;
> > > if (innermcvfreq > thismcvfreq)
> > >     innermcvfreq = thismcvfreq;
> > >
> > > IFAICS, even in the worst case (all columns are totally correlated),
> > > the overall bucket size should be the smallest bucket size among
> > > clauses (not the largest).  And the same is true of MCV.  As a mental
> > > experiment, we can add a new clause to hash join, which is always true
> > > because columns on both sides have the same value.  In fact, it would
> > > have almost no influence except for the cost of extracting additional
> > > columns and the cost of executing additional operators.  But in the
> > > current model, this additional clause would completely ruin
> > > thisbucketsize and thismcvfreq, making hash join extremely
> > > unappealing.  Should we still revise this to calculate minimum instead
> > > of maximum?
> > I agree with your point. But I think the code works precisely the way
> > you have described.
>
> You're right.  I just messed up with the sides of comparison operator.

I've revised commit message, comments, formatting etc.
I'm going to push this if no objections.

------
Regards,
Alexander Korotkov
Supabase

Attachment

v3-0001-Use-extended-stats-for-precise-estimation-of-buck.patch

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Andres Freund

Date:

11 April, 01:27:26

Hi,

On 2025-03-09 14:13:52 +0200, Alexander Korotkov wrote:
> I've revised commit message, comments, formatting etc.
> I'm going to push this if no objections.

I'm rather confused as to why this is a thing to push at this point? This
doesn't seem to be a bugfix and it's post feature freeze.

Andres

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Matthias van de Meent

Date:

11 April, 01:47:19

On Fri, 11 Apr 2025 at 00:27, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-03-09 14:13:52 +0200, Alexander Korotkov wrote:
> > I've revised commit message, comments, formatting etc.
> > I'm going to push this if no objections.
>
> I'm rather confused as to why this is a thing to push at this point? This
> doesn't seem to be a bugfix and it's post feature freeze.

I think the patch from that mail got committed as 6bb6a62f about a
month ago, which was shortly after Alexander's message. Did you get
confused about the month of his message, or by the incorrect state of
the CF entry?

Kind regards,

Matthias van de Meent

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Andres Freund

Date:

11 April, 05:06:32

Hi,

On 2025-04-11 00:47:19 +0200, Matthias van de Meent wrote:
> On Fri, 11 Apr 2025 at 00:27, Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2025-03-09 14:13:52 +0200, Alexander Korotkov wrote:
> > > I've revised commit message, comments, formatting etc.
> > > I'm going to push this if no objections.
> >
> > I'm rather confused as to why this is a thing to push at this point? This
> > doesn't seem to be a bugfix and it's post feature freeze.
> 
> I think the patch from that mail got committed as 6bb6a62f about a
> month ago, which was shortly after Alexander's message. Did you get
> confused about the month of his message, or by the incorrect state of
> the CF entry?

Sorry for that Alexander - for some reason the email just showed up as new in
my inbox and I only looked at the date, not the month :(

Greetings,

Andres Freund

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Alexander Korotkov

Date:

12 April, 00:42:26

On Fri, Apr 11, 2025 at 5:06 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-04-11 00:47:19 +0200, Matthias van de Meent wrote:
> > On Fri, 11 Apr 2025 at 00:27, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2025-03-09 14:13:52 +0200, Alexander Korotkov wrote:
> > > > I've revised commit message, comments, formatting etc.
> > > > I'm going to push this if no objections.
> > >
> > > I'm rather confused as to why this is a thing to push at this point? This
> > > doesn't seem to be a bugfix and it's post feature freeze.
> >
> > I think the patch from that mail got committed as 6bb6a62f about a
> > month ago, which was shortly after Alexander's message. Did you get
> > confused about the month of his message, or by the incorrect state of
> > the CF entry?
>
> Sorry for that Alexander - for some reason the email just showed up as new in
> my inbox and I only looked at the date, not the month :(

Not a problem at all!

------
Regards,
Alexander Korotkov
Supabase

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Tender Wang

Date:

14 April, 09:17:17

Hi,

While I debug hashjoin codes, in estimate_multivariate_bucketsize(), I find that

the list_copy(hashclauses) below is unnecessary if we have a single join clause.

List *clauses = list_copy(hashclauses);

...

I adjust the place of list_copy() call as the attached patch.

This can save some overhead of function calls and memory copies.

Any thoughts?

Thanks, Tender Wang

Attachment

0001-Adjust-the-place-of-list_copy-in-estimate_multivaria.patch

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From

Tender Wang

Date:

24 April, 17:07:33

Tender Wang <tndrwang@gmail.com> 于2025年4月14日周一 14:17写道：

Hi,

While I debug hashjoin codes, in estimate_multivariate_bucketsize(), I find that
the list_copy(hashclauses) below is unnecessary if we have a single join clause.

List *clauses = list_copy(hashclauses);
...

I adjust the place of list_copy() call as the attached patch.
This can save some overhead of function calls and memory copies.

Any thoughts?

Hi Alexander,

In the last thread, I found a minor optimization for the code in estimate_multivariate_bucketsize().

Adjust the place of list_copy() at the start of estimate_multivariate_bucketsize, and we can avoid unnecessarily creating a new list

and memory copy if we have only a single hash clause.

Do you think it's worth doing this?

Thanks,

Tender Wang