Thread: pgsql: Use extended stats for precise estimation of bucket size in hash

pgsql: Use extended stats for precise estimation of bucket size in hash

From: Alexander Korotkov
Use extended stats for precise estimation of bucket size in hash join

Because columns in real-life tables often have functional dependencies,
PostgreSQL's estimate of the number of distinct values over a set of columns
can be too low (or, much more rarely, too high) when dealing with a
multi-clause JOIN.  In the case of hash join, this can lead to a small number
of predicted hash buckets and, as a result, to picking a non-optimal merge
join.

To improve the situation, we introduce one additional stage of bucket size
estimation: given two or more join clauses, the estimator looks up extended
statistics and uses them for multicolumn estimation.  Clauses are grouped into
lists, each containing expressions referencing the same relation.  The result
of the multicolumn estimation made over such a list is combined with others
according to the caller's logic.  Clauses that are not estimated are returned
to the caller for further estimation.
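
For readers who want a feel for the grouping step, here is a rough sketch in
Python.  It is not the code from selfuncs.c: the function name, the
(relation, column) representation of a clause, and the dictionary standing in
for extended statistics are all hypothetical, and it only accepts an exact
column-set match, which is a simplification.

```python
from collections import defaultdict

def estimate_multicolumn_ndistinct(clauses, ext_stats):
    """Hypothetical sketch, not PostgreSQL's implementation.

    clauses:   list of (relation, column) pairs, one per join clause.
    ext_stats: dict mapping (relation, frozenset of columns) -> ndistinct,
               standing in for extended statistics.
    Returns (estimates, leftover): per-relation multicolumn ndistinct values,
    and the clauses that could not be estimated this way.
    """
    # Group clauses so that each list references a single relation.
    by_rel = defaultdict(list)
    for rel, col in clauses:
        by_rel[rel].append(col)

    estimates = {}
    leftover = []
    for rel, cols in by_rel.items():
        key = (rel, frozenset(cols))
        if len(cols) >= 2 and key in ext_stats:
            # A multicolumn statistic covers this group: use it directly.
            estimates[rel] = ext_stats[key]
        else:
            # Nothing usable: hand these clauses back to the caller.
            leftover.extend((rel, c) for c in cols)
    return estimates, leftover

# Made-up data purely for illustration.
clauses = [("orders", "region"), ("orders", "city"), ("items", "sku")]
ext_stats = {("orders", frozenset({"region", "city"})): 120}
estimates, leftover = estimate_multicolumn_ndistinct(clauses, ext_stats)
```

With this made-up input, the two "orders" clauses are estimated together from
the multicolumn statistic, while the single "items" clause is returned for the
caller's usual per-clause estimation.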

Discussion: https://postgr.es/m/52257607-57f6-850d-399a-ec33a654457b%40postgrespro.ru
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Andy Fan <zhihui.fan1213@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/6bb6a62f3cc45624c601d5270673a17447734629

Modified Files
--------------
src/backend/optimizer/path/costsize.c   |  12 ++-
src/backend/utils/adt/selfuncs.c        | 175 ++++++++++++++++++++++++++++++++
src/include/utils/selfuncs.h            |   4 +
src/test/regress/expected/stats_ext.out |  45 ++++++++
src/test/regress/sql/stats_ext.sql      |  29 ++++++
5 files changed, 264 insertions(+), 1 deletion(-)


Hi,

On Mon, 10 Mar 2025 at 22:18, Alexander Korotkov <akorotkov@postgresql.org> wrote:
>
> Use extended stats for precise estimation of bucket size in hash join


After this commit, I see a recurrence of an error similar to the one fixed in e28033fe1af8037e0fec8bb3a32fabbe18ac06b1

Re: pgsql: Use extended stats for precise estimation of bucket size in hash

From: Alexander Korotkov
On Thu, Apr 10, 2025 at 3:37 AM Robins Tharakan <tharakan@gmail.com> wrote:
> On Mon, 10 Mar 2025 at 22:18, Alexander Korotkov <akorotkov@postgresql.org> wrote:
> >
> > Use extended stats for precise estimation of bucket size in hash join
>
>
> After this commit, I see a recurrence of an error similar to the one fixed in
> e28033fe1af8037e0fec8bb3a32fabbe18ac06b1
>
> https://www.postgresql.org/message-id/18885-da51324078588253%40postgresql.org

Thank you, I'm looking into that.

------
Regards,
Alexander Korotkov
Supabase