Re: using extended statistics to improve join estimates - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: using extended statistics to improve join estimates
Date
Msg-id b8079953-9347-468b-ada9-db26752ce00b@enterprisedb.com
Whole thread Raw
In response to Re: using extended statistics to improve join estimates  (Andrei Lepikhov <a.lepikhov@postgrespro.ru>)
List pgsql-hackers
Hi,

I finally got to do a review of the reworked patch series. For the most
part I do like the changes, although I'm not 100% sure about some of
them. I do like that the changes have been kept in separate patches,
which makes it much easier to understand what the goal is etc. But it's
probably time to start merging some of the patches back into the main
patch - it's a bit tedious work with 22 patches.

Note: This needs to be added to the next CF, so that we get cfbot
results and can focus on it in 2024-07. Also, I'd attach the patches
directly, not as .tar.

I did go though the patches one by one, and did a review for each of
them separately. I only had a couple hours for this today, so it's not
super-deep review, more a start for a discussion / asking questions.

For each patch I added a "review" and "pgindent" where review is my
comments, pgindent is the changes pgindent would do (which we now expect
to happen before commit). In hindsight I should have skipped the
pgindent, it made it a more tedious with little benefits. But I realized
that half-way through the series, so it was easier to just continue.

Let me quickly go through the original parts - most of this is already
in the "review" patches, but it's better to quote the main points here
to start a discussion. I'll omit some of the smaller suggestions, so
please look at the 'review' patches.


v20240617-0001-Estimate-joins-using-extended-statistics.patch

- rewords a couple comments, particularly for statext_find_matching_mcv

- a couple XXX comments about possibly stale/inaccurate comments

- suggestion to improve statext_determine_join_restrictions, but we one
of the later patches already does the caching


v20240617-0004-Remove-estimiatedcluases-and-varRelid-argu.patch

- I'm not sure we actually should do this (esp. the removal of
estimatedclauses bitmap). It breaks if we add the new hook.


v20240617-0007-Remove-SpecialJoinInfo-sjinfo-argument.patch
v20240617-0009-Remove-joinType-argument.patch

- I'm skeptical about removing these two. Yes, the current code does not
actually use those fields, but selfuncs.c always passes both  jointype
and sjinfo, so maybe we should do that too for consistency. What happens
if we end up wanting to call an existing selfuncs function that needs
these parameters in the future? Say because we want to call the regular
join estimator, and then apply some "correction" to the result?


v20240617-0011-use-the-pre-calculated-RestrictInfo-left-r.patch

- why not to keep the BMS_MULTIPLE check on clause_relids, seems cheap
so maybe we could do it before the more expensive stuff?


v20240617-0014-Fast-path-for-general-clauselist_selectivi.patch

- Does this actually make meaningful difference?


v20240617-0017-a-branch-of-updates-around-JoinPairInfo.patch

- Can we actually assume the clause has a RestrictInfo on top? IIRC
there are cases where we can get here without it (e.g. AND clause?).


v20240617-0020-Cache-the-result-of-statext_determine_join.patch

- This addresses some of my suggestions in 0001, but I think we don't
actually need to recalculate both lists in each loop.


v20240617-0030-optimize-the-order-of-mcv-equal-function-e.patch

-  There's no explanation to support this optimization. I guess I know
what it tries to do, but doesn't it have the same issues withu
npredictable behavior like the GROUP BY patch, which ended up reverting
and reworking?

- modifies .sql test but not the expected output

- The McvProc name seems a bit misleading. I think it's really "procs",
for example.


v20240617-0033-Merge-3-palloc-into-1-palloc.patch

- Not sure. It's presented as an optimization to save on palloc calls,
but I doubt that's measurable. Maybe it makes it a little bit more
readable, but now I'm not convinced it's worth it.


v20240617-0036-Remove-2-pull_varnos-calls-with-rinfo-left.patch

- Again, can we rely on this always getting a RestrictInfo? Maybe we do,
but it's not obvious to me, so a comment explaining that would be nice.
And maybe an assert to check this.


v20240617-0040-some-code-refactor-as-before.patch

- Essentially applies earlier refactorings/tweaks to another place.

- Seems OK (depending on whether we agree on those changes), but it
seems mostly independent of this patch series. So I'd at least keep it
in a separate patch.


v20240617-0043-Fix-error-unexpected-system-attribute-when.patch

- seems to only tweak the .sql, not expected output

- One of the comments refers to "above limitation" but I'm unsure what
that's about.


v20240617-0048-Add-fastpath-when-combine-the-2-MCV-like-e.patch
v20240617-0050-When-mcv-ndimensions-list_length-clauses-h.patch

- I'm not sure about one of the opimizations, relying on having a clause
for each dimensions of the MCV.


v20240617-0054-clauselist_selectivity_hook.patch

- I believe this does not work with the earlier patch that removed
estimatedclaused bitmap from the "try" function.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

pgsql-hackers by date:

Previous
From: Sven Klemm
Date:
Subject: [PATCH] Improve error message when trying to lock virtual tuple.
Next
From: Aleksander Alekseev
Date:
Subject: Re: [PATCH] Improve error message when trying to lock virtual tuple.