Hi,
On 19.01.2026 17:01, David Geier wrote:
> Does that mean that we get a different estimation result, depending on
> if the IN list is smaller or not? I think we should avoid that because
> estimation quality might flip for the user unexpectedly.
I think you're right.
To address this, I changed the hash-table entry to track an additional 'count' field, representing how many times a particular value appears on the hashed side. When inserting into the hash table, if the value is already present, I increment 'count'; otherwise, I create a new entry with count = 1.
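A minimal sketch of that, assuming a simplehash-generated table (the type and function names below are illustrative, not the exact ones in the patch):

typedef struct ScalarArrayHashEntry
{
    Datum   value;      /* the hashed datum */
    int     count;      /* occurrences of this value on the hashed side */
    char    status;     /* entry status, required by simplehash.h */
} ScalarArrayHashEntry;

static void
hash_insert_value(valuehash_hash *ht, Datum value)
{
    bool    found;
    ScalarArrayHashEntry *entry = valuehash_insert(ht, value, &found);

    if (found)
        entry->count++;     /* duplicate: bump the counter */
    else
        entry->count = 1;   /* first occurrence */
}

Tracking the count lets duplicate values on the hashed side contribute their full weight, so the result no longer depends on which side ends up being hashed.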
> The code in master currently calls an operator-specific selectivity
> estimation function. For equality this is typically eqsel() but the
> function can be specified during CREATE OPERATOR.
> Can we safely special-case the behavior of eqsel() for all possible
> operators for the ScalarArrayOpExpr case?
Unfortunately there is no safe way to make this optimization generic for
arbitrary restrict functions, because a custom RESTRICT function does
not have to use MCVs at all. IMO, in practice the vast majority of
ScalarArrayOpExpr uses with = or <> rely on the built-in equality
operators whose selectivity is computed by eqsel()/neqsel(), so I
limited this optimization to those cases.
> How did you do that? I cannot find the code that checks for that.
In scalararraysel(), before attempting the hash-based path, we determine whether the operator behaves like equality or inequality based on its selectivity function:
if (oprsel == F_EQSEL || oprsel == F_EQJOINSEL)
    isEquality = true;
else if (oprsel == F_NEQSEL || oprsel == F_NEQJOINSEL)
    isInequality = true;
Then the hash-based MCV matching is only attempted under:
if ((isEquality || isInequality) && !is_join_clause)
So effectively this restricts the optimization to operators whose selectivity is computed by eqsel()/neqsel() on restriction clauses. Join clauses (which would use eqjoinsel()/neqjoinsel()) are excluded via the !is_join_clause check.
> For the MCVs, can't we reuse some code from the eqjoinsel() optimization
> we did? The entry and context structs look similar enough to only need one.
I considered reusing pieces from the eqjoinsel() optimization, but in practice it turned out to be difficult to share the code cleanly. Also, looking at this file more broadly, we already have multiple places that reimplement a similar pattern.
> Making the code more compact would ease reviewing a lot.
Agreed; making the code more compact would significantly ease reviewing. I've found a way to unify the Const-array and ArrayExpr cases: in the ArrayExpr path, we can first construct the same arrays as in the Const-array case (elem_values, elem_nulls), and additionally build a boolean array elem_const[] indicating whether each element is a Const. Then the hash-based MCV matching function can:
- Ignore NULL and non-Const elements when building and probing the hash table.
- Count how many non-Const elements are present.
- After MCV and non-MCV constant handling, account for non-Const elements separately using var_eq_non_const() and fold their probabilities into the same ANY/ALL accumulation logic.
I've attached a v3 patch with these changes.
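As a rough sketch of the last step in the list above (the elem_exprs[] array of original element expressions, the loop variables, and the exact var_eq_non_const() argument order are assumptions to be checked against selfuncs.c):

    for (i = 0; i < num_elems; i++)
    {
        Selectivity s2;

        /* NULLs and Const elements were already handled via the MCV path */
        if (elem_nulls[i] || elem_const[i])
            continue;

        /* estimate for a non-constant comparison value */
        s2 = var_eq_non_const(&vardata, operator, collation,
                              elem_exprs[i], varonleft, isInequality);
        CLAMP_PROBABILITY(s2);

        if (useOr)
            s1 = s1 + s2 - s1 * s2;   /* ANY: OR of independent clauses */
        else
            s1 = s1 * s2;             /* ALL: AND of independent clauses */
    }

This mirrors the s1 accumulation already used by scalararraysel()'s per-element loop, so the non-Const elements keep their current behavior.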
To validate that both approaches produce the same estimation results, I temporarily kept both implementations (hash-based and nested-loop) and compared their resulting selectivity values, logging every case where they differed. I ran the regression tests and some local workload testing with this check enabled and did not observe any mismatches. I've attached the patch with this logging as well.
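For illustration, the temporary check boils down to something like the following (s1_hash and s1_loop stand for the two computed selectivities; the tolerance is arbitrary):

    if (fabs(s1_hash - s1_loop) > 1.0e-10)
        elog(LOG, "scalararraysel mismatch: hash-based=%g nested-loop=%g",
             s1_hash, s1_loop);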
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/