Home > mailing lists

Re: Use merge-based matching for MCVs in eqjoinsel - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Use merge-based matching for MCVs in eqjoinsel
Date	November 18 22:16:54
Msg-id	1310340.1763493414@sss.pgh.pa.us Whole thread Raw
In response to	Re: Use merge-based matching for MCVs in eqjoinsel (David Geier <geidav.pg@gmail.com>)
Responses	Re: Use merge-based matching for MCVs in eqjoinsel
List	pgsql-hackers

Tree view

David Geier <geidav.pg@gmail.com> writes:
> On 17.11.2025 19:44, Tom Lane wrote:
>> Or maybe better, since we are considering an O(m*n) algorithm
>> versus an O(m+n) one, we could check whether
>> sslot1.nvalues * sslot2.nvalues - (sslot1.nvalues + sslot2.nvalues)
>> exceeds some threshold.  But that doesn't offer any insight into
>> just what the threshold should be, either.

> Good idea. How about using that formula and then determining the
> threshold with a few experiments? Could be the JOB benchmark Ilia has
> already set up or some synthetic test-cases.

Thinking a bit harder, we are comparing these costs:

1. The old code does m*n comparisons, with next-to-no other overhead.
Sometimes the inner loop will stop early, resulting in fewer
comparisons; but I don't think we have any good handle on how often
that's likely to happen.  Let's just consider worst-case numbers.

2. The hash code will do m+n hash-value computations, n hashtable
insertions, and m hashtable searches.  (We can assume m >= n.)
The hashtable insertions might sometimes do datum_image_eq
comparisons, but only in the event of a hash value collision,
which is probably rare.  The hashtable searches will do comparisons
in the event of a hash match.  Unlike the old code, the worst case
is where everything has a match not where nothing has a match, but
we're considering the worst case so let's suppose that the m
searches do n comparisons on the way to finding n matches.  (They
could do more comparisons in the event of hash value collisions,
but I'm still supposing those are rare.)  So altogether we have
    m+n hash-value computations
    n comparisons
    n hashtable insertions (exclusive of above costs)
    m hashtable searches (exclusive of above costs)
    1 hashtable creation/destruction, with O(n)+constant cost

What we lack here is a solid idea of the relative costs of those
primitive operations.  I think though that it's reasonable to assume
that hash-value computations are about as expensive as comparisons:
data types with expensive comparisons must also have expensive hashing
methods to ensure the hashing gets the right answers for "equal"
values.  Hashtable insertions and searches are probably about equally
expensive too, though it's not clear that that's in the same league as
the datatype-dependent operations.  And the hashtable creation
certainly has an O(n) cost just from initial zeroing of the array,
though the constant factor in that is likely small.

However, if we fuzz things tremendously and just assume all these
costs are equal, we get 2m + 4n operations altogether, which is
probably not so far off for datatypes with cheap hashing and
comparison (like integers).  At the other end of the scale, for
datatypes with expensive operations, we could disregard the hashtable
operations and conclude that there are m + 2n interesting operations.

I'm a little inclined to split the difference and take the hashing
cost as 2m + 2n, which leads to the conclusion that we should switch
to hashing when m*n > 2*(m+n), maybe with a little extra added to
account for the constant-time aspects of the hashtable setup.

It'd be good to validate this model with some tests of course.

> We could also include the operator costs for hashing and equality
> comparison to make it more precise, in case they're easily accessible
> at this point.

Well, we could look those up, but sadly it'd just be
garbage-in-garbage-out.  We don't have good estimates for the relative
costs of different hash or equality functions.

regression=# select procost from pg_proc where proname = 'int4eq';
 procost 
---------
       1
(1 row)

regression=# select procost from pg_proc where proname = 'texteq';
 procost 
---------
       1
(1 row)

As long as you are willing to concede that 1 hash operation
should be of comparable cost to 1 comparison, I think it'd
mostly come out in the wash anyway.

            regards, tom lane

pgsql-hackers by date:

From: Melanie Plageman
Date: 18 November, 22:13:32
Subject: Re: Eagerly evict bulkwrite strategy ring

From: Mahmoud Ayman
Date: 18 November, 22:33:47
Subject: Re: gen_guc_tables.pl: Validate required GUC fields before code generation

Re: Use merge-based matching for MCVs in eqjoinsel - Mailing list pgsql-hackers

Previous

Next