Re: Hash-based MCV matching for large IN-lists - Mailing list pgsql-hackers

From David Geier
Subject Re: Hash-based MCV matching for large IN-lists
Date
Msg-id ff7d491d-f084-4935-8417-c0b7285cdd89@gmail.com
Whole thread Raw
In response to Re: Hash-based MCV matching for large IN-lists  (Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>)
List pgsql-hackers
Hi!

> Attached v4 patch with above fixes.

Good progress!

I did another pass over the code, focusing on structure:

- MCVHasContext and MCVInHashContext are identical. MCVHashEntry and
MCVInHashEntry only differ by the count member. I would, as said before,
merge them and simply not use the count member for the join case.

- hash_mcv_in() and mcvs_in_equal() are identical to hash_mcv() and
mcvs_equal(). Let's remove the new functions and use the existing ones
instead, in the spirit of the previous point.

- The threshold constants are also identical. I would merge them into a
single, e.g. MCV_HASH_THRESHOLD, in the spirit of the previous two points.

- MCVHashTable_hash will then be interchangable with
MCVInHashTable_hash. So let's remove MCVInHashTable_hash, in the spirit
of the previous three points.

- Use palloc_array() instead of palloc() when allocating arrays.

- We can avoid allocating the all-true elem_const array by passing NULL
for elem_const to scalararray_mcv_hash_match(), and considering a NULL
pointer to mean "all elements are constant".

- The following comment got copy&pasted from eqsel_internal() twice. It
reads a little strange now because we're not punting here by immediately
returning like in eqsel_internal() but instead fallback to the original
code path. Maybe say instead "... falling back to default code path to
compute default selectivity" or something like that.
    /*
     * If expression is not variable = something or something =
     * variable, then punt and return a default estimate.
     */

- The call to fmgr_info(opfuncoid, &eqproc) is currently under have_mcvs
but can be moved into the next if.

- elem_nulls and elem_const does have to be 0-initialized via palloc0().
All elements are set in the subsequent for-loop. I believe elem_values
also doesn't have to be 0-initialized via palloc0().

- Have you checked there there's test coverage for the special cases
(nvalues_non_mcv > 0, nvalues_nonconst > 0, IN contains NULL,
isEnequality==true, etc.)?  If not let's add tests for these.


I'll do a 2nd iteration, focusing on correctness, once these comments
are addressed and I've got the SQL from you so that I can test the
corner cases manually.

--
David Geier



pgsql-hackers by date:

Previous
From: Jakub Wartak
Date:
Subject: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Next
From: Antonin Houska
Date:
Subject: Re: Adding REPACK [concurrently]