Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From David Rowley
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id CAKJS1f81TbFHJtfzYYVH19NWroxX4NOyZ2vvd5y+jb6VLr3WRg@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
I've started looking over 0002.  Here are a few things so far:

1. I think this should be pg_statistic_ext.stxhistogram?

     Values of the <type>pg_histogram</type> can be obtained only from the
     <literal>pg_statistic.stxhistogram</literal> column.

2. I don't think this bms_copy is needed anymore. I think it was
previously since there were possibly multiple StatisticExtInfo objects
per pg_statistic_ext row, but now it's 1 for 1.

+ info->keys = bms_copy(keys);

naturally, the bms_free() will need to go too.

3. I've not really got into understanding how the new statistics types
are applied yet, but I found this:

* If asked to build both MCV and histogram, first build the MCV part
* and then histogram on the remaining rows.

I guess that means we'll get different estimates with:

create statistic a_stats (mcv,histogram) on a,b from t;

vs

create statistic a_stats1 (mcv) on a,b from t;
create statistic a_stats2 (histogram) on a,b from t;

Is that going to be surprising to people?

4. I guess you can replace "(histogram == NULL);" with "false". The
compiler would likely do it anyway, but...

if (histogram != NULL)
{
/* histogram already is a bytea value, not need to serialize */
nulls[Anum_pg_statistic_ext_stxhistogram - 1] = (histogram == NULL);
values[Anum_pg_statistic_ext_stxhistogram - 1] = PointerGetDatum(histogram);
}

but, hmm. Shouldn't you serialize this, like you are with the others?

5. serialize_histogram() and statext_histogram_deserialize(), should
these follow the same function naming format?

6. IIRC some compilers may warn about this:

if (stat->kinds & requiredkinds)

making it:

if ((stat->kinds & requiredkinds))

should fix that.

UPDATE: Tried to make a few compilers warn about this and failed.
Perhaps I've misremembered.

7. Comment claims function has a parameter named 'requiredkind', but
it no longer does. The comment also needs updated to mention that it
finds statistics with any of the required kinds.

 * choose_best_statistics
 * Look for and return statistics with the specified 'requiredkind' which
 * have keys that match at least two of the given attnums.  Return NULL if
 * there's no match.
 *
 * The current selection criteria is very simple - we choose the statistics
 * object referencing the most of the requested attributes, breaking ties
 * in favor of objects with fewer keys overall.
 *
 * XXX If multiple statistics objects tie on both criteria, then which object
 * is chosen depends on the order that they appear in the stats list. Perhaps
 * further tiebreakers are needed.
 */
StatisticExtInfo *
choose_best_statistics(List *stats, Bitmapset *attnums, int requiredkinds)

8. Looking at statext_clauselist_selectivity() I see it calls
choose_best_statistics() passing requiredkinds as STATS_EXT_INFO_MCV |
STATS_EXT_INFO_HISTOGRAM, do you think the function now needs to
attempt to find the best match plus the one with the most statistics
kinds?

It might only matter if someone had:

create statistic a_stats1 (mcv) on a,b from t;
create statistic a_stats2 (histogram) on a,b from t;
create statistic a_stats3 (mcv,histogram) on a,b from t;

Is it fine to just return a_stats1 and ignore the fact that a_stats3
is probably better? Or too corner case to care?

9. examine_equality_clause() assumes it'll get a Var. I see we should
only allow clauses that pass statext_is_compatible_clause_internal(),
so maybe it's worth an Assert(IsA(var, Var)) along with a comment to
mention anything else could not have been allowed.

10. Does examine_equality_clause need 'root' as an argument?

11. UINT16_MAX -> PG_UINT16_MAX

/* make sure we fit into uint16 */
Assert(count <= UINT16_MAX);

(Out of energy for today.)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pgsql-hackers by date:

Previous
From: Laurenz Albe
Date:
Subject: Re: Libpq support to connect to standby server as priority
Next
From: Surafel Temesgen
Date:
Subject: Re: pg_dump multi VALUES INSERT