Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id 72a97865-9864-9cd7-eedb-e96dae4bde65@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (David Rowley <david.rowley@2ndquadrant.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers

On 1/16/19 7:56 AM, David Rowley wrote:> On Tue, 15 Jan 2019 at 08:21, 
Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
 >> Turns out you were right - the attribute_referenced piece was quite
 >> unnecessary. So I've removed it. I've also extended the regression tests
 >> to verify changing type of another column does not reset the stats.
 >
 > (Trying to find my feet over here)
 >
 > I've read over the entire thread, and apart from missing the last two
 > emails and therefore the latest patch, I managed to read over most of
 > the MCV patch. I didn't quite get to reading mcv.c and don't quite
 > have the energy to take that on now.
 >
Thanks for looking!

 > At this stage I'm trying to get to know the patch.  I read a lot of
 > discussing between you and Dean ironing out how the stats should be
 > used to form selectivities.  At the time I'd not read the patch yet,
 > so most of it went over my head.
 >
 > I did note down a few things on my read.  I've included them below.
 > Hopefully, they're useful.
 >
 > MCV list review
 >
 > 1. In mvc.c there's Assert(ndistinct <= UINT16_MAX); This should be
 > PG_UINT16_MAX
 >
Yep. Will fix.

 > 2. math.h should be included just after postgres.h
 >
Yep. Will fix.

 > 3. Copyright is still -2017 in mcv.c.  Hopefully, if you change it to
 > 2019, you'll never have to bump it ever again! :-)
 >
Optimist ;-)

 > 4. Looking at pg_stats_ext_mcvlist_items() I see you've coded the
 > string building manually. The way it's coded I'm finding a little
 > strange.  It means the copying becomes quadratic due to
 >
 > snprintf(buff, 1024, format, values[1], DatumGetPointer(valout));
 > strncpy(values[1], buff, 1023);
 >
 > So basically, generally, here you're building a new string with
 > values[1] followed by a comma, then followed by valout. One the next
 > line you then copy that new buffer back into values[1].  I understand
 > this part is likely not performance critical, but I see no reason to
 > write the code this way.
 >
 > Are you limiting the strings to 1024 bytes on purpose?  I don't see
 > any comment mentioning you want to truncate strings.
 >
 > Would it not be better to do this part using a
 > AppendStringInfoString()? and just manually add a '{', ',' or '}' as
 > and when required?
 >> DatumGetPointer(valout) should really be using DatumGetCString(valout).
 >
 > Likely you can also use heap_form_tuple.  This will save you having to
 > convert ints into strings then only to have BuildTupleFromCStrings()
 > do the reverse.
 >
I agree. I admit all of this is a residue of an initial hackish version 
of the function, and should be changed to StringInfo. Will fix.

 > 5. individiaul -> individual
 >      lists.  This allows very accurate estimates for individiaul 
columns, but
 >
 >      litst -> lists
 >
 >          litst on combinations of columns.  Similarly to functional 
dependencies
 >
Will fix.

 > 6. Worth mentioning planning cycles too?
 >
 >       "It's advisable to create <literal>MCV</literal> statistics 
objects only
 >       on combinations of columns that are actually used in conditions 
together,
 >       and for which misestimation of the number of groups is 
resulting in bad
 >       plans.  Otherwise, the <command>ANALYZE</command> cycles are 
just wasted."
 >
Makes sense. Although that's what we say about the existing stats, so 
perhaps we should tweak that too.

 > 7. straight-forward -> straightforward
 >
 > (most-common values) lists, a straight-forward extension of the 
per-column
 > > 8. adresses -> addresses
 >
 > statistics adresses the limitation by storing individual values, but it
 >
Will fix. Thanks for proof-reading.

 > 9. Worth mentioning ANALYZE time?
 >
 >      This section introduces multivariate variant of 
<acronym>MCV</acronym>
 >      (most-common values) lists, a straight-forward extension of the 
per-column
 >      statistics described in <xref 
linkend="row-estimation-examples"/>. This
 >      statistics adresses the limitation by storing individual values, 
but it
 >      is naturally more expensive, both in terms of storage and 
planning time.
 >
Yeah.

 > 10. low -> a low
 >
 >      with low number of distinct values. Before looking at the second 
query,
 >
 > 11. them -> then
 >
 >      on items in the <acronym>MCV</acronym> list, and them sums the 
frequencies
 >
Will fix.

 > 12.  Should we be referencing the source from the docs?
 >
 > See <function>mcv_clauselist_selectivity</function>
 >      in <filename>src/backend/statistics/mcv.c</filename> for details.
 >
 > hmm. I see it's not the first going by: git grep -E "\w+\.c\<"
 > gt
Hmm, that does not return anything to me - do you actually see any 
references to .c files in the sgml docs? I agree that probably is not a 
good idea, so I'll remove that.

 > 13. Pretty minor, but the following loop in
 > UpdateStatisticsForTypeChange() could use a break;
 >
 > attribute_referenced = false;
 > for (i = 0; i < staForm->stxkeys.dim1; i++)
 > if (attnum == staForm->stxkeys.values[i])
 > attribute_referenced = true;
 >
 > UPDATE: If I'd reviewed the correct patch I'd have seen that you'd
 > removed this already
 >
;-)

 > 14. Again in UpdateStatisticsForTypeChange(), would it not be better
 > to do the statext_is_kind_built(oldtup, STATS_EXT_MCV) check before
 > checking if the stats contain this column?  This gets rid of your
 > reset_stats variable.
 >
 > I also don't quite understand why there's an additional check for
 > statext_is_kind_built(oldtup, STATS_EXT_MCV), which if that's false
 > then why do we do the dummy update on the tuple?
 >
 > Have you just coded this so that you can support other stats types
 > later without too much modification? If so, I'm finding it a bit
 > confusing to read, so maybe it's worth only coding it that way if
 > there's more than one stats type to reset for.
 >
 > UPDATE: If I'd reviewed the correct patch I'd have seen that you'd
 > removed this already
;-)

 >
 > 15. I see you broke out the remainder of the code from
 > clauselist_selectivity() into clauselist_selectivity_simple().  The
 > comment looks like just a copy and paste from the original.  That
 > seems like quite a bit of duplication. Is it better to maybe trim down
 > the original one?
 >
I'll see what I can do.

 > 16. I initially didn't see how this code transformed the bms into an 
array:
 >
 > /*
 > * Transform the bms into an array, to make accessing i-th member easier,
 > * and then construct a filtered version with only attnums referenced
 > * by the dependency we validate.
 > */
 > attnums = build_attnums(attrs);
 >
 > attnums_dep = (int *)palloc(k * sizeof(int));
 > for (i = 0; i < k; i++)
 > attnums_dep[i] = attnums[dependency[i]];
 >
 > Would it be better to name build_attnums() build_attnums_array() ?
 >
 > I think it would also be better to, instead of saying "the bms", just
 > say "attrs".
 >
Hmmm, maybe.

 > 17. dependencies_clauselist_selectivity(), in:
 >
 > if ((dependency_is_compatible_clause(clause, rel->relid, &attnum)) &&
 > (!bms_is_member(listidx, *estimatedclauses)))
 >
 > would it be better to have the bms_is_member() first?
 >
Yes, that might be a tad faster.

 > 18. In dependencies_clauselist_selectivity() there seem to be a new
 > bug introduced.  We do:
 >
 > /* mark this one as done, so we don't touch it again. */
 > *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
 >
 > but the bms_is_member() check that skipped these has been removed.
 >
 > It might be easier to document if we just always do:
 >
 >    if (bms_is_member(listidx, *estimatedclauses))
 > continue;
 >
 > at the start of both loops. list_attnums can just be left unset for
 > the originally already estimatedclauses.
 >
It's probably not as clear as it should be, but if the clause is already 
estimated (or incompatible), then the list_attnums[] entry will be 
InvalidAttrNumber. Which is what we check in the second loop.

 > 19. in extended_stats.c, should build_attnums() be documented that the
 > Bitmapset members are not offset by
 > FirstLowInvalidHeapAttributeNumber. I think mostly Bitmapsets of
 > Attnums are offset by this, so might be worth a mention.
 >
Good point.

 > 20. I think bms_member_index() needs documentation.  I imagine you'll
 > want to mention that the bitmapset must contain the given varattno,
 > else surely it'll do the wrong thing if it's not. Perhaps an
 > Assert(bms_is_member(keys, varattno)); should be added to it.
 >
Agreed. Or maybe make it return -1 in that case? It might even have 
missing_ok flag or something like that.

 > 21. Comment does not really explain what the function does or what the
 > arguments mean:
 >
 > /*
 >   * statext_is_compatible_clause_internal
 >   * Does the heavy lifting of actually inspecting the clauses for
 >   * statext_is_compatible_clause.
 >   */
 >
Will improve.

 > 22. In statext_is_compatible_clause_internal():
 >
 > /* Var = Const */
 >
 > The above comment seems a bit misplaced.  It looks like the code below
 > it is looking for an OpExpr in the form of "Var <op> Const", or "Const
 > <op> Var".
 >
Yes, I agree.

 > 23. statext_is_compatible_clause_internal() you have:
 >
 > if ((get_oprrest(expr->opno) != F_EQSEL) &&
 > (get_oprrest(expr->opno) != F_NEQSEL) &&
 > (get_oprrest(expr->opno) != F_SCALARLTSEL) &&
 > (get_oprrest(expr->opno) != F_SCALARLESEL) &&
 > (get_oprrest(expr->opno) != F_SCALARGTSEL) &&
 > (get_oprrest(expr->opno) != F_SCALARGESEL))
 > return false;
 >
 > 6 calls to get_oprrest(). 1 is enough.
 >
 > How does the existing MCV and histogram stats handle these operators?
 > Does it insist on a btree opfamily, or is it as crude as this too?
 >
It's this crude too, AFAICS.

 > 24. In statext_is_compatible_clause_internal, you have:
 >
 > /* NOT/AND/OR clause */
 > if (or_clause(clause) ||
 > and_clause(clause) ||
 > not_clause(clause))
 > {
 > /*
 > * AND/OR/NOT-clauses are supported if all sub-clauses are supported
 >
 > Looks like you were not sure which order to have these, so you just
 > tried a few variations :-D Maybe just make them all the same?
 >
If you insist ;-)

 > 25. Does statext_is_compatible_clause_internal)_ need to skip over 
RelabelTypes?
 >
I believe it does, based on what I've observed during development. Why 
do you think it's not necessary?

 > 26. In statext_is_compatible_clause_internal() you mention: /* We only
 > support plain Vars for now */, but I see nothing that ensures that
 > only Vars are allowed in the is_opclause() condition.
 >
 > /* see if it actually has the right */
 > ok = (NumRelids((Node *) expr) == 1) &&
 > (is_pseudo_constant_clause(lsecond(expr->args)) ||
 > (varonleft = false,
 >    is_pseudo_constant_clause(linitial(expr->args))));
 >
 > the above would allow var+var == const through.
 >
But then we call statext_is_compatible_clause_internal on it again, and 
that only allows Vars and "Var op Const" expressions. Maybe there's a 
way around that?

 > The NumRelids seems like it would never have anything > 1 as you have
 > a BMS_SINGLETON test on the RestrictInfo where you're calling this
 > function from.  I think you likely want just a IsA(... , Var) checks
 > here, after skipping over RelabelTypes.
 > > Not sure what "/* see if it actually has the right */" means.
 >
That should have been "right structure" I believe.

 > 27. Should the function be named something more related to MCV?  The
 > name makes it appear fairly generic to extended stats.
 >
 >   * statext_is_compatible_clause
 >   * Determines if the clause is compatible with MCV lists.
 >
No, because it's supposed to also handle histograms (and perhaps other 
stats types) in the future.

 > 28. This comment seems wrong:
 >
 >   * Currently we only support Var = Const, or Const = Var. It may be 
possible
 >   * to expand on this later.
 >
 > I see you're allowing IS NULL and IS NOT NULL too.  = does not seem to
 > be required either.
 >
OK, will fix.

 > 29. The following fragment makes me think we're only processing
 > clauses to use them with MCV lists, but the comment claims "dependency
 > selectivity estimations"
 >
 > /* we're interested in MCV lists */
 > int types = STATS_EXT_MCV;
 >
 > /* check if there's any stats that might be useful for us. */
 > if (!has_stats_of_kind(rel->statlist, types))
 > return (Selectivity) 1.0;
 >
 > list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
 > list_length(clauses));
 >
 > /*
 > * Pre-process the clauses list to extract the attnums seen in each item.
 > * We need to determine if there's any clauses which will be useful for
 > * dependency selectivity estimations. Along the way we'll record all of
 >
Yeah, that's copy-pasto.

 > 30. Is it better to do the bms_is_member() first here?
 >
 > if ((statext_is_compatible_clause(clause, rel->relid, &attnums)) &&
 > (!bms_is_member(listidx, *estimatedclauses)))
 >
 > Likely it'll be cheaper.
 >
Yeah, same as before.

 > 31. I think this comment should be /* Ensure choose_best_statistics()
 > didn't mess up */
 >
 > /* We only understand MCV lists for now. */
 > Assert(stat->kind == STATS_EXT_MCV);
 >
I'll expand the comment a bit.

 > 32. What're lags?
 >
 > bool    *isnull; /* lags of NULL values (up to 32 columns) */
 >
Should be "flags" I think.

 > 33. "ndimentions"? There's no field in the struct by that name. I'd
 > assume it's the same size as the isnull array above it?
 >
 > Datum    *values; /* variable-length (ndimensions) */
 >
Yes, that's the case.

 > 34. README.mcv
 >
 > * large -> a large
 >
 > For columns with large number of distinct values (e.g. those with 
continuous
 >
 > * Is the following up-to-date?  I thought I saw code for NOT too?
 >
 >      (a) equality clauses    WHERE (a = 1) AND (b = 2)
 >      (b) inequality clauses  WHERE (a < 1) AND (b >= 2)
 >      (c) NULL clauses        WHERE (a IS NULL) AND (b IS NOT NULL)
 >      (d) OR clauses          WHERE (a < 1) OR (b >= 2)
 >
 > * multi-variate -> multivariate
 >
 > are large the list may be quite large. This is especially true for 
multi-variate
 >
 > * a -> an
 >
 > TODO Currently there's no logic to consider building only a MCV list 
(and not
 >
 > * I'd have said "an SRF", but git grep "a SRF" disagrees with me. I
 > guess those people must be pronouncing it, somehow!? surf... serf... ?
 >
 > easier, there's a SRF returning detailed information about the MCV lists.
 >
 > * Is it better to put a working SQL in here?
 >
 > SELECT * FROM pg_mcv_list_items(stxmcv);
 >
 > maybe like:
 >
 > SELECT s.* FROM pg_statistic_ext, LATERAL pg_mcv_list_items(stxmcv) s;
 >
 > Maybe with a WHERE clause?
 >
 > * This list seems outdated.
 >
 >      - item index (0, ..., (nitems-1))
 >      - values (string array)
 >      - nulls only (boolean array)
 >      - frequency (double precision)
 >
 > base_frequency seems to exist now too.
 >
Yeah, those are mostly typos. Will fix.


thanks

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [HACKERS] REINDEX CONCURRENTLY 2.0
Next
From: David Rowley
Date:
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists