Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id CAEZATCWAT3hEvWv2JoMfioroy41eKb7roCaM3D0b-BDAoKTbFw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists
List pgsql-hackers
On Thu, 28 Feb 2019 at 19:56, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Attached is an updated version of this patch series.

Here are some random review comments. I'll add more later, but I'm out
of energy for today.

1). src/test/regress/expected/type_sanity.out has bit-rotted.

2). Duplicate OIDs (3425).

3). It looks a bit odd that clauselist_selectivity() calls
statext_clauselist_selectivity(), which does MCV stats and will do
histograms, but it doesn't do dependencies, so
clauselist_selectivity() has to then separately call
dependencies_clauselist_selectivity(). It would seem neater if
statext_clauselist_selectivity() took care of calling
dependencies_clauselist_selectivity(), since dependencies are just
another kind of extended stats.

4). There are no tests for pg_mcv_list_items(). Given a table with a
small enough amount of data, so that it's all sampled, it ought to be
possible to get predictable MCV stats.

5). It's not obvious what some of the new test cases in the
"stats_ext" tests are intended to show. For example, the first test
creates a table with 5000 rows and a couple of indexes, does a couple
of queries, builds some MCV stats, and then repeats the queries, but
the results seem to be the same with and without the stats.

I wonder if it's possible to write smaller, more targeted tests.
Currently "stats_ext" is by far the slowest test in its group, and I'm
not sure that some of those tests add much. It ought to be possible to
write a function that calls EXPLAIN and returns a query's row
estimate, and then you could write tests to confirm the effect of the
new stats by verifying the row estimates change as expected.

6). This enum isn't needed for MCVs:

/*
 * Degree of how much MCV item matches a clause.
 * This is then considered when computing the selectivity.
 */
#define STATS_MATCH_NONE        0    /* no match at all */
#define STATS_MATCH_PARTIAL        1    /* partial match */
#define STATS_MATCH_FULL        2    /* full match */

STATS_MATCH_PARTIAL is never used for MCVs, so you may as well just
use booleans instead of this enum. If those are needed for histograms,
they can probably be made local to histogram.c.

7). estimate_num_groups_simple() isn't needed in this patch.

8). In README.mcv,
s/clauselist_mv_selectivity_mcvlist/mcv_clauselist_selectivity/.

9). In the list of supported clause types that follows (e) is the same
a (c), but with a more general description.

10). It looks like most of the subsequent description of the algorithm
is out of date and needs rewriting. All the stuff about full matches
and the use of ndistinct is now obsolete.

Regards,
Dean


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Should we increase the default vacuum_cost_limit?
Next
From: Julien Rouhaud
Date:
Subject: Re: Checksum errors in pg_stat_database