Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id CAEZATCW8=TDxAJUXnFT1p+4v+ReaabXxWJeGTn5EpEgGjQKBQw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists
List pgsql-hackers
On Mon, 7 Jan 2019 at 00:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> FWIW the main unsolved issue (at least on the MCV part) is how it
> decides which items to keep in the list.
>
> As explained in [1], in the multivariate case we can't simply look at
> the group frequency and compare it to the average frequency (of the
> non-MCV items), which is what analyze_mcv_list() does in the
> single-column case. In the multivariate case we also case about observed
> vs. base frequency, i.e. we want the MCV list to include groups that are
> present singificantly more/less than product of per-column stats.
>
> I've repeatedly tried to come up with a criteria that would address
> that, but it seems rather difficult because we can't abandon the other
> criteria either. So the MCV list should include groups that match both
>
> (a) items that are statistically more common than the non-MCV part (i.e.
> the rule from per-column analyze_mcv_list)
>
> (b) items that are statistically more/less common than estimated from
> per-column stats (i.e. the new rule)

Thinking about this some more, I think that it probably isn't
appropriate to use analyze_mcv_list() directly because that's making
specific assumptions about how items not in the list will be estimated
that aren't actually true for groups of values in multivariate stats.
If a group of values isn't in the MCV list, it gets estimated based on
the product of the selectivities from the per-column stats (modulo the
additional logic preventing the selectivity not exceeding the total
non-MCV selectivity).

So actually, the estimate for a group of values will be either the MCV
item's frequency (if the MCV item is kept), or (roughly) the MCV
item's base_frequency (if the MCV item is not kept). That suggests
that we should simply keep items that are significantly more or less
common than the item's base frequency -- i.e., keep rule (b) and ditch
rule (a).

> Enforcing rule (a) seems reasonable because it ensures the MCV list
> includes all items more frequent than the last one. Without it, it's
> difficult to decide know whether the absent item is very common (but
> close to base frequency) or very uncommon (so less frequent than the
> last MCV item).

I'm not sure there's much we can do about that. Keeping the item will
result in keeping a frequency that we know is close to the base
frequency, and not keeping the item will result in per-column stats
being used that we expect to also give an estimate close to the base
frequency. So either way, the result is about the same, and it's
probably better to discard it, leaving more room for other items about
which we may have more information.

That said, there is a separate benefit to keeping items in the list
even if their frequency is close to the base frequency -- the more
items kept, the larger their total selectivity will be, giving a
better cap on the non-MCV selectivities. So if, after keeping all
items satisfying rule (b), there are free slots available, perhaps
they should be used for the most common remaining values satisfying
rule (a).

Regards,
Dean


pgsql-hackers by date:

Previous
From: Fabien COELHO
Date:
Subject: Re: Offline enabling/disabling of data checksums
Next
From: Tom Lane
Date:
Subject: Re: OpenBSD versus semaphores