Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id 3b8577ba-6a29-d1f0-b155-d3a50c1a8399@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-hackers
FWIW the main unsolved issue (at least on the MCV part) is how it
decides which items to keep in the list.

As explained in [1], in the multivariate case we can't simply look at
the group frequency and compare it to the average frequency (of the
non-MCV items), which is what analyze_mcv_list() does in the
single-column case. In the multivariate case we also case about observed
vs. base frequency, i.e. we want the MCV list to include groups that are
present singificantly more/less than product of per-column stats.

I've repeatedly tried to come up with a criteria that would address
that, but it seems rather difficult because we can't abandon the other
criteria either. So the MCV list should include groups that match both

(a) items that are statistically more common than the non-MCV part (i.e.
the rule from per-column analyze_mcv_list)

(b) items that are statistically more/less common than estimated from
per-column stats (i.e. the new rule)

Enforcing rule (a) seems reasonable because it ensures the MCV list
includes all items more frequent than the last one. Without it, it's
difficult to decide know whether the absent item is very common (but
close to base frequency) or very uncommon (so less frequent than the
last MCV item).

So it's not clear to me how to best marry these two things. So far the
only thing I came up with is looking for the last item where the
frequency and base frequency are very different (not sure how exactly to
decide when the difference becomes statistically significant), include
all items with higher frequencies, and then do analyze_mcv_list() to
also enforce (a). But it seems a bit cumbersome :-(

[1]
https://www.postgresql.org/message-id/8ac8bd94-478d-215d-e6bd-339f1f20a74c%402ndquadrant.com


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Python versions (was Re: RHEL 8.0 build)
Next
From: "Tsunakawa, Takayuki"
Date:
Subject: RE: [Proposal] Add accumulated statistics