
From: Tomas Vondra
Subject: Choosing values for multivariate MCV lists
Date:
Msg-id: 20190618205920.qtlzcu73whfpfqne@development
Hi,

The current implementation of multi-column MCV lists (added in this
cycle) uses a fairly simple algorithm to pick combinations to include in
the MCV list. We just compute a minimum number of occurrences, and then
include all entries sampled more often. See get_mincount_for_mcv_list().
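
For reference, the statistics can be created and inspected like this (the
statistics object name is just an example, the rest is standard PG 12
syntax). The frequency and base_frequency columns returned by
pg_mcv_list_items() are the quantities referenced later in this message:

    CREATE STATISTICS addresses_mcv (mcv)
        ON city_name, street_name FROM addresses;

    ANALYZE addresses;

    -- list the combinations the current algorithm picked, together with
    -- the sampled frequency and the base frequency (the frequency implied
    -- by the independence assumption)
    SELECT m.*
      FROM pg_statistic_ext s
      JOIN pg_statistic_ext_data d ON (d.stxoid = s.oid),
           pg_mcv_list_items(d.stxdmcv) m
     WHERE s.stxname = 'addresses_mcv';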

By coincidence I received a real-world data set where this does not work
particularly well, so I'm wondering if we can improve it somehow. The MCV
list does not make the estimates worse than having no MCV list at all, so
I don't think this is a bug, but I wonder if we could do better.

The data set is very simple - it's a single table of valid addresses
(city/street name and street number):

    CREATE TABLE addresses (
        city_name   text,
        street_name text,
        street_no   int
    ); 

Data with Czech addresses is available here (it's ~9.3MB, so I'm not
attaching it directly):

    https://drive.google.com/file/d/1EiZGsk6s5hqzZrL7t5KiaqByJnBsndfA

and attached is a SQL script I used to compute a "virtual" MCV list.
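
The idea is a simple aggregation - for each city/street combination
compute the observed frequency and the "base" frequency implied by the
independence assumption. A simplified sketch (not the attached script,
and the view name virtual_mcv is just an example):

    CREATE VIEW virtual_mcv AS
    WITH total AS (
        SELECT count(*)::numeric AS n FROM addresses
    ),
    per_city AS (
        SELECT city_name, count(*)::numeric AS cnt
          FROM addresses GROUP BY city_name
    ),
    per_street AS (
        SELECT street_name, count(*)::numeric AS cnt
          FROM addresses GROUP BY street_name
    ),
    combos AS (
        SELECT city_name, street_name, count(*)::numeric AS cnt
          FROM addresses GROUP BY city_name, street_name
    )
    SELECT c.city_name, c.street_name,
           c.cnt / t.n                     AS freq,
           (pc.cnt / t.n) * (ps.cnt / t.n) AS base_freq
      FROM combos c
      JOIN per_city pc ON (pc.city_name = c.city_name)
      -- IS NOT DISTINCT FROM keeps the groups with NULL street_name
      JOIN per_street ps
        ON (ps.street_name IS NOT DISTINCT FROM c.street_name)
      CROSS JOIN total t;

    -- picking entries simply by frequency then looks like this
    SELECT * FROM virtual_mcv ORDER BY freq DESC LIMIT 100;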

There are about 3M records, so let's query for a street name in one of
the cities:

  EXPLAIN ANALYZE
  SELECT * FROM addresses
   WHERE city_name = 'Most' AND street_name = 'Pionýrů'


                                  QUERY PLAN
  ----------------------------------------------------------------------------
   Seq Scan on addresses  (cost=0.00..62645.38 rows=1 width=25)
                          (actual time=117.225..204.291 rows=779 loops=1)
     Filter: ((city_name = 'Most'::text) AND (street_name = 'Pionýrů'::text))
     Rows Removed by Filter: 2923846
   Planning Time: 0.065 ms
   Execution Time: 204.525 ms
  (5 rows)
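
The rows=1 estimate is roughly what you get by multiplying the per-column
selectivities of the two conditions, i.e. the independence assumption.
That's easy to check directly (just an illustration, this is of course
not how the planner computes it):

    -- per-column vs. combined selectivity of the two conditions
    SELECT (count(*) FILTER (WHERE city_name = 'Most'))::numeric
               / count(*) AS sel_city,
           (count(*) FILTER (WHERE street_name = 'Pionýrů'))::numeric
               / count(*) AS sel_street,
           (count(*) FILTER (WHERE city_name = 'Most'
                               AND street_name = 'Pionýrů'))::numeric
               / count(*) AS sel_both
      FROM addresses;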

It's true 779 rows is only a tiny fraction of the data set (~0.025%),
but this data set is a bit weird in one other aspect - about half of the
records have an empty (NULL) street_name, i.e. just city + number. Small
villages simply don't have streets at all, and large cities probably
started as small villages, so they have such addresses too.

This however means that by choosing the MCV entries solely based on the
number of occurrences in the sample, we end up with MCV lists where the
vast majority of entries have a NULL street name.

That's why we got such a poor estimate in the example query, despite the
fact that this city/street combination is the most frequent one in the
data set among those with a non-NULL street name.

The other weird thing is that the frequency of NULL street names is
fairly uniform across the whole data set. In total about 50% of the
addresses have a NULL street name, and for individual cities the fraction
is generally between 25% and 100%, so the estimate is less than 2x off in
those cases.
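
For the record, the per-city fraction of NULL street names can be checked
with something like this:

    -- fraction of addresses without a street name, per city
    SELECT city_name,
           count(*) AS addresses,
           round(100.0 * count(*) FILTER (WHERE street_name IS NULL)
                       / count(*), 1) AS pct_null_street
      FROM addresses
     GROUP BY city_name
     ORDER BY addresses DESC
     LIMIT 20;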

But for addresses with non-NULL street names, the situation is very
different. Some street names are unique to a single city, etc.

Overall, this means we end up with an MCV list whose entries represent
the mostly-uniform part of the data set, instead of preferring the
entries that are truly skewed.

So I'm wondering if we could/should rethink the algorithm to look at the
frequency and base_frequency, and maybe pick entries based on their
ratio, or something like that.

For example, we might sort the entries by

    abs(freq - base_freq) / freq

which seems like a reasonable "measure of mis-estimate". Or maybe we
might look just at abs(freq - base_freq)? I think the first option would
be better, because mis-estimating 1 row vs. 100,000 rows is probably
worse than mis-estimating 1M rows vs. 2M rows.
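
With the virtual_mcv view sketched earlier, that ordering would look like
this (again just an illustration, not a proposal for the actual C code):

    -- rank candidate entries by the proposed "measure of mis-estimate"
    SELECT city_name, street_name, freq, base_freq,
           abs(freq - base_freq) / freq AS mis_estimate
      FROM virtual_mcv
     ORDER BY mis_estimate DESC
     LIMIT 100;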

Of course, this is a challenging problem, for a couple of reasons.

Firstly, picking the most frequent groups is very simple, and it gives
us additional information about the largest groups (which may be useful
elsewhere, e.g. for the incremental sort patch).

Secondly, if the correlated combinations are less frequent, how reliably
can we even estimate their frequency from a sample? The combination in
the example query is ~0.02% of the data, so how likely is it to be
sampled at all?
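
For reference, with the default statistics target (100) ANALYZE samples
300 * 100 = 30000 rows, so the expected number of sampled rows for this
particular combination is only about 8:

    -- 779 occurrences in ~2.9M rows, sample of 30000 rows
    SELECT 30000 * 779 / 2924625.0 AS expected_rows_in_sample;  -- ~8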

Alternatively, it's possible to increase the statistics target to make
the sample larger, but that also keeps more entries in the MCV list,
including the correlated ones. So maybe the best thing we can do is to
just rely on that?
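
For completeness, the way to do that for this table is to raise the
per-column statistics targets (ANALYZE sizes its sample based on the
largest target among the analyzed columns); the value 1000 is just an
example:

    ALTER TABLE addresses ALTER COLUMN city_name   SET STATISTICS 1000;
    ALTER TABLE addresses ALTER COLUMN street_name SET STATISTICS 1000;
    ANALYZE addresses;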


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

