Choosing values for multivariate MCV lists - Mailing list pgsql-hackers

From:    Tomas Vondra
Subject: Choosing values for multivariate MCV lists
Msg-id:  20190618205920.qtlzcu73whfpfqne@development
List:    pgsql-hackers
Hi,

The current implementation of multi-column MCV lists (added in this cycle) uses a fairly simple algorithm to pick the combinations to include in the MCV list. We just compute a minimum number of occurrences, and then include all entries sampled more often than that. See get_mincount_for_mcv_list().

By coincidence I received a real-world data set where this does not work particularly well, so I'm wondering if we can improve this somehow. It does not make the estimates worse than without the MCV lists, so I don't think we have an issue, but I wonder if we could do better.

The data set is very simple - it's a single table of valid addresses (city/street name and street number):

CREATE TABLE addresses (
    city_name    text,
    street_name  text,
    street_no    int
);

Data with Czech addresses are available here (it's ~9.3MB, so I'm not attaching it directly):

https://drive.google.com/file/d/1EiZGsk6s5hqzZrL7t5KiaqByJnBsndfA

and attached is a SQL script I used to compute a "virtual" MCV list.

There are about 3M records, so let's query for a street name in one of the cities:

EXPLAIN ANALYZE SELECT * FROM addresses
 WHERE city_name = 'Most' AND street_name = 'Pionýrů';

                                 QUERY PLAN
----------------------------------------------------------------------------
 Seq Scan on addresses  (cost=0.00..62645.38 rows=1 width=25) (actual time=117.225..204.291 rows=779 loops=1)
   Filter: ((city_name = 'Most'::text) AND (street_name = 'Pionýrů'::text))
   Rows Removed by Filter: 2923846
 Planning Time: 0.065 ms
 Execution Time: 204.525 ms
(5 rows)

It's true 779 rows is only a tiny fraction of the data set (~0.025%), but this data set is a bit weird in one other aspect - about half of the records have an empty (NULL) street_name, i.e. they are just city + number. Small villages simply don't have streets at all, and large cities probably started as small villages, so they have such addresses too.

This however means that by choosing the MCV entries solely based on the number of occurrences in the sample, we end up with MCV lists where the vast majority of entries have a NULL street name. That's why we get such a poor estimate in the example query, despite the fact that the city/street combination is the most frequent one in the data set (among those with a non-NULL street name).

The other weird thing is that the frequency of NULL street names is fairly uniform across the whole data set. In total about 50% of the addresses match that, and for individual cities it's generally between 25% and 100%, so the estimate is less than 2x off in those cases. But for addresses with non-NULL street names the situation is very different - some street names are unique to a single city, etc.

Overall, this means we end up with an MCV list whose entries represent the mostly-uniform part of the data set, instead of preferring the entries that are truly skewed.

So I'm wondering if we could/should rethink the algorithm to look at both the frequency and the base_frequency, and maybe pick entries based on their ratio, or something like that. For example, we might sort the entries by

    abs(freq - base_freq) / freq

which seems like a reasonable "measure of mis-estimate". Or maybe we should look just at abs(freq - base_freq)? I think the first option is better, because (1 row vs. 100,000 rows) is probably worse than (1M rows vs. 2M rows).
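To illustrate what I mean, here is a rough sketch in SQL - this is not the mcv.c code (it runs on the whole table instead of a sample, and the column names come from the example above) - ranking the combinations by that measure, with base_freq computed as the product of the per-column frequencies:

-- rank city/street combinations by abs(freq - base_freq) / freq;
-- IS NOT DISTINCT FROM is used so the NULL street_name groups match too
WITH total AS (
    SELECT count(*)::numeric AS cnt FROM addresses
),
city_freq AS (
    SELECT city_name, count(*) / (SELECT cnt FROM total) AS freq
    FROM addresses GROUP BY city_name
),
street_freq AS (
    SELECT street_name, count(*) / (SELECT cnt FROM total) AS freq
    FROM addresses GROUP BY street_name
),
combo AS (
    SELECT city_name, street_name,
           count(*) / (SELECT cnt FROM total) AS freq
    FROM addresses GROUP BY city_name, street_name
)
SELECT c.city_name, c.street_name, c.freq,
       cf.freq * sf.freq AS base_freq,
       abs(c.freq - cf.freq * sf.freq) / c.freq AS mis_estimate
FROM combo c
JOIN city_freq cf ON cf.city_name IS NOT DISTINCT FROM c.city_name
JOIN street_freq sf ON sf.street_name IS NOT DISTINCT FROM c.street_name
ORDER BY mis_estimate DESC
LIMIT 100;

With this ordering the 'Most' / 'Pionýrů' combination should rank well above the mostly-uniform NULL-street entries, although ranking by relative error alone obviously favors rare groups, so some frequency floor would still be needed.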
Of course, this is a challenging problem, for a couple of reasons.

Firstly, picking simply the most frequent groups is very simple, and it gives us additional information about the largest group (which may be useful elsewhere, e.g. in the incremental sort patch).

Secondly, if the correlated combinations are less frequent, how reliably can we even estimate their frequency from a sample? The combination in the example query was ~0.02% of the data, so how likely is it to be sampled (see the rough calculation below)? Alternatively, it's possible to increase the statistics target to make the sample larger, but that also keeps more entries in the MCV list, including the correlated ones. So maybe the best thing we can do is to just rely on that?
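To put the sampling question in perspective, a back-of-the-envelope check, assuming the cutoff formula

    mincount = n * (N - n) / (N - n + 0.04 * n * (N - 1))

from get_mincount_for_mcv_list() (n = sample size, N = table size), and the default statistics target of 100, i.e. a 300 * 100 = 30,000 row sample:

-- rough numbers: n = 30000 (default sample), N from the EXPLAIN output above
SELECT sample_n * (total_n - sample_n)
       / (total_n - sample_n + 0.04 * sample_n * (total_n - 1)) AS mincount,
       sample_n * 779 / total_n AS expected_occurrences
FROM (VALUES (30000.0, 2924625.0)) AS t(sample_n, total_n);

This gives a mincount of roughly 25, while the example combination (779 rows out of ~2.9M) is expected only about 8 times in the sample - so even when it is sampled, it has little chance of clearing the cutoff.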
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services