Re: multivariate statistics (v19) - Mailing list pgsql-hackers
From: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Subject: Re: multivariate statistics (v19)
Msg-id: 61e71067-9461-d785-b4a6-6e8a08996d5f@2ndquadrant.com
In response to: Re: multivariate statistics (v19) (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: [HACKERS] multivariate statistics (v19)
List: pgsql-hackers
Hi,

Attached is v20 of the multivariate statistics patch series, doing mostly the changes outlined in the preceding e-mail from October 11.

The patch series currently has these parts:

* 0001 : (FIX) teach pull_varno about RestrictInfo
* 0002 : (PATCH) shared infrastructure and ndistinct coefficients
* 0003 : (PATCH) functional dependencies (only the ANALYZE part)
* 0004 : (PATCH) selectivity estimation using functional dependencies
* 0005 : (PATCH) multivariate MCV lists
* 0006 : (PATCH) multivariate histograms
* 0007 : (WIP) selectivity estimation using ndistinct coefficients
* 0008 : (WIP) use multiple statistics for estimation
* 0009 : (WIP) psql tab completion basics

Let me elaborate on the main changes in this version:

1) rework CREATE STATISTICS to what Dean Rasheed proposed in [1]
-----------------------------------------------------------------------

    CREATE STATISTICS name WITH (options) ON (columns) FROM table

This allows adding support for statistics on joins, expressions referencing multiple tables, and partial statistics (with WHERE predicates, similar to indexes). Although those things are not implemented (and I don't know if/when that will happen), it's good that the syntax supports them.

I've been thinking about using "CREATE STATISTIC" instead, but I decided to stick with "STATISTICS" for two reasons. Firstly, it's possible to create multiple statistics in a single command, for example by using WITH (mcv,histogram). And secondly, we already have "ALTER TABLE ... SET STATISTICS n" (although that tweaks the statistics target for a column, not the statistics on the column).

2) no changes to catalog names
-----------------------------------------------------------------------

Clearly, naming things is one of the hardest things in computer science. I don't have a good idea what names would be better than the current ones. In any case, this is fairly trivial to change.
3) special data types for statistics
-----------------------------------------------------------------------

Heikki proposed [2] to invent a new data type, similar to pg_node_tree. I do agree that storing the stats in plain bytea (i.e. the catalog having bytea columns) was not particularly convenient, but I'm not sure how much of pg_node_tree Heikki wanted to copy.

In particular, I'm not sure whether Heikki's idea was to store all the statistics together in a single Datum, serialized into a text string (similar to pg_node_tree). I don't think that would be a good idea, as the statistics may be quite large and complex, and deserializing them from text format would be quite expensive. For pg_node_tree that's not a major issue because the values are usually fairly small. Similarly, packing everything into a single datum would force the planner to parse/unpack everything, even if it needs just a small piece (e.g. the ndistinct coefficients, but not the histograms).

So I've decided to invent new data types, one for each statistic type:

* pg_ndistinct
* pg_dependencies
* pg_mcv_list
* pg_histogram

Similarly to pg_node_tree, those data types only support output, i.e. both the 'recv' and 'in' functions do elog(ERROR). But while pg_node_tree is stored as text, the new data types are still bytea. I believe this is a good solution, and it allows casting the data types to text easily, as that simply calls the out function.

The statistics however do not store attnums in the bytea, just indexes into pg_mv_statistic.stakeys. That means the out functions can't print column names in the output, or values (because without the attnum we don't know the type, and thus can't look up the proper out function). I don't think there's a good solution for that (I was thinking about storing the attnums/typeoid in the statistics itself, but that seems fairly ugly). Overall, I'm quite happy with these new data types.
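To illustrate the "don't force the planner to unpack everything" point with a loose Python analogy (the blob keys and helper below are made up for illustration, not the catalog's actual column names or C representation): because each statistic type lives in its own value, an estimate that needs only the ndistinct coefficients never deserializes the MCV list or histogram.

```python
import pickle

# Made-up stand-ins for per-type serialized values: each statistic type
# is stored as its own (bytea-like) blob, not one combined datum.
stored = {
    "ndistinct": pickle.dumps({("a", "b"): 100}),   # ndistinct coefficients
    "mcv": pickle.dumps([((1, 1), 0.25)]),          # MCV list
}

def load_stat(kind):
    """Deserialize only the requested statistic type (None if not built)."""
    blob = stored.get(kind)
    return pickle.loads(blob) if blob is not None else None

# A GROUP BY estimate touches only the ndistinct blob; the MCV blob
# (or a histogram, had one been built) stays untouched.
ndistinct = load_stat("ndistinct")
```
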
4) replace functional dependencies with ndistinct (in the first patch)
-----------------------------------------------------------------------

As the ndistinct coefficients are simpler than functional dependencies, I've decided to use them in the first patch in the series, which implements the shared infrastructure. This does not mean throwing away functional dependencies entirely, just moving them to a later patch.

5) rework of ndistinct coefficients
-----------------------------------------------------------------------

The ndistinct coefficients were also significantly reworked. Instead of computing and storing the value for the exact combination of attributes, the new version computes ndistinct for all combinations of attributes. So for example with

    CREATE STATISTICS x ON (a,b,c)

the old patch only computed ndistinct on (a,b,c), while the new patch computes ndistinct on {(a,b,c), (a,b), (a,c), (b,c)}. This makes it much more powerful.

The first patch (0002) only uses this in estimate_num_groups to improve GROUP BY estimates. A later patch (0007) shows how it might be used for selectivity estimation, but it's a very early WIP at this point.

Also, I'm not sure we should use ndistinct coefficients this way, because of the "homogeneity" assumption, similarly to functional dependencies. Functional dependencies are used only for selectivity estimation, so it's quite easy not to use them if they don't work for that purpose. But ndistinct coefficients are also used for GROUP BY estimation, where the homogeneity assumption is not such a big deal. So I expect people to add ndistinct statistics, get better GROUP BY estimates but sometimes worse selectivity estimates - not great, I guess.

But the selectivity estimation using ndistinct coefficients is very simple right now - in particular it does not use the per-clause selectivities at all, it simply assumes the whole selectivity is 1/ndistinct for the combination of columns.
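A rough Python sketch of what "ndistinct for all combinations" means (toy data and a made-up helper name; the patch of course does this in C during ANALYZE, over a sample rather than the full table):

```python
from itertools import combinations

def ndistinct_all_combinations(rows, columns):
    """Count distinct value tuples for every subset of 2+ columns.

    rows: list of dicts; columns: e.g. ["a", "b", "c"].
    Returns a dict like {("a", "b"): 4, ("a", "b", "c"): 12, ...}.
    """
    stats = {}
    for k in range(2, len(columns) + 1):
        for combo in combinations(columns, k):
            stats[combo] = len({tuple(r[c] for c in combo) for r in rows})
    return stats

# Toy table: b is fully determined by a, while c is independent.
rows = [{"a": i % 4, "b": (i % 4) * 10, "c": i % 3} for i in range(100)]
stats = ndistinct_all_combinations(rows, ["a", "b", "c"])

# GROUP BY a, b can now be estimated as stats[("a", "b")] = 4 groups,
# instead of multiplying per-column estimates (4 * 4 = 16).
# The naive WIP selectivity estimate for WHERE a = ? AND b = ? is then
sel = 1.0 / stats[("a", "b")]   # ignores the per-clause selectivities
```
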
Functional dependencies use this formula to combine the selectivities:

    P(a,b) = P(a) * [f + (1-f) * P(b)]

so maybe there's something similar for ndistinct coefficients? I mean, let's say we know ndistinct(a), ndistinct(b), ndistinct(a,b), P(a) and P(b). How do we compute P(a,b)?

6) rework functional dependencies
-----------------------------------------------------------------------

Based on Dean's feedback, I've reworked functional dependencies to use a continuous "degree" of validity (instead of the true/false approach, which resulted in sudden changes in behavior). This significantly reduced the amount of code, because the old patch tried to identify transitive dependencies (to minimize time and storage requirements). Switching to a continuous degree makes this impossible (or at least far more complicated), so I've simply ripped all of this out. This means the statistics will be larger and ANALYZE will take more time, but the differences are fairly small in practice, and the estimation actually seems to work better.

7) MCV and histogram changes
-----------------------------------------------------------------------

Those statistics types are mostly unchanged, except for a few minor bug fixes and the removal of the max_mcv_items and max_buckets options. Those options were meant to allow users to limit the size of the statistics, but the implementation was ignoring them so far. So I've ripped them out; if needed, we may reintroduce them later.

8) no more (elaborate) combinations of statistics
-----------------------------------------------------------------------

I've ripped out the patch that combined multiple statistics in a very elaborate way - it was overly complex, possibly wrong, but most importantly it distracted people from the preceding patches. I've instead replaced it with a very simple approach that allows using multiple statistics on different subsets of the clause list.
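This "different subsets" approach can be sketched as a simple greedy loop (a Python illustration with made-up names, not the actual clausesel.c code):

```python
def assign_clauses(clause_columns, statistics):
    """Greedily match equality-clause columns to covering statistics.

    clause_columns: columns referenced by the clauses, e.g. {"a","b","c","d"}.
    statistics: list of column tuples, one per multivariate statistic.
    Returns (stats_used, leftover); leftover columns fall back to the
    usual per-column estimates.
    """
    remaining = set(clause_columns)
    used = []
    for stat_cols in statistics:
        covered = remaining & set(stat_cols)
        if len(covered) >= 2:      # a statistic only helps for 2+ columns
            used.append(stat_cols)
            remaining -= covered
    return used, remaining

used, leftover = assign_clauses({"a", "b", "c", "d"},
                                [("a", "b"), ("c", "d")])
# Both statistics get used; no column is left to per-column estimation.
```
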
So for example WHERE (a=1) AND (b=1) AND (c=1) AND (d=1) may benefit from two statistics, one on (a,b) and a second on (c,d). It's a very simple approach, but it does the trick for many cases and is better than the "single statistics" limitation. The 0008 patch is actually very simple, essentially adding just a loop into the code blocks, so I think it's quite likely this will get merged into the preceding patches.

9) reduce table sizes used in regression tests
-----------------------------------------------------------------------

Some of the regression tests used quite large tables (with up to 1M rows), which had two issues - long runtimes and instability (because the ANALYZE sample is only 30k rows, so there were sometimes small changes due to picking a different sample). I've limited the table sizes to 30k rows.

10) open / unsolved questions
-----------------------------------------------------------------------

The main open question is still whether clausesel.c is the best place to do all the heavy lifting (particularly matching clauses and statistics, and deciding which statistics to use). I suspect some of that should be done elsewhere (earlier in the planning), enriching the query tree somehow. Then clausesel.c would "only" compute the estimates, and it would also allow showing the info in EXPLAIN.

I'm also not particularly happy with how the changes in clauselist_selectivity look right now - there are three almost identical blocks, so this would deserve some refactoring. But I'd like to get some feedback first.

regards

[1] https://www.postgresql.org/message-id/CAEZATCUtGR+U5+QTwjHhe9rLG2nguEysHQ5NaqcK=VbJ78VQFA@mail.gmail.com
[2] https://www.postgresql.org/message-id/1c7e4e63-769b-f8ce-f245-85ef4f59fcba%40iki.fi

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services