Query planner selects “cheapest” query plan based on its cost estimation. But it’s done with many rough assumptions. This is why the estimated cost could be inadequate to real execution cost. One possibility is to improve the cost estimate mechanism itself by adding features like multivariate statistics. Another possibility is to use query execution feedback: see how estimated parameter values differ from actual parameter values. We can apply machine learning techniques to improve the cost estimates using this feedback, so DBMS would learn on its own mistakes.
We’ve already done this in a simple case, and further work is planned in the following directions:
- Extend implemented model to cover more use cases,
- Provide the infrastructure necessary to make our machine learning an extension.
Currently, query planning strictly precedes query execution. Sometimes it appears to be a serious limitation. When one part of a plan is already executed it could be possible to significantly improve the rest of the plan on the basis of gathered statistics. We can see two cases when this approach could be applied:
- Online reordering of filter expressions. During sequential scan of large table it’s important to do the cheapest and the most selective checks first. However estimated selectivity and cost of filtering are inaccurate, and thus the order of applying filters based on estimates can be not optimal. But filter expressions could be reordered online on the base of statistics of their previous execution.
- Some queries could be divided into sequence of steps when subsequent steps could be replanned on the base of results of previous steps. For instance, suppose that step 1 is a scan of table A, and step 2 is a join of tables A and B. Depending on row count and data distribution from the first step we could choose different join algorithm on the second step.