Thread: Predicting query runtime
Hi everyone,

I'm trying to find a way to predict query runtime (I don't need to be extremely precise). I've been reading some papers about it, and people are using machine learning to do so. For the feature vector, they use what the DBMS's query planner provides, such as operators and their costs. The thing is that I haven't found any work using PostgreSQL, so I'm struggling to adapt it.

My question is whether anyone is aware of work that uses machine learning and PostgreSQL to predict query runtime, or maybe some other method to perform this.

Thank you.

Best regards,

Vinicius Segalin
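For concreteness, the planner information those papers use as features is directly available in PostgreSQL through EXPLAIN (FORMAT JSON). A minimal sketch of turning a plan into operator counts plus the total estimated cost (Python with psycopg2; the connection string and query are placeholders):

    import json
    from collections import Counter

    import psycopg2

    def plan_features(conn, sql):
        """Return (operator counts, total estimated cost) for a query."""
        with conn.cursor() as cur:
            cur.execute("EXPLAIN (FORMAT JSON) " + sql)
            # EXPLAIN returns a single text row holding a JSON document
            doc = json.loads(cur.fetchone()[0])
        plan = doc[0]["Plan"]

        ops = Counter()
        def walk(node):
            ops[node["Node Type"]] += 1
            for child in node.get("Plans", []):
                walk(child)
        walk(plan)
        return ops, plan["Total Cost"]

    conn = psycopg2.connect("dbname=test")  # hypothetical connection string
    ops, cost = plan_features(conn, "SELECT * FROM pgbench_accounts WHERE aid < 1000")
    print(dict(ops), cost)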
On Mon, Sep 12, 2016 at 4:03 PM, Vinicius Segalin <vinisegalin@gmail.com> wrote:
> My question is if anyone is aware of a work that uses machine learning and
> PostgreSQL to predict query runtime, or maybe some other method to perform
> this.

Hi,

I'm not aware of machine-learning techniques to achieve that (and I don't actually believe it's feasible), but you might find this extension particularly useful: https://www.postgresql.org/docs/9.5/static/pgstatstatements.html

Can you share some links to the papers you are referring to (assuming these are publicly available)?

Regards,
--
Alex
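Once pg_stat_statements is loaded (shared_preload_libraries plus CREATE EXTENSION pg_stat_statements), per-statement timings can be pulled with a plain query. A sketch; the column names are the 9.x ones (later releases renamed total_time to total_exec_time), and the connection string is a placeholder:

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # hypothetical connection string
    with conn.cursor() as cur:
        cur.execute("""
            SELECT query, calls, total_time / calls AS mean_ms
            FROM pg_stat_statements
            ORDER BY total_time DESC
            LIMIT 20
        """)
        for query, calls, mean_ms in cur.fetchall():
            print("%10.2f ms avg  x%-6d %s" % (mean_ms, calls, query[:60]))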
On Mon, Sep 12, 2016 at 9:03 AM, Vinicius Segalin <vinisegalin@gmail.com> wrote:
> Hi everyone,
>
> I'm trying to find a way to predict query runtime (I don't need to be
> extremely precise). I've been reading some papers about it, and people are
> using machine learning to do so. For the feature vector, they use what the
> DBMS's query planner provide, such as operators and their cost. The thing is
> that I haven't found any work using PostgreSQL, so I'm struggling to adapt
> it.
> My question is if anyone is aware of a work that uses machine learning and
> PostgreSQL to predict query runtime, or maybe some other method to perform
> this.
Well, postgres estimates the query runtime in the form of an expected
'cost', where the cost is an arbitrary measure based on the time
complexity of the query plan. It shouldn't be too difficult to correlate
estimated cost with actual runtime. A statistical analysis of that
correlation would be incredibly useful work, although generating sample
datasets would be a major challenge.
merlin
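A sketch of that correlation experiment: EXPLAIN (ANALYZE, FORMAT JSON) conveniently reports both the estimated total cost and the measured execution time in one round trip, so fitting a line is a few lines of Python (psycopg2 and numpy assumed; the connection string and workload are placeholders):

    import json

    import numpy as np
    import psycopg2

    def cost_and_time(conn, sql):
        with conn.cursor() as cur:
            cur.execute("EXPLAIN (ANALYZE, FORMAT JSON) " + sql)
            doc = json.loads(cur.fetchone()[0])[0]
        return doc["Plan"]["Total Cost"], doc["Execution Time"]  # cost units, ms

    conn = psycopg2.connect("dbname=test")  # hypothetical connection string
    workload = [
        "SELECT count(*) FROM pgbench_accounts",
        "SELECT abalance FROM pgbench_accounts WHERE aid = 42",
    ]  # substitute a representative sample of your own queries

    pairs = [cost_and_time(conn, q) for q in workload]
    costs, times = map(np.array, zip(*pairs))
    slope, intercept = np.polyfit(costs, times, 1)  # least-squares line
    r = np.corrcoef(costs, times)[0, 1]
    print("time_ms ~= %.4g * cost + %.4g   (r = %.2f)" % (slope, intercept, r))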
On Mon, Sep 12, 2016 at 7:03 AM, Vinicius Segalin <vinisegalin@gmail.com> wrote:
> My question is if anyone is aware of a work that uses machine learning and
> PostgreSQL to predict query runtime, or maybe some other method to perform
> this.

I don't know about machine learning, but if there were some way to get the planner to tell you the predicted cost as a breakdown of how many multiples of each *_cost factor it charged (rather than only a grand total, which is what it does now), then it would be fairly easy to combine that with wall times from log_duration and do a simple linear regression.

I suspect the result would be that seq_page_cost and random_page_cost would have huge uncertainties on them. And since pretty much every query has non-zero predicted values for at least one of those, the huge uncertainties would then pollute all the rest of the fitted values as well. Perhaps that is where the machine learning would come in?

Another issue is that the predicted costs are only meant to choose between different plans, not to predict overall wall time. Some parts of the planner have only one way to do something, and so don't bother to compute a cost for it, as there is no choice to be made. This would leave glaring holes in the estimates (particularly for updates).

But to get that data out would require quite a bit of tedious altering of the planner code, and then you would have to find people willing to run that altered code on real-world databases with a high level of logging to gather the data. (I suspect that gathering data from only toy databases would not be very useful.)

Cheers,

Jeff
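The planner does not expose those per-factor multiples today, so the inputs below are made-up stand-ins, but assuming the modification Jeff describes existed, the fit itself would be a few lines of numpy:

    import numpy as np

    factors = ["seq_page_cost", "random_page_cost", "cpu_tuple_cost",
               "cpu_index_tuple_cost", "cpu_operator_cost"]

    # Stand-in data: X would come from the modified planner (how many units
    # of each factor a plan charged), y from log_duration. Both are invented
    # here purely so the script runs.
    rng = np.random.default_rng(0)
    X = rng.poisson(50, size=(200, len(factors))).astype(float)
    true_ms_per_unit = np.array([0.8, 4.0, 0.01, 0.005, 0.0025])  # made up
    y = X @ true_ms_per_unit + rng.normal(0.0, 5.0, size=200)

    coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    for name, c in zip(factors, coef):
        print("%-22s %.4g ms per unit" % (name, c))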
Hi Vinicius,
At Heap we have non-trivial complexity in our analytical queries, and
some of them can take a long time to complete. We did analyze features
like the query planner's output and our query properties (type,
parameters, complexity) and tried to automatically identify the factors
that contribute the most to the total query time. It turns out that
you don't need machine learning for the basics, but at this
point we were not aiming for predictions yet.

As a spoiler: queries take a long time because they do a lot of IO.
Features like reachback depth and duration (e.g. what period the
analytical query is about) can contribute a lot to the amount of IO
and, thus, to the query time. I have a blog post in my queue about our
analysis, and I would gladly bump its priority if there is interest in
such details.

I'm also curious: if you had a great way to predict the time/cost of
the queries, how would you use it?

Best regards,
Istvan
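One simple way to do this kind of factor analysis without any machine learning: treat each candidate feature as a binary predictor of "slow" and measure how well it predicts. A self-contained sketch (the feature names and threshold are invented):

    def feature_stats(rows, threshold_ms=1000.0):
        """rows: list of (features: dict of name -> bool, runtime_ms)."""
        names = set()
        for feats, _ in rows:
            names.update(feats)
        for name in sorted(names):
            tp = sum(1 for f, t in rows if f.get(name) and t >= threshold_ms)
            fp = sum(1 for f, t in rows if f.get(name) and t < threshold_ms)
            fn = sum(1 for f, t in rows if not f.get(name) and t >= threshold_ms)
            if tp + fp == 0:
                continue  # feature never present
            precision = tp / float(tp + fp)   # = P(slow | feature present)
            recall = tp / float(tp + fn) if tp + fn else 0.0
            print("%-20s P(slow|feat)=%.2f recall=%.2f" % (name, precision, recall))

    feature_stats([
        ({"has_seq_scan": True,  "reaches_back_90d": True},  4200.0),
        ({"has_seq_scan": True,  "reaches_back_90d": False},  300.0),
        ({"has_seq_scan": False, "reaches_back_90d": False},  120.0),
    ])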
I've been looking for this on Postgres too. Does Postgres have something similar to Oracle's v$session_longops? It gives info on the total units of work, units done so far, last update time, time remaining, etc., and I found it valuable for estimating how long a certain query would keep running and whether or not to kill it, if applicable. This should be relatively easy to implement in Postgres too, if it is not available yet?
Thanks,
Patricia
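As far as I know there is no direct v$session_longops equivalent in core PostgreSQL (9.6 adds pg_stat_progress_vacuum, but that covers VACUUM only). The closest cheap substitute is polling pg_stat_activity, which tells you how long a query has been running, though not how much is left. A sketch (psycopg2; connection string is a placeholder):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # hypothetical connection string
    with conn.cursor() as cur:
        cur.execute("""
            SELECT pid, now() - query_start AS elapsed, query
            FROM pg_stat_activity
            WHERE state = 'active'
              AND pid <> pg_backend_pid()
              AND now() - query_start > interval '1 minute'
            ORDER BY elapsed DESC
        """)
        for pid, elapsed, query in cur.fetchall():
            print(pid, elapsed, query[:60])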
On Tue, Sep 13, 2016 at 2:06 AM, Vinicius Segalin <vinisegalin@gmail.com> wrote:
> 2016-09-12 18:22 GMT-03:00 Istvan Soos <istvan.soos@gmail.com>:
>> At Heap we have non-trivial complexity in our analytical queries, and
>> some of them can take a long time to complete. [...]
>
> And how did you do that? Manually analyzing some queries?

In this case, it was automatic analysis and feature discovery. We generated features out of our query parameters, out of the SQL string, and also out of the explain analyze output. For each of these features, we examined P(query is slow | feature is present) and measured its statistical properties (precision, recall, correlations...).

With these we built a decision-tree-based partitioning, where our feature predicates divided the queries into subsets. Such a tree could be used for predictions, or, if we wanted to be fancy, we could use the feature vectors to train a neural network.

Hope this helps for now,
Istvan
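A sketch of that decision-tree step using scikit-learn; the feature names and training rows here are invented stand-ins for what would be extracted from query parameters, the SQL string, and explain analyze output:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    feature_names = ["n_joins", "has_seq_scan", "est_total_cost", "lookback_days"]
    X = np.array([
        [1, 0,   500.0,   7],
        [4, 1, 90000.0,  90],
        [2, 1, 12000.0,  30],
        [0, 0,   120.0,   1],
    ])                          # made-up training examples
    y = np.array([0, 1, 1, 0])  # 1 = query was "slow"

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(export_text(tree, feature_names=feature_names))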
On Tue, Sep 13, 2016 at 2:54 PM, Istvan Soos <istvan.soos@gmail.com> wrote:
> Such a tree could be used for predictions, or if we would like to be
> fancy, we could use the feature vectors to train a neural network.

FYI, please check https://pgconf.ru/2016/89977
I recommend reading this paper (http://www.doc.ic.ac.uk/~nb605/IO%20performance%20modeling%20research/Learning-based%20Query%20Performance%20-%202011.pdf). The authors give a nice classification of different query performance prediction methods and propose their own solution to the problem.
You can also read this one (http://www.vldb.org/pvldb/vol9/p204-leis.pdf) to be warned about possible pitfalls in the PostgreSQL query optimizer. In my opinion, the most unpleasant one is that you often cannot rely on the cardinality estimations PostgreSQL makes for path nodes. Typically, the more complicated the query is, the less reliable the cardinality estimations become. The good news is that the cost model allows query execution time to be predicted precisely enough, given good cardinality estimations.
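To see on your own workload how unreliable the estimates get, you can compare estimated and actual row counts per plan node from EXPLAIN ANALYZE (the "q-error" metric used in the Leis et al. paper above). A sketch with psycopg2; the connection string and query are placeholders:

    import json

    import psycopg2

    def q_errors(conn, sql):
        """Per-node q-error = max(est/actual, actual/est), worst first."""
        with conn.cursor() as cur:
            cur.execute("EXPLAIN (ANALYZE, FORMAT JSON) " + sql)
            doc = json.loads(cur.fetchone()[0])[0]
        errs = []
        def walk(node):
            est, act = node["Plan Rows"], node["Actual Rows"]
            if est > 0 and act > 0:  # note: Actual Rows is averaged over loops
                errs.append((max(est / float(act), act / float(est)),
                             node["Node Type"]))
            for child in node.get("Plans", []):
                walk(child)
        walk(doc["Plan"])
        return sorted(errs, reverse=True)

    conn = psycopg2.connect("dbname=test")  # hypothetical connection string
    query = ("SELECT count(*) FROM pgbench_accounts a "
             "JOIN pgbench_branches b ON a.bid = b.bid")  # substitute your own
    for err, op in q_errors(conn, query)[:5]:
        print("q-error %8.1f at %s" % (err, op))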
In this paper (http://pages.cs.wisc.edu/~wentaowu/papers/prediction-full.pdf) there is no machine learning. Nevertheless, you may find it interesting: it contains a good description of the PostgreSQL cost model and a method for automatic cost calibration (similar to the one proposed by Jeff in this thread).
The issue with the calibration is as follows: the multipliers for each *_cost factor are not provided, or even directly computed, in PostgreSQL for the majority of path nodes. The typical form of the computation is not, for example, total_cost = 10 * seq_page_cost + 25 * random_page_cost, but rather total_cost = 10 * (seq_page_cost + 2 * random_page_cost) + 10 * (random_page_cost / 2). Mathematically these formulas are equivalent, but in practice you will spend more time and write more code to extract the multipliers in the second case.
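A toy check of that example (sympy here is just to make the point concrete): the two formulas collapse to the same multipliers once expanded, but in the planner the arithmetic happens numerically, so the multipliers never materialize anywhere you can read them:

    import sympy as sp

    seq, rand = sp.symbols("seq_page_cost random_page_cost")
    direct = 10 * seq + 25 * rand
    nested = 10 * (seq + 2 * rand) + 10 * (rand / 2)

    assert sp.simplify(direct - nested) == 0           # algebraically identical
    print(sp.Poly(sp.expand(nested), seq, rand).coeffs())  # -> [10, 25]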
In the above paper the authors decided to calibrate the costs using only those nodes for which the computations are not very complicated and, therefore, the multipliers can be extracted relatively easily. Anyway, the cost models live in src/backend/optimizer/path/costsize.c, and you have to get inside that code somehow to obtain the extra information.
As for me, this paper (http://2014.eswc-conferences.org/sites/default/files/eswc2014pd_submission_30.pdf) is interesting mostly for its introduction of graph edit distance as a distance on the space of paths. That matters because some machine learning methods do not require feature representations of objects, only a valid distance function on each pair of them.
This paper (http://www.vldb.org/pvldb/vol6/p925-wu.pdf) is about predicting query execution time for concurrent workloads and also involves machine learning.
I hope the listed papers will be useful for your master's thesis.
The post related to the talk (https://pgconf.ru/en/2016/89977) is available here (http://tigvarts.livejournal.com/691.html). Please note that the post was published in February 2016, so the information in it is partially outdated. Some of the main principles changed during my work; some issues for further research are closed now, while other issues have appeared. I expect to have a paper on my current results completed in early October.
------
Oleg Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company