Thread: Predicting query runtime

Predicting query runtime

From

Vinicius Segalin

Date:

12 September 2016, 14:03:42

Hi everyone,

I'm trying to find a way to predict query runtime (I don't need to be extremely precise). I've been reading some papers about it, and people are using machine learning to do so. For the feature vector, they use what the DBMS's query planner provide, such as operators and their cost. The thing is that I haven't found any work using PostgreSQL, so I'm struggling to adapt it.

My question is if anyone is aware of a work that uses machine learning and PostgreSQL to predict query runtime, or maybe some other method to perform this.

Thank you.

Best regards,

Vinicius Segalin

Re: Predicting query runtime

From

Oleksandr Shulgin

Date:

12 September 2016, 15:08:49

On Mon, Sep 12, 2016 at 4:03 PM, Vinicius Segalin <vinisegalin@gmail.com> wrote:

Hi everyone,

I'm trying to find a way to predict query runtime (I don't need to be extremely precise). I've been reading some papers about it, and people are using machine learning to do so. For the feature vector, they use what the DBMS's query planner provide, such as operators and their cost. The thing is that I haven't found any work using PostgreSQL, so I'm struggling to adapt it.
My question is if anyone is aware of a work that uses machine learning and PostgreSQL to predict query runtime, or maybe some other method to perform this.

Hi,

I'm not aware of machine-learning techniques to achieve that (and I don't actually believe it's feasible), but there you might find this extension particularly useful: https://www.postgresql.org/docs/9.5/static/pgstatstatements.html

Can you share some links to the papers you are referring to (assuming these are publicly available)?

Regards,

--
Alex

Re: Predicting query runtime

From

Vinicius Segalin

Date:

12 September 2016, 15:59:36

2016-09-12 12:08 GMT-03:00 Oleksandr Shulgin <oleksandr.shulgin@zalando.de>:

On Mon, Sep 12, 2016 at 4:03 PM, Vinicius Segalin <vinisegalin@gmail.com> wrote:
Hi everyone,

I'm trying to find a way to predict query runtime (I don't need to be extremely precise). I've been reading some papers about it, and people are using machine learning to do so. For the feature vector, they use what the DBMS's query planner provide, such as operators and their cost. The thing is that I haven't found any work using PostgreSQL, so I'm struggling to adapt it.
My question is if anyone is aware of a work that uses machine learning and PostgreSQL to predict query runtime, or maybe some other method to perform this.

Hi,

I'm not aware of machine-learning techniques to achieve that (and I don't actually believe it's feasible), but there you might find this extension particularly useful: https://www.postgresql.org/docs/9.5/static/pgstatstatements.html

I'll read about it and see how it can help me. Thank you

Can you share some links to the papers you are referring to (assuming these are publicly available)?

I don't have them all now, but these are some of them:

http://dl.acm.org/citation.cfm?id=1845166

https://people.eecs.berkeley.edu/~jordan/papers/ganapathi-etal-icde09.pdf

Re: Predicting query runtime

From

Merlin Moncure

Date:

12 September 2016, 18:16:21

On Mon, Sep 12, 2016 at 9:03 AM, Vinicius Segalin <vinisegalin@gmail.com> wrote:
> Hi everyone,
>
> I'm trying to find a way to predict query runtime (I don't need to be
> extremely precise). I've been reading some papers about it, and people are
> using machine learning to do so. For the feature vector, they use what the
> DBMS's query planner provide, such as operators and their cost. The thing is
> that I haven't found any work using PostgreSQL, so I'm struggling to adapt
> it.
> My question is if anyone is aware of a work that uses machine learning and
> PostgreSQL to predict query runtime, or maybe some other method to perform
> this.

Well, postgres estimates the query runtime in the form of an expected
'cost', where the cost is an arbitrary measure based on time
complexity of query plan.   It shouldn't be too difficult to correlate
estimated cost to runtime cost.  A statistical analysis of that
correlation would be incredibly useful work although generating sample
datasets would be a major challenge.

merlin

Re: Predicting query runtime

From

Jeff Janes

Date:

12 September 2016, 20:01:50

On Mon, Sep 12, 2016 at 7:03 AM, Vinicius Segalin <vinisegalin@gmail.com> wrote:

Hi everyone,

I'm trying to find a way to predict query runtime (I don't need to be extremely precise). I've been reading some papers about it, and people are using machine learning to do so. For the feature vector, they use what the DBMS's query planner provide, such as operators and their cost. The thing is that I haven't found any work using PostgreSQL, so I'm struggling to adapt it.
My question is if anyone is aware of a work that uses machine learning and PostgreSQL to predict query runtime, or maybe some other method to perform this.

I don't know about machine learning, but if there were some way to get the planner to tell you predicted cost in terms of a breakdown of how many multiples of each *_cost factor (rather than only a grand total which is what it does now), then it would be fairly easy to combine that with wall times from log_duration and do a simple linear regression.

I suspect the result would be that seq_page_cost and random_page_cost would have huge uncertainties on them. And since pretty much every query has non-zero predicted values for at least one of those, the huge uncertainties would then pollute all the rest of the fitted values as well. Perhaps that is where the machine learning would come in?

Another issue is the predicted costs are only meant to choose between different plans, not to predict overall wall time. Some parts of the planner only have one way to do something, and so doesn't bother to compute a cost for that as there is no choice to be made. This would leave glaring holes in the estimates (particularly for updates)

But to get that data out would require quite a bit of tedious altering of the planner code, and then you would have to find people willing to run that altered code on real world databases with a high level of logging to gather the data. (I suspect that gathering data from only toy databases would not be very useful).

Cheers,

Jeff

Re: Predicting query runtime

From

Istvan Soos

Date:

12 September 2016, 21:23:20

Hi Vinicius,

At Heap we have non-trivial complexity in our analytical queries, and
some of them can take a long time to complete. We did analyze features
like the query planner's output, our query properties (type,
parameters, complexity) and tried to automatically identify factors
that contribute the most into the total query time. It turns out that
you don't need to use machine learning for the basics, but at this
point we were not aiming for predictions yet.

As a spoiler: queries take long time because they do a lot of IO.
Features like reachback depth and duration (e.g. what period is the
analytical query about) can contribute a lot to the amount of IO,
thus, the query time. I have a blog post in my queue about our
analysis, would gladly bump its priority if there is interest in such
details.

I'm also curious: if you had a great way to predict the time/cost of
the queries, how would you use it?

Best regards,
  Istvan

On Mon, Sep 12, 2016 at 4:03 PM, Vinicius Segalin <vinisegalin@gmail.com> wrote:
> Hi everyone,
>
> I'm trying to find a way to predict query runtime (I don't need to be
> extremely precise). I've been reading some papers about it, and people are
> using machine learning to do so. For the feature vector, they use what the
> DBMS's query planner provide, such as operators and their cost. The thing is
> that I haven't found any work using PostgreSQL, so I'm struggling to adapt
> it.
> My question is if anyone is aware of a work that uses machine learning and
> PostgreSQL to predict query runtime, or maybe some other method to perform
> this.
>
> Thank you.
>
> Best regards,
>
> Vinicius Segalin

Re: Predicting query runtime

From

Vinicius Segalin

Date:

12 September 2016, 23:49:01

2016-09-12 15:16 GMT-03:00 Merlin Moncure <mmoncure@gmail.com>:

On Mon, Sep 12, 2016 at 9:03 AM, Vinicius Segalin <vinisegalin@gmail.com> wrote:
> Hi everyone,
>
> I'm trying to find a way to predict query runtime (I don't need to be
> extremely precise). I've been reading some papers about it, and people are
> using machine learning to do so. For the feature vector, they use what the
> DBMS's query planner provide, such as operators and their cost. The thing is
> that I haven't found any work using PostgreSQL, so I'm struggling to adapt
> it.
> My question is if anyone is aware of a work that uses machine learning and
> PostgreSQL to predict query runtime, or maybe some other method to perform
> this.

Well, postgres estimates the query runtime in the form of an expected
'cost', where the cost is an arbitrary measure based on time
complexity of query plan. It shouldn't be too difficult to correlate
estimated cost to runtime cost.

That's what I though too. At least it makes sense, I guess. But sometimes logic doesn't work, so I think only giving it a try will say.

A statistical analysis of that
correlation would be incredibly useful work although generating sample
datasets would be a major challenge.

merlin

Indeed. I'm using TPC-B along with pgbench to have some data to test (while I don't have real data), but I'm having a hard time creating queries that give me (very) different performance results so I can train my ML algorithm.

Re: Predicting query runtime

From

Vinicius Segalin

Date:

12 September 2016, 23:54:37

2016-09-12 17:01 GMT-03:00 Jeff Janes <jeff.janes@gmail.com>:

On Mon, Sep 12, 2016 at 7:03 AM, Vinicius Segalin <vinisegalin@gmail.com> wrote:
Hi everyone,

I'm trying to find a way to predict query runtime (I don't need to be extremely precise). I've been reading some papers about it, and people are using machine learning to do so. For the feature vector, they use what the DBMS's query planner provide, such as operators and their cost. The thing is that I haven't found any work using PostgreSQL, so I'm struggling to adapt it.
My question is if anyone is aware of a work that uses machine learning and PostgreSQL to predict query runtime, or maybe some other method to perform this.

I don't know about machine learning, but if there were some way to get the planner to tell you predicted cost in terms of a breakdown of how many multiples of each *_cost factor (rather than only a grand total which is what it does now), then it would be fairly easy to combine that with wall times from log_duration and do a simple linear regression.

I suspect the result would be that seq_page_cost and random_page_cost would have huge uncertainties on them. And since pretty much every query has non-zero predicted values for at least one of those, the huge uncertainties would then pollute all the rest of the fitted values as well. Perhaps that is where the machine learning would come in?

Another issue is the predicted costs are only meant to choose between different plans, not to predict overall wall time. Some parts of the planner only have one way to do something, and so doesn't bother to compute a cost for that as there is no choice to be made. This would leave glaring holes in the estimates (particularly for updates)

But to get that data out would require quite a bit of tedious altering of the planner code, and then you would have to find people willing to run that altered code on real world databases with a high level of logging to gather the data. (I suspect that gathering data from only toy databases would not be very useful).

Cheers,

Jeff

Modifying the planner is way too complex for me at this time, so I really can't go into that kind of solution, but I can try to use as much as the planner gives me today, make the best out of it and hope it's enough to give me some satisfactory results.

Re: Predicting query runtime

From

Vinicius Segalin

Date:

13 September 2016, 00:07:01

2016-09-12 18:22 GMT-03:00 Istvan Soos <istvan.soos@gmail.com>:

Hi Vinicius,

At Heap we have non-trivial complexity in our analytical queries, and
some of them can take a long time to complete. We did analyze features
like the query planner's output, our query properties (type,
parameters, complexity) and tried to automatically identify factors
that contribute the most into the total query time. It turns out that
you don't need to use machine learning for the basics, but at this
point we were not aiming for predictions yet.

And how did you do that? Manually analyzing some queries?

But I think, as you said, it wouldn't apply for predictions, but instead for making long queries run faster, right?

As a spoiler: queries take long time because they do a lot of IO.
Features like reachback depth and duration (e.g. what period is the
analytical query about) can contribute a lot to the amount of IO,
thus, the query time. I have a blog post in my queue about our
analysis, would gladly bump its priority if there is interest in such
details.

If it's not too much work, I would like to have some details on your process. It looks it's not exactly what I'm trying to do, but would certainly help me with my work.

I'm also curious: if you had a great way to predict the time/cost of
the queries, how would you use it?

I'm working on something for my master degree (it's the idea, and I really hope I can make it possible) where I'll help the user choosing the resources for the database that will give him the best performance (or at least the performance he thinks it's good enough). So the idea would be to train each machine (with different resources) and then be able to predict for an specific query what the performance would be.

Thank you all for the answers so far. I hope we can clear my mind about this issue.

Re: Predicting query runtime

From

"Hu, Patricia"

Date:

13 September 2016, 14:20:30

I’ve been looking for this on postgres too. Does Postgres have something similar to Oracle’s v$session_longops? It gives info on total unit of work, units done so far, last update time, and time remaining etc, and I found it valuable in providing an estimate to how long a certain query would keep running and whether or not to kill it if applicable. This should be relatively easy to implement in postgres too if it is not available yet?

Thanks,

Patricia

From: Oleksandr Shulgin [mailto:oleksandr.shulgin@zalando.de]
Sent: Monday, September 12, 2016 11:08 AM
To: Vinicius Segalin
Cc: pgsql general
Subject: Re: Predicting query runtime

On Mon, Sep 12, 2016 at 4:03 PM, Vinicius Segalin <vinisegalin@gmail.com> wrote:

Hi everyone,

My question is if anyone is aware of a work that uses machine learning and PostgreSQL to predict query runtime, or maybe some other method to perform this.

Hi,

Can you share some links to the papers you are referring to (assuming these are publicly available)?

Regards,

--
Alex

Confidentiality Notice:: This email, including attachments, may include non-public, proprietary, confidential or legally privileged information. If you are not an intended recipient or an authorized agent of an intended recipient, you are hereby notified that any dissemination, distribution or copying of the information contained in or transmitted with this e-mail is unauthorized and strictly prohibited. If you have received this email in error, please notify the sender by replying to this message and permanently delete this e-mail, its attachments, and any copies of it immediately. You should not retain, copy or use this e-mail or any attachment for any purpose, nor disclose all or any part of the contents to any other person. Thank you.

Re: Predicting query runtime

From

Istvan Soos

Date:

13 September 2016, 19:54:40

On Tue, Sep 13, 2016 at 2:06 AM, Vinicius Segalin <vinisegalin@gmail.com> wrote:
> 2016-09-12 18:22 GMT-03:00 Istvan Soos <istvan.soos@gmail.com>:
>> At Heap we have non-trivial complexity in our analytical queries, and
>> some of them can take a long time to complete. We did analyze features
>> like the query planner's output, our query properties (type,
>> parameters, complexity) and tried to automatically identify factors
>> that contribute the most into the total query time. It turns out that
>> you don't need to use machine learning for the basics, but at this
>> point we were not aiming for predictions yet.
>
> And how did you do that? Manually analyzing some queries?

In this case, it was automatic analysis and feature discovery. We were
generating features out of our query parameters, out of the SQL
string, and also out of the explain analyze output. For each of these
features, we have examined the P(query is slow | feature is present),
and measured its statistical properties (precision, recall,
correlations...).

With these we have built a decision tree-based partitioning, where our
feature-predicates divided the queries into subsets. Such a tree could
be used for predictions, or if we would like to be fancy, we could use
the feature vectors to train a neural network.

Hope this helps for now,
  Istvan

Re: Predicting query runtime

From

Oleg Bartunov

Date:

13 September 2016, 20:12:48

On Tue, Sep 13, 2016 at 2:54 PM, Istvan Soos <istvan.soos@gmail.com> wrote:
> On Tue, Sep 13, 2016 at 2:06 AM, Vinicius Segalin <vinisegalin@gmail.com> wrote:
>> 2016-09-12 18:22 GMT-03:00 Istvan Soos <istvan.soos@gmail.com>:
>>> At Heap we have non-trivial complexity in our analytical queries, and
>>> some of them can take a long time to complete. We did analyze features
>>> like the query planner's output, our query properties (type,
>>> parameters, complexity) and tried to automatically identify factors
>>> that contribute the most into the total query time. It turns out that
>>> you don't need to use machine learning for the basics, but at this
>>> point we were not aiming for predictions yet.
>>
>> And how did you do that? Manually analyzing some queries?
>
> In this case, it was automatic analysis and feature discovery. We were
> generating features out of our query parameters, out of the SQL
> string, and also out of the explain analyze output. For each of these
> features, we have examined the P(query is slow | feature is present),
> and measured its statistical properties (precision, recall,
> correlations...).
>
> With these we have built a decision tree-based partitioning, where our
> feature-predicates divided the queries into subsets. Such a tree could
> be used for predictions, or if we would like to be fancy, we could use
> the feature vectors to train a neural network.

FYI, please check https://pgconf.ru/2016/89977

>
> Hope this helps for now,
>   Istvan
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

Re: Predicting query runtime

From

Vinicius Segalin

Date:

13 September 2016, 22:53:24

2016-09-13 17:12 GMT-03:00 Oleg Bartunov <obartunov@gmail.com>:

FYI, please check https://pgconf.ru/2016/89977

Interesting! Was this presentation filmed? Or would you have a post or something else with more details?

Re: Predicting query runtime

From

Oleg Ivanov

Date:

14 September 2016, 10:23:22

Hi Vinicius,

I recommend you to read this (http://www.doc.ic.ac.uk/~nb605/IO%20performance%20modeling%20research/Learning-based%20Query%20Performance%20-%202011.pdf) paper. Authors make a nice classification of different query performance prediction methods and propose their own solution for this problem.

You can also read (http://www.vldb.org/pvldb/vol9/p204-leis.pdf) to be warned about possible pitfalls in PostgreSQL query optimizer. In my opinion, the most unpleasant one is that you often cannot rely on cardinality estimations made by PostgreSQL for path nodes. Typically, the more complicated query is, the less reliable cardinality estimations become. The good news is that cost model allows to predict query execution time precisely enough with good cardinality estimations.

In paper (http://pages.cs.wisc.edu/~wentaowu/papers/prediction-full.pdf) there is no machine learning. Nevertheless, you may find it interesting. It contains good description of PostgreSQL cost model and a method for automatic costs calibration (similar to proposed by Jeff in this thread).
The issue with the calibrating is follows: the multipliers for each *_cost factor are not provided or even directly computed in PostgreSQL for the majority of path nodes. The typical way of computations is not, for example, total_cost = 10 * seq_page_cost + 25 * random_page_cost, but total_cost = 10 * (seq_page_cost + 2 * random_page_cost) + 10 * (random_page_cost / 2). Mathematically these formulas are equivalent, but practically you will spend more time and write more code to extract the multipliers in the second case.
In the above paper authors decided to calibrate costs using only those nodes, for which the computations are not very complicated and, therefore, the multipliers can be extracted relatively easy. Anyway, cost models are available in src/backend/optimizer/path/costsize.c, and you have to get inside it somehow to obtain extra information.

As for me, the paper (http://2014.eswc-conferences.org/sites/default/files/eswc2014pd_submission_30.pdf) is interesting mostly by their introduction of graph editing distance as a distance on the space of paths. It is interesting because some machine learning methods do not require feature representations of objects, but only a valid distance function on each pair of them.

The paper (http://www.vldb.org/pvldb/vol6/p925-wu.pdf) is about predicting query execution time for concurrent workloads and also contains machine learning.

I hope listed papers will be useful for your master's thesis.

The post related to (https://pgconf.ru/en/2016/89977) is available here (http://tigvarts.livejournal.com/691.html). Please note, that this post was published in February 2016, so the information in this post is partially outdated. Some main principles were changed during my work, some issues for further research are closed now, while some other issues appeared. I believe I will have a paper on my current results completed in the early October.

------
Oleg Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

On 09/12/2016 05:03 PM, Vinicius Segalin wrote:

Hi everyone,

I'm trying to find a way to predict query runtime (I don't need to be extremely precise). I've been reading some papers about it, and people are using machine learning to do so. For the feature vector, they use what the DBMS's query planner provide, such as operators and their cost. The thing is that I haven't found any work using PostgreSQL, so I'm struggling to adapt it.
My question is if anyone is aware of a work that uses machine learning and PostgreSQL to predict query runtime, or maybe some other method to perform this.

Thank you.

Best regards,

Vinicius Segalin

Re: Predicting query runtime

From

Vinicius Segalin

Date:

14 September 2016, 14:14:32

2016-09-14 7:23 GMT-03:00 Oleg Ivanov <o.ivanov@postgrespro.ru>:

I hope listed papers will be useful for your master's thesis.

I'm sure they will!

Thank you, Oleg.