Thread: Wait events monitoring future development
https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring
Wait event monitoring looks once again stuck on its way through community approval, in spite of the huge progress made last year in that direction. The importance of the topic is beyond discussion now: if you talk to any PostgreSQL person about implementing such a tool in Postgres and the person does not get excited, you are probably talking to a full-time PostgreSQL developer ;-) Obviously it needs a better design, of both the user interface and the implementation, and perhaps this is why full-time developers are still sceptical.
If you are a PostgreSQL DBA with Oracle experience and use perf for troubleshooting Postgres, you are an ideal person to share your experience, but everyone is welcome.
2. Further pg_wait_sampling performance testing is needed, in different environments.
According to its developers the overhead is small, but many people suspect it can be much more significant for intensive workloads. Obviously, it is not an easy task to test, because you need to put code of doubtful production-readiness into mission-critical production systems for such tests.
PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
ik@postgresql-consulting.com
On Sun, Aug 7, 2016 at 5:33 PM, Ilya Kosmodemiansky <ilya.kosmodemiansky@postgresql-consulting.com> wrote:
> Hi,
>
> I've summarized Wait events monitoring discussion at Developer unconference
> in Ottawa this year on wiki:
>
> https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring
>
> (Thanks to Alexander Korotkov for patiently pushing me to make this thing
> finally done)
>
> If you attended, feel free to point out if I missed something, I will put
> it on the wiki too.

Thanks for the summarization.

> Wait event monitoring looks once again stuck on its way through community
> approval in spite of the huge progress made last year in that direction. The
> importance of the topic is beyond discussion now: if you talk to any
> PostgreSQL person about implementing such a tool in Postgres and the
> person does not get excited, you are probably talking to a full-time PostgreSQL
> developer ;-) Obviously it needs a better design, both the user interface and
> implementation, and perhaps this is why full-time developers are still
> sceptical.
>
> In order to move forward, imho we need at least the following steps, which can
> be done in parallel:
>
> 1. Further requirements need to be collected from DBAs.
>
> If you are a PostgreSQL DBA with Oracle experience and use perf for
> troubleshooting Postgres, you are an ideal person to share your experience,
> but everyone is welcome.
>
> 2. Further pg_wait_sampling performance testing is needed, in different
> environments.

I think it is better to first go with a knob whose default value will be off. We can do the performance testing as well, and if by the end of the release nobody has reported any visible regression, then we can discuss changing the default to on.

> According to developers, the overhead is small, but many people have doubts
> that it can be much more significant for intensive workloads. Obviously, it
> is not an easy task to test, because you need to put code of doubtful
> production-readiness into mission-critical production for such tests.
> As a result it will be clear whether this design should be abandoned and we
> need to think about less-invasive solutions, or this design is acceptable.

I think the main objection here was that gettimeofday can cause a performance regression, which can be taken care of by a configurable knob. I am not aware of any other part of the design having been discussed in enough detail to conclude whether it has any obvious problem.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
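The knob idea discussed above, paying for clock reads only when the feature is enabled, can be sketched in miniature. This is an illustrative Python stand-in, not PostgreSQL's actual mechanism; the names `track_wait_timing`, `begin_wait`, and `end_wait` are hypothetical:

```python
import time

# Hypothetical sketch of a timing "knob": when track_wait_timing is off
# (the proposed default), no clock read happens at all, so the feature
# costs only a branch. All names here are illustrative.
track_wait_timing = False

wait_stats = {}  # event name -> accumulated wait time in nanoseconds

def begin_wait():
    # Read the clock only when timing is enabled.
    return time.perf_counter_ns() if track_wait_timing else None

def end_wait(start, event):
    if start is not None:
        elapsed = time.perf_counter_ns() - start
        wait_stats[event] = wait_stats.get(event, 0) + elapsed

# Disabled: no clock reads, nothing accumulated.
s = begin_wait()
end_wait(s, "LWLockNamed")

# Enabled: waits are timed and aggregated per event.
track_wait_timing = True
s = begin_wait()
time.sleep(0.001)  # stand-in for an actual wait
end_wait(s, "LWLockNamed")
```

The point of the pattern is that the expensive call sits entirely behind the flag check, which is why a default-off knob can be close to free.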
On Mon, Aug 8, 2016 at 04:43:40PM +0530, Amit Kapila wrote:
> > According to developers, overhead is small, but many people have doubts
> > that it can be much more significant for intensive workloads. Obviously, it
> > is not an easy task to test, because you need to put doubtfully
> > non-production ready code into mission-critical production for such tests.
> > As a result it will be clear if this design should be abandoned and we
> > need to think about less-invasive solutions or this design is acceptable.
>
> I think here main objection was that gettimeofday can cause
> performance regression which can be taken care by using configurable
> knob. I am not aware if any other part of the design has been
> discussed in detail to conclude whether it has any obvious problem.

It seems asking users to run pg_test_timing before deploying to check the overhead would be sufficient.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Mon, Aug 8, 2016 at 10:03 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Aug 8, 2016 at 04:43:40PM +0530, Amit Kapila wrote:
>> I think here main objection was that gettimeofday can cause
>> performance regression which can be taken care by using configurable
>> knob. I am not aware if any other part of the design has been
>> discussed in detail to conclude whether it has any obvious problem.
>
> It seems asking users to run pg_test_timing before deploying to check
> the overhead would be sufficient.

They should also run it in parallel, as sometimes the real overhead is in synchronization between multiple CPUs and doesn't show up when only a single CPU is involved.

Cheers,

Jeff
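A rough way to see what Jeff describes, clock reads that look cheap alone but degrade under concurrency, is a microbenchmark in the spirit of pg_test_timing. This Python sketch is only illustrative: CPython threads are a crude stand-in for truly concurrent backends, so a real test should use pg_test_timing itself or a C harness:

```python
import threading
import time

def timer_cost_ns(iters=100_000):
    """Estimate the per-call cost of reading the clock, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iters):
        time.perf_counter_ns()
    return (time.perf_counter_ns() - start) / iters

# Single-threaded estimate, roughly what pg_test_timing reports.
solo_cost = timer_cost_ns()

# The same loop from several threads at once: on some clock sources,
# cross-CPU synchronization makes this much slower than the solo number.
results = []
lock = threading.Lock()

def worker():
    cost = timer_cost_ns()
    with lock:
        results.append(cost)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
worst_parallel_cost = max(results)

print(f"solo: {solo_cost:.0f} ns/call, "
      f"worst parallel: {worst_parallel_cost:.0f} ns/call")
```

If the parallel number is dramatically worse than the solo one, the timing source is a poor fit for per-wait measurement under load.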
On Mon, Aug 8, 2016 at 7:03 PM, Bruce Momjian <bruce@momjian.us> wrote:
> It seems asking users to run pg_test_timing before deploying to check
> the overhead would be sufficient.

I'm not sure. Time measurement for waits is slightly more complicated than time measurement for EXPLAIN ANALYZE: the right workload plus using gettimeofday in a straightforward manner can cause huge overhead. That's why proper testing is important: if we see a significant performance drop with, for example, large shared_buffers at the same concurrency, that shows gettimeofday is too expensive to use. Am I correct that we do not have such accurate tests now?

Another concern of mine is that it is a bad idea to release a feature which allegedly has a huge performance impact, even if it is not turned on by default. I often meet people who do not use exceptions in plpgsql because of the tip "A block containing an EXCEPTION clause is significantly more expensive to enter ..." in the PostgreSQL documentation.

--
Ilya Kosmodemiansky,
PostgreSQL-Consulting.com

tel. +14084142500
cell. +4915144336040
ik@postgresql-consulting.com
On Mon, Aug 8, 2016 at 11:47:11PM +0200, Ilya Kosmodemiansky wrote:
> On Mon, Aug 8, 2016 at 7:03 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > It seems asking users to run pg_test_timing before deploying to check
> > the overhead would be sufficient.
>
> I'am not sure. Time measurement for waits is slightly more complicated
> than a time measurement for explain analyze: a good workload plus
> using gettimeofday in a straightforward manner can cause huge
> overhead. Thats why a proper testing is important - if we can see a
> significant performance drop if we have for example large
> shared_buffers with the same concurrency, that shows gettimeofday is
> too expensive to use. Am I correct, that we do not have such accurate
> tests now?

Well, if we find that pg_test_timing is insufficient, we can perhaps add a parallel test option to that utility.

> My another concern is, that it is a bad idea to release a feature,
> which allegedly has huge performance impact even if it is not turned
> on by default. I often meet people who do not use exceptions in
> plpgsql because a tip "A block containing an EXCEPTION clause is
> significantly more expensive to enter ..." in PostgreSQL documentation

Well, if we document that it can be slow, it is up to the user to decide if they want to use it.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB http://enterprisedb.com
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ilya Kosmodemiansky
> I've summarized Wait events monitoring discussion at Developer unconference in Ottawa this year on wiki:
>
> https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring
I hope wait event monitoring will be on by default even if the overhead is not almost zero, because the data needs to be readily available for faster troubleshooting. IMO, the benefit would be worth even 10% overhead. If you disable it by default because of the overhead, how can we convince users to enable it in production systems to solve some performance problem? I’m afraid strict users would say “we can’t change any setting that might cause more trouble, so investigate the cause with the existing information.”
We should treat the performance with wait event monitoring on as the new normal. Then we should develop more features that leverage the wait event data, so that the data becomes indispensable. The manual would explain to users that wait event monitoring can be turned off for maximal performance, but that this is not recommended.
BTW, taking advantage of this chance, why don’t we enrich the performance tuning content in the manual? At the least, how to analyze the wait event data and tune the system needs to be explained.
Performance Tips
https://www.postgresql.org/docs/devel/static/performance-tips.html
Regards
Takayuki Tsunakawa
On Tue, Aug 9, 2016 at 02:06:40AM +0000, Tsunakawa, Takayuki wrote:
> I hope wait event monitoring will be on by default even if the overhead is not
> almost zero, because the data needs to be readily available for faster
> troubleshooting. IMO, the benefit would be worth even 10% overhead. If you
> disable it by default because of overhead, how can we convince users to enable
> it in production systems to solve some performance problem? I’m afraid severe
> users would say “we can’t change any setting that might cause more trouble, so
> investigate the cause with existing information.”

If you want to know why people are against enabling this monitoring by default, above is the reason. What percentage of people do you think would be willing to take a 10% performance penalty for monitoring like this? I would bet very few, but the argument above doesn't seem to address the fact that it is a small percentage.

In fact, the argument above goes even farther, saying that we should enable it all the time because people will be unwilling to enable it on their own. I have to question the value of the information if users are not willing to enable it. And the solution proposed is to force the 10% default overhead on everyone, whether they are currently doing debugging, whether they will ever do this level of debugging, because people will be too scared to enable it. (Yes, I think Oracle took this approach.)

We can talk about this feature all we want, but if we are not willing to be realistic about how much performance penalty the _average_ user is willing to lose to have this monitoring, I fear we will make little progress on this feature.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB http://enterprisedb.com
On 08/08/2016 07:37 PM, Bruce Momjian wrote:
> On Tue, Aug 9, 2016 at 02:06:40AM +0000, Tsunakawa, Takayuki wrote:
>> I hope wait event monitoring will be on by default even if the overhead is not
>> almost zero, because the data needs to be readily available for faster
>> troubleshooting. IMO, the benefit would be worth even 10% overhead.
>
> If you want to know why people are against enabling this monitoring by
> default, above is the reason. What percentage of people do you think
> would be willing to take a 10% performance penalty for monitoring like
> this? I would bet very few, but the argument above doesn't seem to
> address the fact it is a small percentage.

I would argue it is zero. There are definitely users for this feature but to enable it by default is looking for trouble. *MOST* users do not need this.

Sincerely,

JD

--
Command Prompt, Inc. http://the.postgres.company/ +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.
2016-08-09 11:49 GMT+09:00 Joshua D. Drake <jd@commandprompt.com>:
> On 08/08/2016 07:37 PM, Bruce Momjian wrote:
>> If you want to know why people are against enabling this monitoring by
>> default, above is the reason. What percentage of people do you think
>> would be willing to take a 10% performance penalty for monitoring like
>> this? I would bet very few, but the argument above doesn't seem to
>> address the fact it is a small percentage.
>
> I would argue it is zero. There are definitely users for this feature but to
> enable it by default is looking for trouble. *MOST* users do not need this.

I used to think that this kind of feature should be enabled by default, because when I was working at my previous company, I had only a few ways to understand what was happening inside PostgreSQL by observing production databases. I needed those features enabled in the production databases when I was called in.

However, now I have another opinion. When we release the next major version, say 10.0, with the wait monitoring, many people will start their benchmark tests with a configuration with *the default values*, and if they see some performance decrease, for example around 10%, they will talk about it as a performance decrease in PostgreSQL 10.0. It means PostgreSQL will be facing a difficult reputation.

So, I agree that the feature should be disabled by default for a while.

Regards,
--
Satoshi Nagayasu <snaga@uptime.jp>
2016-08-07 21:03 GMT+09:00 Ilya Kosmodemiansky <ilya.kosmodemiansky@postgresql-consulting.com>:
> I've summarized Wait events monitoring discussion at Developer unconference
> in Ottawa this year on wiki:
>
> https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring
>
> (Thanks to Alexander Korotkov for patiently pushing me to make this thing
> finally done)

Thanks for your effort to make us move forward.

> In order to move forward, imho we need at least some steps, whose steps can
> be done in parallel
>
> 1. Further requirements need to be collected from DBAs.
>
> 2. Further pg_wait_sampling performance testing needed and in different
> environments.
>
> Any thoughts?

Seems a good starting point. I'm interested in both, and I would like to contribute by running (or writing) several tests.

Regards,
--
Satoshi Nagayasu <snaga@uptime.jp>
From: pgsql-hackers-owner@postgresql.org
> If you want to know why people are against enabling this monitoring by
> default, above is the reason. What percentage of people do you think would
> be willing to take a 10% performance penalty for monitoring like this? I
> would bet very few, but the argument above doesn't seem to address the fact
> it is a small percentage.
>
> In fact, the argument above goes even farther, saying that we should enable
> it all the time because people will be unwilling to enable it on their own.
> I have to question the value of the information if users are not willing
> to enable it. And the solution proposed is to force the 10% default overhead
> on everyone, whether they are currently doing debugging, whether they will
> ever do this level of debugging, because people will be too scared to enable
> it. (Yes, I think Oracle took this approach.)
>
> We can talk about this feature all we want, but if we are not willing to
> be realistic in how much performance penalty the _average_ user is willing
> to lose to have this monitoring, I fear we will make little progress on
> this feature.

OK, 10% was an overstatement. Anyway, as Amit said, we can discuss the default value based on the performance evaluation before release.

As another idea, we can stand on middle ground. Interestingly, MySQL also enables their event monitoring (Performance Schema) by default, but not all events are collected. I guess highly encountered events are not collected by default to minimize the overhead.

http://dev.mysql.com/doc/refman/5.7/en/performance-schema-quick-start.html

--------------------------------------------------
Assuming that the Performance Schema is available, it is enabled by default.
...
[mysqld]
performance_schema=ON
...
Initially, not all instruments and consumers are enabled, so the performance schema does not collect all events. To turn all of these on and enable event timing, execute two statements (the row counts may differ depending on MySQL version):

mysql> UPDATE setup_instruments SET ENABLED = 'YES', TIMED = 'YES';
Query OK, 560 rows affected (0.04 sec)
mysql> UPDATE setup_consumers SET ENABLED = 'YES';
Query OK, 10 rows affected (0.00 sec)
--------------------------------------------------

BTW, I remember EnterpriseDB has a wait event monitoring feature. Is it disabled by default? What was the overhead?

Regards
Takayuki Tsunakawa
From: pgsql-hackers-owner@postgresql.org
> I used to think of that this kind of features should be enabled by default,
> because when I was working at the previous company, I had only few features
> to understand what is happening inside PostgreSQL by observing production
> databases. I needed those features enabled in the production databases when
> I was called.
>
> However, now I have another opinion. When we release the next major release
> saying 10.0 with the wait monitoring, many people will start their benchmark
> test with a configuration with *the default values*, and if they see some
> performance decrease, for example around 10%, they will be talking about
> it as the performance decrease in PostgreSQL 10.0. It means PostgreSQL will
> be facing difficult reputation.
>
> So, I agree with the features should be disabled by default for a while.

I understand your feeling well. This is a difficult decision. Let's hope for trivial overhead.

Regards
Takayuki Tsunakawa
On Tue, Aug 9, 2016 at 04:17:28AM +0000, Tsunakawa, Takayuki wrote:
> > However, now I have another opinion. When we release the next major release
> > saying 10.0 with the wait monitoring, many people will start their benchmark
> > test with a configuration with *the default values*, and if they see some
> > performance decrease, for example around 10%, they will be talking about
> > it as the performance decrease in PostgreSQL 10.0. It means PostgreSQL will
> > be facing difficult reputation.
> >
> > So, I agree with the features should be disabled by default for a while.
>
> I understand your feeling well. This is a difficult decision. Let's hope for trivial overhead.

I think the goal is that some internal tracking can be enabled by default and some internal or external tool can be turned on and off to get more fine-grained statistics about the event durations.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB http://enterprisedb.com
On 8/8/16 11:07 PM, Tsunakawa, Takayuki wrote:
> From: pgsql-hackers-owner@postgresql.org
>> If you want to know why people are against enabling this monitoring by
>> default, above is the reason. What percentage of people do you think would
>> be willing to take a 10% performance penalty for monitoring like this? I
>> would bet very few, but the argument above doesn't seem to address the fact
>> it is a small percentage.
>>
>> In fact, the argument above goes even farther, saying that we should enable
>> it all the time because people will be unwilling to enable it on their own.
>> I have to question the value of the information if users are not willing
>> to enable it. And the solution proposed is to force the 10% default overhead
>> on everyone, whether they are currently doing debugging, whether they will
>> ever do this level of debugging, because people will be too scared to enable
>> it. (Yes, I think Oracle took this approach.)

Let's put this in perspective: there are tons of companies that spend thousands of dollars per month extra by running un-tuned systems in cloud environments. I almost called that "waste", but in reality it should be a simple business question: is it worth more to the company to spend resources on reducing the AWS bill or on rolling out new features? It's something that can be estimated and a rational business decision made.

Where things become completely *irrational* is when a developer reads something like "plpgsql blocks with an EXCEPTION handler are more expensive" and they freak out and spend a bunch of time trying to avoid them, without even the faintest idea of what that overhead actually is. More important, they haven't the faintest idea of what that overhead costs the company, versus what it costs the company for them to spend an extra hour trying to avoid the EXCEPTION (and probably introducing code that's far more bug-prone in the process).

So in reality, the only people likely to notice even something as large as a 10% hit are those that were already close to maxing out their hardware anyway.

The downside to leaving stuff like this off by default is users won't remember it's there when they need it. At best, that means they spend more time debugging something than they need to. At worst, it means they suffer a production outage for longer than they need to, and that can easily exceed many months' or years' worth of the extra cost from the monitoring overhead.

>> We can talk about this feature all we want, but if we are not willing to
>> be realistic in how much performance penalty the _average_ user is willing
>> to lose to have this monitoring, I fear we will make little progress on
>> this feature.
>
> OK, 10% was an overstatement. Anyway, as Amit said, we can discuss the
> default value based on the performance evaluation before release.
>
> As another idea, we can stand on middle ground. Interestingly, MySQL also
> enables their event monitoring (Performance Schema) by default, but not all
> events are collected. I guess highly encountered events are not collected
> by default to minimize the overhead.

That's what we currently do with several track_* and log_*_stats GUCs, several of which I forgot even existed until just now. Since there's question over the actual overhead, maybe that's a prudent approach for now, but I think we should be striving to enable these things ASAP.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)  mobile: 512-569-9461
From: pgsql-hackers-owner@postgresql.org
> So in reality, the only people likely to notice even something as large
> as a 10% hit are those that were already close to maxing out their hardware
> anyway.
>
> The downside to leaving stuff like this off by default is users won't
> remember it's there when they need it. At best, that means they spend more
> time debugging something than they need to. At worse, it means they suffer
> a production outage for longer than they need to, and that can easily exceed
> many months/years worth of the extra cost from the monitoring overhead.

I'd rather like this way of positive thinking. It will be better to think of the event monitoring as a positive feature for (daily) proactive improvement, not only as a debugging feature, which gives a negative image. For example, pgAdmin4 can display the 10 most time-consuming events and their solutions. The DBA initially places the database and WAL on the same volume. As the system grows and the write workload increases, the DBA can get a suggestion from pgAdmin4 that he can prepare for the system growth by placing WAL on another volume to reduce WALWriteLock wait events. This is not debugging, but proactive monitoring.

> > As another idea, we can stand on the middle ground. Interestingly, MySQL
> > also enables their event monitoring (Performance Schema) by default, but
> > not all events are collected. I guess highly encountered events are not
> > collected by default to minimize the overhead.
>
> That's what we currently do with several track_* and log_*_stats GUCs,
> several of which I forgot even existed until just now. Since there's question
> over the actual overhead maybe that's a prudent approach for now, but I
> think we should be striving to enable these things ASAP.

Agreed. And as Bruce said, it may be better to be able to disable collection of some events that have a visible impact on performance.

Regards
Takayuki Tsunakawa
On Wed, Aug 10, 2016 at 05:14:52PM +0300, Alexander Korotkov wrote:
> On Tue, Aug 9, 2016 at 5:37 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Tue, Aug 9, 2016 at 02:06:40AM +0000, Tsunakawa, Takayuki wrote:
> > > I hope wait event monitoring will be on by default even if the overhead is not
> > > almost zero, because the data needs to be readily available for faster
> > > troubleshooting. IMO, the benefit would be worth even 10% overhead.
> >
> > If you want to know why people are against enabling this monitoring by
> > default, above is the reason. What percentage of people do you think
> > would be willing to take a 10% performance penalty for monitoring like
> > this? I would bet very few, but the argument above doesn't seem to
> > address the fact it is a small percentage.
>
> Just two notes from me:
>
> 1) 10% overhead from monitoring wait events is just an idea without any proof
> so soon.
> 2) We already have functionality which trades insight into database with way
> more huge overhead. auto_explain.log_analyze = true can slowdown queries *in
> times*. Do you think we should remove it?

The point is not removing it; the point is whether auto_explain.log_analyze = true should be enabled by default, and I think no one wants to do that.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB http://enterprisedb.com
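For context, the auto_explain comparison above is about settings along these lines; log_analyze is the expensive option at issue, and it is off by default (a minimal postgresql.conf sketch, with an illustrative threshold value):

```
# postgresql.conf excerpt (auto_explain is a contrib module)
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '250ms'   # log plans of statements slower than this
auto_explain.log_analyze = on             # per-node timing: the costly part discussed here
```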
2016/08/10 23:22 "Bruce Momjian" <bruce@momjian.us>:
>
> On Wed, Aug 10, 2016 at 05:14:52PM +0300, Alexander Korotkov wrote:
> > On Tue, Aug 9, 2016 at 5:37 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > > On Tue, Aug 9, 2016 at 02:06:40AM +0000, Tsunakawa, Takayuki wrote:
> > > > I hope wait event monitoring will be on by default even if the
> > > > overhead is not almost zero, because the data needs to be readily
> > > > available for faster troubleshooting. IMO, the benefit would be
> > > > worth even 10% overhead. If you disable it by default because of
> > > > overhead, how can we convince users to enable it in production
> > > > systems to solve some performance problem? I’m afraid severe users
> > > > would say “we can’t change any setting that might cause more
> > > > trouble, so investigate the cause with existing information.”
> > >
> > > If you want to know why people are against enabling this monitoring
> > > by default, above is the reason.  What percentage of people do you
> > > think would be willing to take a 10% performance penalty for
> > > monitoring like this?  I would bet very few, but the argument above
> > > doesn't seem to address the fact it is a small percentage.
> >
> > Just two notes from me:
> >
> > 1) 10% overhead from monitoring wait events is so far just an idea,
> > without any proof.
> > 2) We already have functionality that trades insight into the database
> > for far larger overhead: auto_explain.log_analyze = true can slow
> > queries down *several times over*. Do you think we should remove it?
>
> The point is not removing it; the point is whether
> auto_explain.log_analyze = true should be enabled by default, and I
> think no one wants to do that.

Agreed.

If people face a difficult performance situation, they may accept some
(one-time) overhead to resolve the issue. But if they don't have (or
don't recognize) any issue, they may not.

That's one of the realities, in my experience.

Regards,
On Wed, Aug 10, 2016 at 11:37:36PM +0900, Satoshi Nagayasu wrote:
> Agreed.
>
> If people face a difficult performance situation, they may accept some
> (one-time) overhead to resolve the issue. But if they don't have (or
> don't recognize) any issue, they may not.
>
> That's one of the realities, in my experience.

Yes.  Many people are arguing for specific defaults based on what _they_
would want, not what the average user would want.  Sophisticated users
will know about this feature and turn it on when desired.

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
On Tue, Aug 9, 2016 at 12:07 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> As another idea, we can stand on the middle ground. Interestingly, MySQL
> also enables their event monitoring (Performance Schema) by default, but
> not all events are collected. I guess frequently encountered events are
> excluded by default to minimize the overhead.

Yes, I think that's a sensible approach.  I can't see enabling by default
a feature that significantly regresses performance.  We work too hard to
improve performance to throw very much of it away for any one feature,
even a feature that a lot of people like.

What I really like about what got committed to 9.6 is that it's so cheap
we should be able to use it for lots of other things - latch events,
network I/O, disk I/O, etc. - without hurting performance at all.  But if
we start timing those events, it's going to be really expensive.  Even
just counting them or keeping a history will cost a lot more than just
publishing them while they're active, which is what we're doing now.

> BTW, I remember EnterpriseDB has a wait event monitoring feature. Is it
> disabled by default? What was the overhead?

Timed events in Advanced Server are disabled by default.  I haven't
actually tested the overhead myself and I don't remember exactly what
the numbers were the last time someone else did, but I think that
turning edb_timed_statistics on is pretty expensive.  If we can agree on
something sensible here, I imagine we'll get rid of that feature in
Advanced Server in favor of whatever the community settles on.  But if
the community agrees to turn on something by default that costs a
measurable percentage in performance, I predict that Advanced Server 10
will ship with a different default for that feature than PostgreSQL 10.

Personally, I think too much of this thread (and previous threads) has
been devoted to arguing about whether it's OK to make performance worse,
and by how much we'd be willing to make it worse.
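[Editorial aside: the gap described above between merely *publishing* a wait event and *timing* it can be made concrete with a toy benchmark. This is a hypothetical illustration, not PostgreSQL code - the slot variable and event code are invented. In the server, the published form is a single store into shared memory; timed collection adds two clock reads plus an accumulation per wait.]

```python
import time

N = 200_000

# "Publishing": on entering a wait, store an event code in a shared slot;
# clear it on exit. Two plain stores per wait, nothing else.
wait_event_slot = 0
t0 = time.perf_counter_ns()
for _ in range(N):
    wait_event_slot = 42   # hypothetical event code
    wait_event_slot = 0
t_publish = time.perf_counter_ns() - t0

# "Timing": each wait additionally reads the clock on entry and on exit
# and accumulates the elapsed time into a counter.
total_wait_ns = 0
t0 = time.perf_counter_ns()
for _ in range(N):
    start = time.perf_counter_ns()
    wait_event_slot = 42
    wait_event_slot = 0
    total_wait_ns += time.perf_counter_ns() - start
t_timed = time.perf_counter_ns() - t0

print("timing costs more than publishing:", t_timed > t_publish)
```

The per-wait delta looks tiny, but waits are extremely frequent on a busy server, which is why counting or history-keeping costs so much more than the publish-only scheme that shipped in 9.6.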
What I think we ought to be talking about is how to design a feature
that produces the most useful data at the least possible performance
cost, for example by avoiding measuring wait times for events that are
very frequent or for waits that are very short.  Or maybe we could have
a background process that updates a timestamp in shared memory every
millisecond, and other processes could read that value instead of making
a system call.  I think on Linux systems with fast clocks the operating
system basically does something like that for you, but there might be
other systems where it helps.  Of course, it could also skew the results
if the system is so overloaded that the clock-updater process gets
descheduled for a lengthy period of time.

Anyway, I disagree with the idea that this feature is stalled or blocked
in some way.  I (and quite a few other people, though not everyone)
oppose making performance significantly worse in the default
configuration.  I oppose that regardless of whether it is a hypothetical
patch for this feature that causes the problem or a hypothetical patch
for some other feature.  I am not otherwise opposed to more work in this
area; in fact, I'm rather in favor of it.  But you can count on me to
argue against pretty much anything that causes a performance regression,
whatever the reason.

Virtually every release, at least one developer proposes a patch that
slows the server down by "only" 1-2%.  If we'd accepted all of the
patches that were shot down because of such impacts, we'd have lost a
very big chunk of performance between the time I started working on
PostgreSQL and now.  As it is, our single-threaded performance seems to
have regressed noticeably since 9.1:

http://bonesmoses.org/2016/01/08/pg-phriday-how-far-weve-come/

I think that's awful.  But if we'd accepted all of those patches that
cost "only" one or two percentage points, it would probably be -15% or
-25% rather than -4.4%.
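[Editorial aside: the coarse-clock idea above - a background process refreshing a shared timestamp so that readers avoid a clock system call - can be sketched as follows. This is a hypothetical illustration with invented names, using a thread to stand in for the background process and an ordinary attribute to stand in for shared memory.]

```python
import threading
import time

class CoarseClock:
    """One updater refreshes a shared timestamp every ~1 ms; readers do
    a plain memory read instead of asking the OS for the time."""

    def __init__(self, interval=0.001):
        self.now_ns = time.monotonic_ns()
        self._stop = threading.Event()
        self._updater = threading.Thread(
            target=self._run, args=(interval,), daemon=True)
        self._updater.start()

    def _run(self, interval):
        # Event.wait() doubles as the sleep and the shutdown check.
        while not self._stop.wait(interval):
            self.now_ns = time.monotonic_ns()  # the only real clock call

    def read(self):
        return self.now_ns                     # cheap: no system call

    def stop(self):
        self._stop.set()
        self._updater.join()

clock = CoarseClock()
t1 = clock.read()
time.sleep(0.05)            # let several refresh intervals elapse
t2 = clock.read()
clock.stop()
print("coarse clock advanced:", t2 > t1)
```

The trade-off is exactly the one named in the message: readers get a value that is up to one refresh interval stale, and if the updater is descheduled under heavy load, the staleness grows and skews the measurements.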
I think that if we want to really be successful as a project, we need to
make that number go UP, not down.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2016-08-07 14:03:17 +0200, Ilya Kosmodemiansky wrote:
> Wait event monitoring looks once again stuck on the way through
> community approval in spite of huge progress done last year in that
> direction.

I see little evidence of that. If you consider "please do some
reasonable benchmarks" as being stuck...