Thread: Wait events monitoring future development

Wait events monitoring future development

From
Ilya Kosmodemiansky
Date:
Hi,

I've summarized Wait events monitoring discussion at Developer unconference in Ottawa this year on wiki:

https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring


(Thanks to Alexander Korotkov for patiently pushing me to make this thing finally done)

If you attended, feel free to point out anything I missed; I will put it on the wiki too.

Wait event monitoring looks once again stuck on its way through community approval, in spite of the huge progress made in that direction last year. The importance of the topic is beyond discussion now: if you talk to any PostgreSQL person about implementing such a tool in Postgres and the person does not get excited, you are probably talking to a full-time PostgreSQL developer ;-) Obviously it needs a better design, both of the user interface and the implementation, and perhaps this is why full-time developers are still sceptical.

In order to move forward, IMHO we need at least the following steps, which can be done in parallel:

1. Further requirements need to be collected from DBAs.

   If you are a PostgreSQL DBA with Oracle experience and use perf for troubleshooting Postgres - you are an ideal person to share your experience, but everyone is welcome.

2. Further pg_wait_sampling performance testing is needed, in different environments.

   According to the developers, the overhead is small, but many people suspect it can be much more significant for intensive workloads. Obviously, this is not an easy thing to test, because you need to put code of doubtful production readiness onto mission-critical production systems for such tests.
   As a result it will become clear whether this design is acceptable, or whether it should be abandoned in favor of less invasive solutions.

Any thoughts?

Best regards,
Ilya
  
--
Ilya Kosmodemiansky,

PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
ik@postgresql-consulting.com

Re: Wait events monitoring future development

From
Amit Kapila
Date:
On Sun, Aug 7, 2016 at 5:33 PM, Ilya Kosmodemiansky
<ilya.kosmodemiansky@postgresql-consulting.com> wrote:
> Hi,
>
> I've summarized Wait events monitoring discussion at Developer unconference
> in Ottawa this year on wiki:
>
> https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring
>
>
> (Thanks to Alexander Korotkov for patiently pushing me to make this thing
> finally done)
>
> If you attended, feel free to point out anything I missed; I will put it on
> the wiki too.
>

Thanks for summarization.

> Wait event monitoring looks once again stuck on its way through community
> approval, in spite of the huge progress made in that direction last year. The
> importance of the topic is beyond discussion now: if you talk to any
> PostgreSQL person about implementing such a tool in Postgres and the person
> does not get excited, you are probably talking to a full-time PostgreSQL
> developer ;-) Obviously it needs a better design, both of the user interface
> and the implementation, and perhaps this is why full-time developers are
> still sceptical.
>
> In order to move forward, IMHO we need at least the following steps, which
> can be done in parallel
>
> 1. Further requirements need to be collected from DBAs.
>
>    If you are a PostgreSQL DBA with Oracle experience and use perf for
> troubleshooting Postgres - you are an ideal person to share your experience,
> but everyone is welcome.
>
> 2. Further pg_wait_sampling performance testing needed and in different
> environments.
>

I think it is better to first go with a knob whose default value will
be off.  We can do the performance testing as well, and if by the end
of the release cycle nobody has reported any visible regression, then
we can discuss changing the default to on.
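
A knob along these lines might look like the following postgresql.conf fragment. The GUC name here is purely illustrative (no such setting exists or has been agreed on); it is shown only to make the off-by-default proposal concrete:

```ini
# Hypothetical sketch only -- this GUC name is invented for illustration.
# Wait event *timing* disabled by default: the cheap wait_event
# bookkeeping stays on, but per-event durations (gettimeofday calls)
# are not collected unless the admin opts in.
track_wait_timing = off
```

Flipping it to on after the performance testing described above would then be a one-line configuration change rather than a rebuild.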

>    According to the developers, the overhead is small, but many people
> suspect it can be much more significant for intensive workloads. Obviously,
> this is not an easy thing to test, because you need to put code of doubtful
> production readiness onto mission-critical production systems for such tests.
>    As a result it will become clear whether this design is acceptable, or
> whether it should be abandoned in favor of less invasive solutions.
>

I think the main objection here was that gettimeofday can cause a
performance regression, which can be taken care of by a configurable
knob.  I am not aware of any other part of the design having been
discussed in enough detail to conclude whether it has any obvious
problem.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Wait events monitoring future development

From
Bruce Momjian
Date:
On Mon, Aug  8, 2016 at 04:43:40PM +0530, Amit Kapila wrote:
> >    According to the developers, the overhead is small, but many people
> > suspect it can be much more significant for intensive workloads. Obviously,
> > this is not an easy thing to test, because you need to put code of doubtful
> > production readiness onto mission-critical production systems for such tests.
> >    As a result it will become clear whether this design is acceptable, or
> > whether it should be abandoned in favor of less invasive solutions.
> >
> 
> I think the main objection here was that gettimeofday can cause a
> performance regression, which can be taken care of by a configurable
> knob.  I am not aware of any other part of the design having been
> discussed in enough detail to conclude whether it has any obvious problem.

It seems asking users to run pg_test_timing before deploying to check
the overhead would be sufficient.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Wait events monitoring future development

From
Jeff Janes
Date:
On Mon, Aug 8, 2016 at 10:03 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Aug  8, 2016 at 04:43:40PM +0530, Amit Kapila wrote:
>> >    According to the developers, the overhead is small, but many people
>> > suspect it can be much more significant for intensive workloads. Obviously,
>> > this is not an easy thing to test, because you need to put code of doubtful
>> > production readiness onto mission-critical production systems for such tests.
>> >    As a result it will become clear whether this design is acceptable, or
>> > whether it should be abandoned in favor of less invasive solutions.
>> >
>>
>> I think the main objection here was that gettimeofday can cause a
>> performance regression, which can be taken care of by a configurable
>> knob.  I am not aware of any other part of the design having been
>> discussed in enough detail to conclude whether it has any obvious problem.
>
> It seems asking users to run pg_test_timing before deploying to check
> the overhead would be sufficient.

They should also run it in parallel, as sometimes the real overhead is
in synchronization between multiple CPUs and doesn't show up when only
a single CPU is involved.

Cheers,

Jeff



Re: Wait events monitoring future development

From
Ilya Kosmodemiansky
Date:
On Mon, Aug 8, 2016 at 7:03 PM, Bruce Momjian <bruce@momjian.us> wrote:
> It seems asking users to run pg_test_timing before deploying to check
> the overhead would be sufficient.

I'm not sure. Time measurement for waits is slightly more complicated
than time measurement for EXPLAIN ANALYZE: the right workload plus
using gettimeofday in a straightforward manner can cause huge
overhead. That's why proper testing is important: if we see a
significant performance drop with, for example, large shared_buffers
at the same concurrency, that shows gettimeofday is too expensive to
use. Am I correct that we do not have such accurate tests now?

Another concern of mine is that it is a bad idea to release a feature
which allegedly has a huge performance impact, even if it is not
turned on by default. I often meet people who do not use exceptions in
PL/pgSQL because of the tip in the PostgreSQL documentation that "A
block containing an EXCEPTION clause is significantly more expensive
to enter ..."


-- 
Ilya Kosmodemiansky,

PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
ik@postgresql-consulting.com



Re: Wait events monitoring future development

From
Bruce Momjian
Date:
On Mon, Aug  8, 2016 at 11:47:11PM +0200, Ilya Kosmodemiansky wrote:
> On Mon, Aug 8, 2016 at 7:03 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > It seems asking users to run pg_test_timing before deploying to check
> > the overhead would be sufficient.
> 
> I'm not sure. Time measurement for waits is slightly more complicated
> than time measurement for EXPLAIN ANALYZE: the right workload plus
> using gettimeofday in a straightforward manner can cause huge
> overhead. That's why proper testing is important: if we see a
> significant performance drop with, for example, large shared_buffers
> at the same concurrency, that shows gettimeofday is too expensive to
> use. Am I correct that we do not have such accurate tests now?

Well, if we find that pg_test_timing is insufficient, we can perhaps add
a parallel test option to that utility.

> Another concern of mine is that it is a bad idea to release a feature
> which allegedly has a huge performance impact, even if it is not
> turned on by default. I often meet people who do not use exceptions in
> PL/pgSQL because of the tip in the PostgreSQL documentation that "A
> block containing an EXCEPTION clause is significantly more expensive
> to enter ..."

Well, if we document that it can be slow, it is up to the user to decide
if they want to use it.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Wait events monitoring future development

From
"Tsunakawa, Takayuki"
Date:

From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ilya Kosmodemiansky
I've summarized Wait events monitoring discussion at Developer unconference in Ottawa this year on wiki:

https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring

I hope wait event monitoring will be on by default even if the overhead is not close to zero, because the data needs to be readily available for faster troubleshooting.  IMO, the benefit would be worth even a 10% overhead.  If you disable it by default because of overhead, how can we convince users to enable it in production systems to solve some performance problem?  I’m afraid demanding users would say “we can’t change any setting that might cause more trouble, so investigate the cause with existing information.”

 

We should positively consider the performance with wait event monitoring on as the new normal.  Then we should develop more features that leverage the wait event data, so that the data becomes indispensable.  The manual would explain to users that wait event monitoring can be turned off for maximal performance, but that this is not recommended.

 

BTW, taking advantage of this chance, why don’t we enrich the content of performance tuning in the manual?  At least it needs to be explained how to analyze the wait event data and tune the system.

 

Performance Tips

https://www.postgresql.org/docs/devel/static/performance-tips.html

 

Regards

Takayuki Tsunakawa


Re: Wait events monitoring future development

From
Bruce Momjian
Date:
On Tue, Aug  9, 2016 at 02:06:40AM +0000, Tsunakawa, Takayuki wrote:
> I hope wait event monitoring will be on by default even if the overhead is not
> almost zero, because the data needs to be readily available for faster
> troubleshooting.  IMO, the benefit would be worth even 10% overhead.  If you
> disable it by default because of overhead, how can we convince users to enable
> it in production systems to solve some performance problem?  I’m afraid severe
> users would say “we can’t change any setting that might cause more trouble, so
> investigate the cause with existing information.”

If you want to know why people are against enabling this monitoring by
default, above is the reason.  What percentage of people do you think
would be willing to take a 10% performance penalty for monitoring like
this?  I would bet very few, but the argument above doesn't seem to
address the fact it is a small percentage.

In fact, the argument above goes even farther, saying that we should
enable it all the time because people will be unwilling to enable it on
their own.  I have to question the value of the information if users are
not willing to enable it.  And the solution proposed is to force the 10%
default overhead on everyone, whether they are currently doing
debugging, whether they will ever do this level of debugging, because
people will be too scared to enable it.  (Yes, I think Oracle took this
approach.)

We can talk about this feature all we want, but if we are not willing to
be realistic in how much performance penalty the _average_ user is
willing to lose to have this monitoring, I fear we will make little
progress on this feature.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Wait events monitoring future development

From
"Joshua D. Drake"
Date:
On 08/08/2016 07:37 PM, Bruce Momjian wrote:
> On Tue, Aug  9, 2016 at 02:06:40AM +0000, Tsunakawa, Takayuki wrote:
>> I hope wait event monitoring will be on by default even if the overhead is not
>> almost zero, because the data needs to be readily available for faster
>> troubleshooting.  IMO, the benefit would be worth even 10% overhead.  If you
>> disable it by default because of overhead, how can we convince users to enable
>> it in production systems to solve some performance problem?  I’m afraid severe
>> users would say “we can’t change any setting that might cause more trouble, so
>> investigate the cause with existing information.”
>
> If you want to know why people are against enabling this monitoring by
> default, above is the reason.  What percentage of people do you think
> would be willing to take a 10% performance penalty for monitoring like
> this?  I would bet very few, but the argument above doesn't seem to
> address the fact it is a small percentage.

I would argue it is zero. There are definitely users for this feature 
but to enable it by default is looking for trouble. *MOST* users do not 
need this.

Sincerely,

JD

-- 
Command Prompt, Inc.                  http://the.postgres.company/                        +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.



Re: Wait events monitoring future development

From
Satoshi Nagayasu
Date:
2016-08-09 11:49 GMT+09:00 Joshua D. Drake <jd@commandprompt.com>:
> On 08/08/2016 07:37 PM, Bruce Momjian wrote:
>>
>> On Tue, Aug  9, 2016 at 02:06:40AM +0000, Tsunakawa, Takayuki wrote:
>>>
>>> I hope wait event monitoring will be on by default even if the overhead
>>> is not
>>> almost zero, because the data needs to be readily available for faster
>>> troubleshooting.  IMO, the benefit would be worth even 10% overhead.  If
>>> you
>>> disable it by default because of overhead, how can we convince users to
>>> enable
>>> it in production systems to solve some performance problem?  I’m afraid
>>> severe
>>> users would say “we can’t change any setting that might cause more
>>> trouble, so
>>> investigate the cause with existing information.”
>>
>>
>> If you want to know why people are against enabling this monitoring by
>> default, above is the reason.  What percentage of people do you think
>> would be willing to take a 10% performance penalty for monitoring like
>> this?  I would bet very few, but the argument above doesn't seem to
>> address the fact it is a small percentage.
>
>
> I would argue it is zero. There are definitely users for this feature but to
> enable it by default is looking for trouble. *MOST* users do not need this.

I used to think that this kind of feature should be enabled by default,
because when I was working at my previous company, I had only a few ways
to understand what was happening inside PostgreSQL by observing production
databases. I needed those features enabled in the production databases
when I was called.

However, now I have another opinion. When we release the next major
version, say 10.0, with wait monitoring, many people will start their
benchmark tests with a configuration with *the default values*, and if
they see some performance decrease, for example around 10%, they will
talk about it as a performance decrease in PostgreSQL 10.0. That means
PostgreSQL will face a difficult reputation.

So, I agree that the feature should be disabled by default for a while.

Regards,
--
Satoshi Nagayasu <snaga@uptime.jp>



Re: Wait events monitoring future development

From
Satoshi Nagayasu
Date:
2016-08-07 21:03 GMT+09:00 Ilya Kosmodemiansky
<ilya.kosmodemiansky@postgresql-consulting.com>:
> I've summarized Wait events monitoring discussion at Developer unconference
> in Ottawa this year on wiki:
>
> https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring
>
> (Thanks to Alexander Korotkov for patiently pushing me to make this thing
> finally done)

Thanks for your effort to make us move forward.

> If you attended, feel free to point out anything I missed; I will put it on
> the wiki too.
>
> Wait event monitoring looks once again stuck on its way through community
> approval, in spite of the huge progress made in that direction last year. The
> importance of the topic is beyond discussion now: if you talk to any
> PostgreSQL person about implementing such a tool in Postgres and the person
> does not get excited, you are probably talking to a full-time PostgreSQL
> developer ;-) Obviously it needs a better design, both of the user interface
> and the implementation, and perhaps this is why full-time developers are
> still sceptical.
>
> In order to move forward, IMHO we need at least the following steps, which
> can be done in parallel
>
> 1. Further requirements need to be collected from DBAs.
>
>    If you are a PostgreSQL DBA with Oracle experience and use perf for
> troubleshooting Postgres - you are an ideal person to share your experience,
> but everyone is welcome.
>
> 2. Further pg_wait_sampling performance testing is needed, in different
> environments.
>
>    According to the developers, the overhead is small, but many people
> suspect it can be much more significant for intensive workloads. Obviously,
> this is not an easy thing to test, because you need to put code of doubtful
> production readiness onto mission-critical production systems for such tests.
>    As a result it will become clear whether this design is acceptable, or
> whether it should be abandoned in favor of less invasive solutions.
>
> Any thoughts?

Seems a good starting point. I'm interested in both, and I would like
to contribute by running (or writing) several tests.

Regards,
-- 
Satoshi Nagayasu <snaga@uptime.jp>



Re: Wait events monitoring future development

From
"Tsunakawa, Takayuki"
Date:
From: pgsql-hackers-owner@postgresql.org
> If you want to know why people are against enabling this monitoring by
> default, above is the reason.  What percentage of people do you think would
> be willing to take a 10% performance penalty for monitoring like this?  I
> would bet very few, but the argument above doesn't seem to address the fact
> it is a small percentage.
> 
> In fact, the argument above goes even farther, saying that we should enable
> it all the time because people will be unwilling to enable it on their own.
> I have to question the value of the information if users are not willing
> to enable it.  And the solution proposed is to force the 10% default overhead
> on everyone, whether they are currently doing debugging, whether they will
> ever do this level of debugging, because people will be too scared to enable
> it.  (Yes, I think Oracle took this
> approach.)
> 
> We can talk about this feature all we want, but if we are not willing to
> be realistic in how much performance penalty the _average_ user is willing
> to lose to have this monitoring, I fear we will make little progress on
> this feature.

OK, 10% was an overstatement.  Anyway, as Amit said, we can discuss the
default value based on the performance evaluation before release.

As another idea, we can stand on middle ground.  Interestingly, MySQL
also enables their event monitoring (Performance Schema) by default, but
not all events are collected.  I guess frequently encountered events are
not collected by default to minimize the overhead.
 

http://dev.mysql.com/doc/refman/5.7/en/performance-schema-quick-start.html
--------------------------------------------------
Assuming that the Performance Schema is available, it is enabled by default.
...
[mysqld]
performance_schema=ON
...
Initially, not all instruments and consumers are enabled, so the performance schema does not collect all events. To turn
all of these on and enable event timing, execute two statements (the row counts may differ depending on MySQL version):
 
mysql> UPDATE setup_instruments SET ENABLED = 'YES', TIMED = 'YES';
Query OK, 560 rows affected (0.04 sec)
mysql> UPDATE setup_consumers SET ENABLED = 'YES';
Query OK, 10 rows affected (0.00 sec)
--------------------------------------------------


BTW, I remember EnterpriseDB has a wait event monitoring feature.  Is it disabled by default?  What was the overhead?

Regards
Takayuki Tsunakawa


Re: Wait events monitoring future development

From
"Tsunakawa, Takayuki"
Date:
From: pgsql-hackers-owner@postgresql.org
> I used to think that this kind of feature should be enabled by default,
> because when I was working at my previous company, I had only a few ways
> to understand what was happening inside PostgreSQL by observing production
> databases. I needed those features enabled in the production databases when
> I was called.
> 
> However, now I have another opinion. When we release the next major version,
> say 10.0, with wait monitoring, many people will start their benchmark tests
> with a configuration with *the default values*, and if they see some
> performance decrease, for example around 10%, they will talk about it as a
> performance decrease in PostgreSQL 10.0. That means PostgreSQL will face a
> difficult reputation.
> 
> So, I agree that the feature should be disabled by default for a while.

I understand your feeling well.  This is a difficult decision.  Let's hope for trivial overhead.

Regards
Takayuki Tsunakawa



Re: Wait events monitoring future development

From
Bruce Momjian
Date:
On Tue, Aug  9, 2016 at 04:17:28AM +0000, Tsunakawa, Takayuki wrote:
> From: pgsql-hackers-owner@postgresql.org
> > I used to think that this kind of feature should be enabled by default,
> > because when I was working at my previous company, I had only a few ways
> > to understand what was happening inside PostgreSQL by observing production
> > databases. I needed those features enabled in the production databases when
> > I was called.
> > 
> > However, now I have another opinion. When we release the next major version,
> > say 10.0, with wait monitoring, many people will start their benchmark tests
> > with a configuration with *the default values*, and if they see some
> > performance decrease, for example around 10%, they will talk about it as a
> > performance decrease in PostgreSQL 10.0. That means PostgreSQL will face a
> > difficult reputation.
> > 
> > So, I agree that the feature should be disabled by default for a while.
> 
> I understand your feeling well.  This is a difficult decision.  Let's hope for trivial overhead.

I think the goal is that some internal tracking can be enabled by
default and some internal or external tool can be turned on and off to
get more fine-grained statistics about the event durations.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Wait events monitoring future development

From
Jim Nasby
Date:
On 8/8/16 11:07 PM, Tsunakawa, Takayuki wrote:
> From: pgsql-hackers-owner@postgresql.org
>> > If you want to know why people are against enabling this monitoring by
>> > default, above is the reason.  What percentage of people do you think would
>> > be willing to take a 10% performance penalty for monitoring like this?  I
>> > would bet very few, but the argument above doesn't seem to address the fact
>> > it is a small percentage.
>> >
>> > In fact, the argument above goes even farther, saying that we should enable
>> > it all the time because people will be unwilling to enable it on their own.
>> > I have to question the value of the information if users are not willing
>> > to enable it.  And the solution proposed is to force the 10% default overhead
>> > on everyone, whether they are currently doing debugging, whether they will
>> > ever do this level of debugging, because people will be too scared to enable
>> > it.  (Yes, I think Oracle took this
>> > approach.)


Let's put this in perspective: there are tons of companies that spend 
thousands of dollars per month extra by running un-tuned systems in 
cloud environments. I almost called that "waste" but in reality it 
should be a simple business question: is it worth more to the company to 
spend resources on reducing the AWS bill or rolling out new features? 
It's something that can be estimated and a rational business decision made.

Where things become completely *irrational* is when a developer reads 
something like "plpgsql blocks with an EXCEPTION handler are more 
expensive" and they freak out and spend a bunch of time trying to avoid 
them, without even the faintest idea of what that overhead actually is. 
More important, they haven't the faintest idea of what that overhead 
costs the company, vs what it costs the company for them to spend an 
extra hour trying to avoid the EXCEPTION (and probably introducing code 
that's far more bug-prone in the process).

So in reality, the only people likely to notice even something as large 
as a 10% hit are those that were already close to maxing out their 
hardware anyway.

The downside to leaving stuff like this off by default is users won't 
remember it's there when they need it. At best, that means they spend 
more time debugging something than they need to. At worse, it means they 
suffer a production outage for longer than they need to, and that can 
easily exceed many months/years worth of the extra cost from the 
monitoring overhead.

>> > We can talk about this feature all we want, but if we are not willing to
>> > be realistic in how much performance penalty the _average_ user is willing
>> > to lose to have this monitoring, I fear we will make little progress on
>> > this feature.
> OK, 10% was an overstatement.  Anyway, as Amit said, we can discuss the
> default value based on the performance evaluation before release.
>
> As another idea, we can stand on middle ground.  Interestingly, MySQL also
> enables their event monitoring (Performance Schema) by default, but not all
> events are collected.  I guess frequently encountered events are not
> collected by default to minimize the overhead.

That's what we currently do with several track_* and log_*_stats GUCs, 
several of which I forgot even existed until just now. Since there's 
some question over the actual overhead, maybe that's a prudent approach 
for now, but I think we should be striving to enable these things ASAP.
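
For reference, those existing knobs already follow the off-by-default pattern in postgresql.conf (defaults shown as shipped in the 9.x releases; check your version's documentation before relying on them):

```ini
# Existing optional instrumentation, shipped disabled or minimal
# because of its potential overhead:
track_io_timing = off        # per-block I/O timing; reads the system clock
track_functions = none       # none | pl | all
log_parser_stats = off       # per-statement resource usage written to
log_planner_stats = off      # the server log, one GUC per subsystem
log_executor_stats = off
log_statement_stats = off    # cannot be on together with the per-subsystem ones
```

Wait event monitoring defaulting to off would simply extend this established pattern.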
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461



Re: Wait events monitoring future development

From
"Tsunakawa, Takayuki"
Date:
From: pgsql-hackers-owner@postgresql.org
> Lets put this in perspective: there's tons of companies that spend thousands
> of dollars per month extra by running un-tuned systems in cloud environments.
> I almost called that "waste" but in reality it should be a simple business
> question: is it worth more to the company to spend resources on reducing
> the AWS bill or rolling out new features?
> It's something that can be estimated and a rational business decision made.
> 
> Where things become completely *irrational* is when a developer reads
> something like "plpgsql blocks with an EXCEPTION handler are more expensive"
> and they freak out and spend a bunch of time trying to avoid them, without
> even the faintest idea of what that overhead actually is.
> More important, they haven't the faintest idea of what that overhead costs
> the company, vs what it costs the company for them to spend an extra hour
> trying to avoid the EXCEPTION (and probably introducing code that's far
> more bug-prone in the process).
> 
> So in reality, the only people likely to notice even something as large
> as a 10% hit are those that were already close to maxing out their hardware
> anyway.
> 
> The downside to leaving stuff like this off by default is users won't
> remember it's there when they need it. At best, that means they spend more
> time debugging something than they need to. At worse, it means they suffer
> a production outage for longer than they need to, and that can easily exceed
> many months/years worth of the extra cost from the monitoring overhead.

I rather like this way of positive thinking.  It will be better to think of the event monitoring as a positive feature
for (daily) proactive improvement, not only as a debugging feature, which gives a negative image.  For example, pgAdmin4
can display the 10 most time-consuming events and their solutions.  The DBA initially places the database and WAL on the
same volume.  As the system grows and the write workload increases, the DBA can get a suggestion from pgAdmin4 that he
can prepare for the system's growth by placing WAL on another volume to reduce WALWriteLock wait events.  This is not
debugging, but proactive monitoring.
 


> > As another idea, we can stand on the middle ground.  Interestingly, MySQL
> also enables their event monitoring (Performance Schema) by default, but
> not all events are collected.  I guess highly encountered events are not
> collected by default to minimize the overhead.
> 
> That's what we currently do with several track_* and log_*_stats GUCs,
> several of which I forgot even existed until just now. Since there's question
> over the actual overhead maybe that's a prudent approach for now, but I
> think we should be striving to enable these things ASAP.

Agreed.  And as Bruce said, it may be better to be able to disable collection of some events that have a visible impact
on performance.
 

Regards
Takayuki Tsunakawa


Re: Wait events monitoring future development

From
Craig Ringer
Date:
On 10 August 2016 at 07:09, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
 
 
The downside to leaving stuff like this off by default is users won't remember it's there when they need it. At best, that means they spend more time debugging something than they need to. At worse, it means they suffer a production outage for longer than they need to, and that can easily exceed many months/years worth of the extra cost from the monitoring overhead.

Yeah... and I've got to say, the whole "it'll hurt benchmarks if it's on by default" argument falls flat on its face when you look at our defaults for shared_buffers, etc.
 
If you don't tune Pg, it runs reliably, but slowly. If this proves to have "reasonable" overhead, I'd be inclined to say it should just be on. I frequently wish auto_explain and pg_stat_statements were in-core and on-by-default so when someone calls saying things got slow the historical data is already there. I'm sure this'll be the same.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Wait events monitoring future development

From
Alexander Korotkov
Date:
On Tue, Aug 9, 2016 at 12:47 AM, Ilya Kosmodemiansky <ilya.kosmodemiansky@postgresql-consulting.com> wrote:
On Mon, Aug 8, 2016 at 7:03 PM, Bruce Momjian <bruce@momjian.us> wrote:
> It seems asking users to run pg_test_timing before deploying to check
> the overhead would be sufficient.

I'm not sure. Time measurement for waits is slightly more complicated
than time measurement for EXPLAIN ANALYZE: a demanding workload plus
gettimeofday used in a straightforward manner can cause huge
overhead.

What makes you think so?  Both my thoughts and observations are the opposite: it's way easier to get huge overhead from EXPLAIN ANALYZE than from measuring wait events.  Current wait events are themselves quite heavyweight, involving syscalls, context switches and so on.  In contrast, EXPLAIN ANALYZE calls gettimeofday for very cheap operations, like transferring a tuple from one executor node to another.
 
That's why proper testing is important: if we see a
significant performance drop with, for example, large
shared_buffers at the same concurrency, that shows gettimeofday is
too expensive to use. Am I correct that we do not have such accurate
tests now?

Do you think that large shared_buffers is a kind of stress test for wait events monitoring? If so, why?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
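The gettimeofday cost debated above can be approximated from user space. A rough Python sketch (the absolute numbers are platform-dependent and only illustrative; `time.perf_counter_ns` stands in for gettimeofday):

```python
import time

def timer_call_overhead_ns(calls=100_000):
    """Estimate the per-call cost of reading the clock."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        time.perf_counter_ns()   # stand-in for a gettimeofday() call
    end = time.perf_counter_ns()
    return (end - start) / calls

overhead = timer_call_overhead_ns()
# With a fast clocksource (e.g. TSC on Linux) this is typically tens of
# nanoseconds; with a slow one (hpet, acpi_pm) it can reach microseconds,
# which is exactly the difference pg_test_timing is meant to expose.
```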

Re: Wait events monitoring future development

From
Alexander Korotkov
Date:
On Tue, Aug 9, 2016 at 5:37 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Aug  9, 2016 at 02:06:40AM +0000, Tsunakawa, Takayuki wrote:
> I hope wait event monitoring will be on by default even if the overhead is not
> almost zero, because the data needs to be readily available for faster
> troubleshooting.  IMO, the benefit would be worth even 10% overhead.  If you
> disable it by default because of overhead, how can we convince users to enable
> it in production systems to solve some performance problem?  I’m afraid severe
> users would say “we can’t change any setting that might cause more trouble, so
> investigate the cause with existing information.”

If you want to know why people are against enabling this monitoring by
default, above is the reason.  What percentage of people do you think
would be willing to take a 10% performance penalty for monitoring like
this?  I would bet very few, but the argument above doesn't seem to
address the fact it is a small percentage.

Just two notes from me:

1) A 10% overhead from monitoring wait events is just an idea without any proof so far.
2) We already have functionality which trades insight into the database for far larger overhead:  auto_explain.log_analyze = true can slow queries down *severalfold*.  Do you think we should remove it?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 

Re: Wait events monitoring future development

From
Bruce Momjian
Date:
On Wed, Aug 10, 2016 at 05:14:52PM +0300, Alexander Korotkov wrote:
> On Tue, Aug 9, 2016 at 5:37 AM, Bruce Momjian <bruce@momjian.us> wrote:
> 
>     On Tue, Aug  9, 2016 at 02:06:40AM +0000, Tsunakawa, Takayuki wrote:
>     > I hope wait event monitoring will be on by default even if the overhead
>     is not
>     > almost zero, because the data needs to be readily available for faster
>     > troubleshooting.  IMO, the benefit would be worth even 10% overhead.  If
>     you
>     > disable it by default because of overhead, how can we convince users to
>     enable
>     > it in production systems to solve some performance problem?  I’m afraid
>     severe
>     > users would say “we can’t change any setting that might cause more
>     trouble, so
>     > investigate the cause with existing information.”
> 
>     If you want to know why people are against enabling this monitoring by
>     default, above is the reason.  What percentage of people do you think
>     would be willing to take a 10% performance penalty for monitoring like
>     this?  I would bet very few, but the argument above doesn't seem to
>     address the fact it is a small percentage.
> 
> 
> Just two notes from me:
> 
> 1) 10% overhead from monitoring wait events is just an idea without any proof
> so soon.
> 2) We already have functionality which trades insight into database with way
> more huge overhead.  auto_explain.log_analyze = true can slowdown queries *in
> times*.  Do you think we should remove it?

The point is not removing it, the point is whether
auto_explain.log_analyze = true should be enabled by default, and I
think no one wants to do that.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Wait events monitoring future development

From
Satoshi Nagayasu
Date:
2016/08/10 23:22 "Bruce Momjian" <bruce@momjian.us>:
> The point is not removing it, the point is whether
> auto_explain.log_analyze = true should be enabled by default, and I
> think no one wants to do that.

Agreed.

If people are facing with some difficult situation in terms of performance, they may accept some (one-time) overhead to
resolve the issue.
But if they don't have (recognize) any issue, they may not.

That's one of the realities according to my experiences.

Regards,

Re: Wait events monitoring future development

From
Bruce Momjian
Date:
On Wed, Aug 10, 2016 at 11:37:36PM +0900, Satoshi Nagayasu wrote:
> Agreed.
> 
> If people are facing with some difficult situation in terms of performance,
> they may accept some (one-time) overhead to resolve the issue.
> But if they don't have (recognize) any issue, they may not.
> 
> That's one of the realities according to my experiences.

Yes.  Many people are arguing for specific defaults based on what _they_
would want, not what the average user would want.  Sophisticated users
will know about this and turn it on when desired.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Wait events monitoring future development

From
Robert Haas
Date:
On Tue, Aug 9, 2016 at 12:07 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> As another idea, we can stand on the middle ground.  Interestingly, MySQL also enables their event monitoring
> (Performance Schema) by default, but not all events are collected.  I guess highly encountered events are not collected
> by default to minimize the overhead.

Yes, I think that's a sensible approach.  I can't see enabling by
default a feature that significantly regresses performance.  We work
too hard to improve performance to throw very much of it away for any
one feature, even a feature that a lot of people like.  What I really
like about what got committed to 9.6 is that it's so cheap we should
be able to use for lots of other things - latch events, network I/O,
disk I/O, etc. without hurting performance at all.

But if we start timing those events, it's going to be really
expensive.  Even just counting them or keeping a history will cost a
lot more than just publishing them while they're active, which is what
we're doing now.
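The three cost tiers above (publish while active, count, time) can be sketched in the abstract. This is hypothetical Python, not PostgreSQL code; it only illustrates why each tier adds per-event work:

```python
import time

N = 50_000
EVENT = "LWLock"

# Tier 1: publish the current wait while it is active -- a single plain
# store per event, which is roughly what the 9.6 facility does.
current_wait = None
for _ in range(N):
    current_wait = EVENT
    current_wait = None

# Tier 2: keep per-event counters -- an extra read-modify-write per event.
counts = {}
for _ in range(N):
    counts[EVENT] = counts.get(EVENT, 0) + 1

# Tier 3: time every event -- two clock reads per event; this is the
# expensive part the thread is arguing about.
total_ns = 0
for _ in range(N):
    begin = time.perf_counter_ns()
    end = time.perf_counter_ns()
    total_ns += end - begin
```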

> BTW, I remember EnterpriseDB has a wait event monitoring feature.  Is it disabled by default?  What was the overhead?

Timed events in Advanced Server are disabled by default.  I haven't
actually tested the overhead myself and I don't remember exactly what
the numbers were the last time someone else did, but I think if you
turned edb_timed_statistics on, it's pretty expensive.  If we can
agree on something sensible here, I imagine we'll get rid of that
feature in Advanced Server in favor of whatever the community settles
on.  But if the community agrees to turn on something by default that
costs a measurable percentage in performance, I predict that Advanced
Server 10 will ship with a different default for that feature than
PostgreSQL 10.

Personally, I think too much of this thread (and previous threads) is
devoted to arguing about whether it's OK to make performance worse,
and by how much we'd be willing to make it worse.  What I think we
ought to be talking about is how to design a feature that produces the
most useful data for the least performance cost possible, like by
avoiding measuring wait times for events that are very frequent or
waits that are very short.  Or, maybe we could have a background
process that updates a timestamp in shared memory every millisecond,
and other processes can read that value instead of making a system
call.  I think on Linux systems with fast clocks the operating system
basically does something like that for you, but there might be other
systems where it helps.  Of course, it could also skew the results if
the system is so overloaded that the clock-updater process gets
descheduled for a lengthy period of time.
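On Linux the kernel already maintains a tick-granularity timestamp that can be read cheaply, which is roughly the shared-memory-timestamp idea in clock form. A hedged Python sketch (CLOCK_MONOTONIC_COARSE is Linux-specific, hence the fallback):

```python
import time

# The coarse clock is updated once per kernel tick (~1-4 ms) and is cheap
# to read -- analogous to a background process publishing a timestamp in
# shared memory that other backends read instead of making a system call.
COARSE = getattr(time, "CLOCK_MONOTONIC_COARSE", time.CLOCK_MONOTONIC)

def cheap_now():
    return time.clock_gettime(COARSE)

a = cheap_now()
b = cheap_now()
# b may equal a: the coarse clock trades millisecond-level resolution for
# speed, which is acceptable when the waits being measured themselves last
# milliseconds -- and shows the skew risk if updates ever stall.
```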

Anyway, I disagree with the idea that this feature is stalled or
blocked in some way.  I (and quite a few other people, though not
everyone) oppose making performance significantly worse in the default
configuration.  I oppose that regardless of whether it is a
hypothetical patch for this feature that causes the problem or whether
it is a hypothetical patch for some other feature that causes the
problem.  I am not otherwise opposed to more work in this area; in
fact, I'm rather in favor of it.  But you can count on me to argue
against pretty much everything that causes a performance regression,
whatever the reason. Virtually every release, at least one developer
proposes some patch that slows the server down by "only" 1-2%.  If
we'd accepted all of the patches that were shot down because of such
impacts, we'd have lost a very big chunk of performance between the
time I started working on PostgreSQL and now.

As it is, our single-threaded performance seems to have regressed
noticeably since 9.1:

http://bonesmoses.org/2016/01/08/pg-phriday-how-far-weve-come/

I think that's awful.  But if we'd accepted all of those patches that
cost "only" one or two percentage points, it would probably be -15% or
-25% rather than -4.4%.  I think that if we want to really be
successful as a project, we need to make that number go UP, not down.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Wait events monitoring future development

From
Andres Freund
Date:
Hi,

On 2016-08-07 14:03:17 +0200, Ilya Kosmodemiansky wrote:
> Wait event monitoring looks once again stuck on the way through community
> approval in spite of huge progress done last year in that direction.

I see little evidence of that. If you consider "please do some
reasonable benchmarks" as being stuck...