Thread: Re: [Proposal] Add accumulated statistics for wait event

Re: [Proposal] Add accumulated statistics for wait event

From
Michael Paquier
Date:
On Mon, Jul 23, 2018 at 04:04:42PM +0900, 임명규 wrote:
> This proposal is about recording additional statistics of wait events.

You should avoid sending things in html format, text format being
recommended on those mailing lists...  The patch applies after using
patch -p0 by the way.

I would recommend that you generate your patches using "git
format-patch".  Here are general guidelines on the matter:
https://wiki.postgresql.org/wiki/Submitting_a_Patch
Please study those guidelines; they are helpful if you want to get
yourself familiar with the community process.

I have comments about your patch.  First, I don't think that you need to
count precisely the number of wait events triggered: usually, when it
comes to analyzing a workload's bottleneck, what counts is a periodic
*sampling* of events, patterns which can already be fetched from
pg_stat_activity and stored, say, in a different place.  This can be
achieved easily by using a cron job with an INSERT SELECT query which
adds data to a relation storing the event counts.  I took the time to
look at your patch, and here is some feedback.
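
For illustration, that kind of cron-driven sampling could look roughly like
this (the table name and columns below are only an example, not something
core or the patch provides):

CREATE TABLE wait_event_samples (
    sample_time     timestamptz NOT NULL DEFAULT now(),
    wait_event_type text,
    wait_event      text,
    backends        int
);

-- run periodically, e.g. once per second from a cron job
INSERT INTO wait_event_samples (wait_event_type, wait_event, backends)
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active' AND wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event;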

This does not need a configure switch.  I assume that what you did is
good to learn the internals of ./configure though.

There is no documentation.  What does the patch do?  What is it useful
for?

+       case PG_WAIT_IPC:
+           arrayIndex = NUM_WAIT_LWLOCK +
+               NUM_WAIT_LOCK + NUM_WAIT_BUFFER_PIN +
+               NUM_WAIT_ACTIVITY + NUM_WAIT_CLIENT +
+               NUM_WAIT_EXTENSION + eventId;
+           break;
This is ugly and unmaintainable style.  You could perhaps have
considered an enum instead.

pg_stat_get_wait_events should be a set-returning function, which you
could filter at SQL level using a PID, so there is no need to pass it as
an argument.
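
For illustration, such a set-returning function might then be used like
this (the column names are only a guess at what the patch could expose):

SELECT wait_event_type, wait_event, calls
FROM pg_stat_get_wait_events()        -- one row per backend and wait event
WHERE pid = 12345                     -- filter at SQL level instead of a PID argument
ORDER BY calls DESC;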

What's the performance penalty?  I am pretty sure that this is
measurable as wait events are stored for a backend for each I/O
operation as well, and you are calling a C routine within an inlined
function which is designed to be light-weight, doing only a four-byte
atomic operation.
--
Michael


Re: [Proposal] Add accumulated statistics for wait event

From
Egor Rogov
Date:
Hi,

That will be a great feature.


On 23.07.2018 10:53, Michael Paquier wrote:
> I have comments about your patch.  First, I don't think that you need to
> count precisely the number of wait events triggered: usually, when it
> comes to analyzing a workload's bottleneck, what counts is a periodic
> *sampling* of events

If I get it right, this patch is not about sampling. It gathers
cumulative statistics, which are regrettably missing in PostgreSQL.
That's why it should not only count the exact number of wait events;
min time and stddev would also be helpful (as in pg_stat_statements).
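
For illustration, a cumulative view along those lines could expose
pg_stat_statements-style columns (all names below are hypothetical, not
what the patch currently provides):

SELECT wait_event_type, wait_event,
       calls,                                        -- how many times the event was hit
       total_time, min_time, max_time, stddev_time   -- milliseconds
FROM pg_stat_wait_events
ORDER BY total_time DESC;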

--
Egor Rogov
Postgres Professional http://www.postgrespro.com



Re: [Proposal] Add accumulated statistics for wait event

From
Alexander Korotkov
Date:
On Mon, Jul 23, 2018 at 10:53 AM Michael Paquier <michael@paquier.xyz> wrote:
> What's the performance penalty?  I am pretty sure that this is
> measurable as wait events are stored for a backend for each I/O
> operation as well, and you are calling a C routine within an inlined
> function which is designed to be light-weight, doing only a four-byte
> atomic operation.

Yes, the question is the overhead of measuring durations of individual
wait events.  It has been proposed before, and there have been heated
debates about that (see threads [1-3]).  There doesn't seem to be a
conclusion about this feature.  One thing can be said for sure: the
performance penalty heavily depends on OS/hardware/workload.  In some
cases the overhead is negligible, but in other cases it appears to be
huge.

1. https://www.postgresql.org/message-id/flat/559D4729.9080704%40postgrespro.ru
2. https://www.postgresql.org/message-id/flat/CA%2BTgmoYd3GTz2_mJfUHF%2BRPe-bCy75ytJeKVv9x-o%2BSonCGApw%40mail.gmail.com
3. https://www.postgresql.org/message-id/flat/CAG95seUAQVj09KzLwU%2Bz1B-GqdMqerzEkPFR3hn0q88XzMq-PA%40mail.gmail.com

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [Proposal] Add accumulated statistics for wait event

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> This does not need a configure switch.

It probably is there because the OP realizes that most people wouldn't
accept having this code compiled in.

> What's the performance penalty?  I am pretty sure that this is
> measurable as wait events are stored for a backend for each I/O
> operation as well, and you are calling a C routine within an inlined
> function which is designed to be light-weight, doing only a four-byte
> atomic operation.

On machines with slow gettimeofday(), I suspect the cost of this
patch would be staggering.  Even with relatively fast gettimeofday,
it doesn't look acceptable for calls in hot code paths (for instance,
lwlock.c).

A bigger problem is that it breaks stuff.  There are countless
calls to pgstat_report_wait_start/pgstat_report_wait_end that
assume they have no side-effects (for example, on errno) and
can never fail.  I wouldn't trust GetCurrentTimestamp() for either.
If the report_wait calls can't be dropped into code with *complete*
certainty that they're safe, that's a big cost.

Why exactly is this insisting on logging timestamps and not,
say, just incrementing a counter?  I think doing it like this
is almost certain to end in rejection.

            regards, tom lane


RE: Re: [Proposal] Add accumulated statistics for wait event

From
MyungKyu LIM
Date:
 2018-07-23 16:53 (GMT+9), Michael Paquier wrote:
> On Mon, Jul 23, 2018 at 04:04:42PM +0900, 임명규 wrote:
>> This proposal is about recording additional statistics of wait events.
 
> I have comments about your patch.  First, I don't think that you need to
> count precisely the number of wait events triggered: usually, when it
> comes to analyzing a workload's bottleneck, what counts is a periodic
> *sampling* of events, patterns which can already be fetched from
> pg_stat_activity and stored, say, in a different place.

Thanks for your feedback.

This proposal is not about *sampling*.
Accumulated statistics of wait events are useful for solving issues,
because they measure exact data.

In some cases, sampling cannot find the cause of an issue, because it
loses detail. For example, a throughput issue may occur (e.g. disk I/O)
where each individual wait lasts only a few milliseconds. In that case
it is highly likely that sampling will not find the cause.

> This is ugly and unmaintainable style.

I'm sorry, you're right.
Please consider it a proof of concept.

> What's the performance penalty?

I have the same worries. So far I have just tried pgbench several times.
Please let me know if there is a good method to check the performance impact.

Best regards,
MyungKyu, Lim


RE: Re: [Proposal] Add accumulated statistics for wait event

From
MyungKyu LIM
Date:
> On Mon, Jul 23, 2018 at 10:53 AM Michael Paquier <michael@paquier.xyz> wrote:
>> What's the performance penalty?  I am pretty sure that this is
>> measurable as wait events are stored for a backend for each I/O
>> operation as well, and you are calling a C routine within an inlined
>> function which is designed to be light-weight, doing only a four-byte
>> atomic operation.

> Yes, the question is the overhead of measuring durations of individual wait events.  It has been proposed before,
> and there have been heated debates about that (see threads [1-3]).  There doesn't seem to be a conclusion about
> this feature.  One thing can be said for sure: the performance penalty heavily depends on OS/hardware/workload.
> In some cases the overhead is negligible, but in other cases it appears to be huge.

Thanks for the good information.
I agree that a performance penalty exists.
But wait statistics are in demand and useful. In some cases, it is worth sacrificing some performance to have them.

So, what do you think about developing this as an extension? Here is another proposal:
this feature could be implemented as an extension if hooks were provided in the following functions:
 - pgstat_report_wait_start
 - pgstat_report_wait_end
The feature could then be turned on/off by online configuration when necessary.

Best regards,
MyungKyu, Lim
 


RE: Re: [Proposal] Add accumulated statistics for wait event

From
Phil Florent
Date:

Hi,

I am skeptical about accumulated statistics.

pg_stat_activity now gives the necessary information about wait events. It can easily be used with a polling system that sleeps most of the time to limit the overhead. Measuring the duration of individual wait events is not necessary to know how the load is distributed.

You can aggregate the results of the polling by application, query, wait events or whatever you want.

I wrote a script for that purpose which can be used interactively or in batch mode to produce reports, but many solutions exist.
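
As an illustration, once samples of pg_stat_activity are stored in a table (hypothetical name and columns below), such an aggregation is a simple GROUP BY:

SELECT application_name, wait_event_type, wait_event,
       count(*)                                            AS samples,
       round(100.0 * count(*) / sum(count(*)) OVER (), 1)  AS busy_pc
FROM activity_samples
WHERE sample_time > now() - interval '5 minutes'
GROUP BY application_name, wait_event_type, wait_event
ORDER BY samples DESC;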

Best regards Phil






Re: [Proposal] Add accumulated statistics for wait event

From
Tomas Vondra
Date:

On 07/23/2018 03:57 PM, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> This does not need a configure switch.
> 
> It probably is there because the OP realizes that most people wouldn't
> accept having this code compiled in.
> 
>> What's the performance penalty?  I am pretty sure that this is
>> measurable as wait events are stored for a backend for each I/O
>> operation as well, and you are calling a C routine within an inlined
>> function which is designed to be light-weight, doing only a four-byte
>> atomic operation.
> 
> On machines with slow gettimeofday(), I suspect the cost of this
> patch would be staggering.  Even with relatively fast gettimeofday,
> it doesn't look acceptable for calls in hot code paths (for instance,
> lwlock.c).
> 

Yeah. I wonder if we could measure the time for a small fraction of the 
wait events, and estimate the actual duration from that.

> A bigger problem is that it breaks stuff.  There are countless
> calls to pgstat_report_wait_start/pgstat_report_wait_end that
> assume they have no side-effects (for example, on errno) and
> can never fail.  I wouldn't trust GetCurrentTimestamp() for either.
> If the report_wait calls can't be dropped into code with *complete*
> certainty that they're safe, that's a big cost.
> 
> Why exactly is this insisting on logging timestamps and not,
> say, just incrementing a counter?  I think doing it like this
> is almost certain to end in rejection.
> 

Because the number of times you hit a wait event may not correlate with
the time you spent waiting on it, a simple counter is not the most
useful thing.

regards


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [Proposal] Add accumulated statistics for wait event

From
Tomas Vondra
Date:

On 07/24/2018 12:06 PM, MyungKyu LIM wrote:
>   2018-07-23 16:53 (GMT+9), Michael Paquier wrote:
>> On Mon, Jul 23, 2018 at 04:04:42PM +0900, 임명규 wrote:
>>> This proposal is about recording additional statistics of wait events.
>   
>> I have comments about your patch.  First, I don't think that you need to
>> count precisely the number of wait events triggered: usually, when it
>> comes to analyzing a workload's bottleneck, what counts is a periodic
>> *sampling* of events, patterns which can already be fetched from
>> pg_stat_activity and stored, say, in a different place.
> 
> Thanks for your feedback.
> 
> This proposal is not about *sampling*.
> Accumulated statistics of wait events are useful for solving issues,
> because they measure exact data.
> 
> In some cases, sampling cannot find the cause of an issue, because it
> loses detail. For example, a throughput issue may occur (e.g. disk I/O)
> where each individual wait lasts only a few milliseconds. In that case
> it is highly likely that sampling will not find the cause.
> 

I think it's highly likely that it will find the cause. The idea of
sampling is that while you don't measure the timing directly, you can
infer it from the frequency of the wait events in the samples. So if you
see that a backend reports a particular wait event in 75% of samples, it
probably spent 75% of its time waiting on it.
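
As an illustration of that inference, assuming the samples are kept in a
table (hypothetical name and columns), the per-backend share can be
estimated with:

SELECT wait_event,
       count(*) AS samples,
       round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct_of_samples
FROM activity_samples
WHERE pid = 12345
GROUP BY wait_event
ORDER BY pct_of_samples DESC;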

I'm not saying sampling is perfect and it certainly is less convenient 
than what you propose.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


RE: [Proposal] Add accumulated statistics for wait event

From
Phil Florent
Date:

Hi,


> In some cases, sampling cannot find the cause of an issue, because it loses detail.
> For example, a throughput issue may occur (e.g. disk I/O) where each individual wait
> lasts only a few milliseconds.


It loses non-meaningful details, and that is in fact a good point. In this example, sampling will definitely find the cause and won't cost resources.

Defining wait events as precisely as possible is very useful, but knowing the precise duration of each event is less useful in terms of tuning.


Example of sampling + group by/order by percentage of activity:


./t -d 5 -o "application_name, wait_event_type" -o "application_name, wait_event, wait_event_type"              
traqueur 2.05.00 - performance tool for PostgreSQL 9.3 => 11
INFORMATION, no connection parameters provided, connecting to dedicated database ...
INFORMATION, connected to dedicated database traqueur
INFORMATION, PostgreSQL version : 110000
INFORMATION, sql preparation ...
INFORMATION, sql execution ...
 busy_pc | distinct_exe | application_name | wait_event_type
---------+--------------+------------------+-----------------
     206 | 8 / 103      | mperf            |
      62 | 2 / 31       | mperf            | LWLock
      20 | 3 / 10       | mperf            | IO
      12 | 1 / 6        | mperf            | Client
(4 rows)

 busy_pc | distinct_exe | application_name |      wait_event       | wait_event_type
---------+--------------+------------------+-----------------------+-----------------
     206 | 8 / 103      | mperf            |                       |
      62 | 2 / 31       | mperf            | WALWriteLock          | LWLock
      14 | 1 / 7        | mperf            | DataFileImmediateSync | IO
      12 | 1 / 6        | mperf            | ClientRead            | Client
       2 | 1 / 1        | mperf            | DataFileWrite         | IO
       2 | 1 / 1        | mperf            | DataFileRead          | IO
       2 | 1 / 1        | mperf            | WALInitWrite          | IO

No need to know the exact duration of each event to identify the bottleneck(s)... 


Best regards

Phil





Re: [Proposal] Add accumulated statistics for wait event

From
Michael Paquier
Date:
On Tue, Jul 24, 2018 at 04:23:03PM +0000, Phil Florent wrote:
> It loses non-meaningful details, and that is in fact a good point. In this
> example, sampling will definitely find the cause and won't cost
> resources.

The higher the sampling frequency, the more details you get, but also
the more load on the instance.  So if you were able to take an infinity
of samples, where registering the same event multiple times for the same
backend also matters because its overall weight gets higher and it shows
up higher in profiles, then you would be able to converge to the set of
results that this patch adds.  The sampling method, especially its
frequency, is something controlled by the client and not the server.
Approaches like the one proposed here push the load onto the server side,
unconditionally, for *all* backends, and this has its cost.

Even if you have spiky workloads, sampling may miss those.  But even
with counters for each event, you would need to query the table holding
the counters at an insane frequency to perhaps get something out of it,
as you need to sample the counters as well to extract deltas.
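
For illustration, that counter-sampling would look something like this
(the snapshot table and the counters view are hypothetical):

-- periodic capture of the cumulative counters
INSERT INTO wait_event_counter_snapshots (snap_time, wait_event, calls)
SELECT now(), wait_event, calls FROM pg_stat_wait_events;

-- deltas between consecutive snapshots
SELECT snap_time, wait_event,
       calls - lag(calls) OVER (PARTITION BY wait_event ORDER BY snap_time) AS new_calls
FROM wait_event_counter_snapshots
ORDER BY wait_event, snap_time;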

As Tomas has mentioned up-thread, sampling is light-weight, as-is the
current design for wait events.  Even if it is not perfect because it
cannot give exact numbers, it would find bottlenecks in really most
cases, and that's what matters.  If not, increasing the sampling
frequency makes things easier to detect as well.  What would be the
point of taking only one sample every checkpoint for example?

There may be a benefit in having counters; I don't know the answer to
that, though the point would be to make sure that there is a specific
set of workloads where it makes sense.  Still, my gut feeling is that
sampling would be able to detect those anyway.

(I am not a computer scientist by training but a physicist, think fluid
dynamics and turbulence, and I had my load of random events and signal
analysis as well.  All that is just statistics with attempts to approach
reality, where sampling is a life-saver over exactitude of
measurements.)

Adding hooks is not acceptable to me either; those have a cost, and it
is not clear what benefit we can get from putting hooks in such a place
for cases other than sampling and counters...
--
Michael


Re: [Proposal] Add accumulated statistics for wait event

From
Bertrand DROUVOT
Date:
Hello Guys,

As you mentioned Oracle-like active session history sampling in this thread, I just want to let you know that I am working on a brand new extension to provide this feature.

You can find the extension here: https://github.com/pgsentinel/pgsentinel

Basically, you could see it as samplings of pg_stat_activity (one-second interval by default), currently providing more information:

ash_time: the sampling time
blockers: the number of blockers
blockerpid: the pid of the blocker (if blockers = 1), or the pid of one blocker (if blockers > 1)
top_level_query: the top-level statement (in case PL/pgSQL is used)
query: the statement being executed (not normalised as it is in pg_stat_statements, meaning you see the actual values)
cmdtype: the statement type (SELECT, UPDATE, INSERT, DELETE, UTILITY, UNKNOWN, NOTHING)
queryid: the queryid of the statement (the one coming from pg_stat_statements)

Thanks to the queryid field you are able to link the session activity with the sql activity.

It's implemented as an in-memory ring buffer where samples are written at a given (configurable) period,
so the user can see a number of recent samples depending on the (configurable) history size.
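
For reference, a hypothetical query against such a view (assuming it is exposed as pg_active_session_history, as in the project's README, and using only the fields listed above):

SELECT queryid, count(*) AS samples
FROM pg_active_session_history
WHERE ash_time > now() - interval '5 minutes'
GROUP BY queryid
ORDER BY samples DESC
LIMIT 10;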

Current caveats: in the case of a high query rate per pid, you could see top_level_query and query not "correlated" (I saw it at more than 1000 queries per second); query, queryid and cmdtype are still well linked together. This is because those two pieces of information are currently collected independently.

If you want to have a look, give your thoughts, you are welcome.

Bertrand


RE: [Proposal] Add accumulated statistics for wait event

From
Phil Florent
Date:

Hi,

I agree with that. PostgreSQL 10 is really great: tuning tools based on sampling of pg_stat_activity became accurate without any modification.

Best regards

Phil





RE: [Proposal] Add accumulated statistics for wait event

From
"Yotsunaga, Naoki"
Date:
On Thu, July 26, 2018 at 1:25 AM, Michael Paquier wrote:
> Even if you have spiky workloads, sampling may miss those.  But even with counters for each event,
> you would need to query the table holding the counters at an insane frequency to perhaps get
> something out of it, as you need to sample the counters as well to extract deltas.

Hi, I was wondering why PostgreSQL did not have the number of wait events and the wait time that other databases such as Oracle have as a feature, and when I was looking for related threads, I got to this thread.

I am a DB beginner, so please tell me. It says that you can find events that are bottlenecks by sampling, but as you saw above, you cannot find events shorter than the sampling interval, right?
If such a short event has occurred quite a number of times and it adds up to a considerable amount of time in total, can you solve this problem with sampling?
# I have asked, but I could not understand much of the discussion above and I do not know if such a case can exist.
Also, I think that it can be solved by raising the sampling frequency, but then the load will be high, right? I think this method is not very practical.

Moreover, I think that it is not implemented because sampling is good enough as described above and because it affects performance.
How about making the feature switchable on/off and implementing it with the default off?
Do you care about performance degradation during a bottleneck investigation?
When investigating a bottleneck, I think that it is better to find the cause even at the expense of performance.
# If you can detect it with high-frequency sampling, I think that it depends on whether the sampling or the feature (the number of wait events and wait time) is the higher load.

Since I am a DB beginner, I may be saying strange things.
I would be glad if you could tell me.

-----
Naoki Yotsunaga



RE: [Proposal] Add accumulated statistics for wait event

From
Phil Florent
Date:
Hi,
> I am a DB beginner, so please tell me. It says that you can find events that are bottlenecks by sampling, but as you saw above, you cannot find events shorter than the sampling interval, right?

If an event occurs frequently and it is reported in pg_stat_activity, you will catch it again and again while sampling, no matter its duration.
Hence you just need to:
  • Sample the sessions and consider the active ones. You need to know if they are waiting (PostgreSQL now provides detailed wait events) or if they are on the CPU
  • Optionally collect information on the system context at the time of sampling (CPU, memory...); this can be provided by many tools, like the psutil Python library for example
If the client application itself provides information it's even more interesting. With something like program/module/action/client_info/sofar/totalwork in application_name you are able to focus directly on different kinds of activity. It can give you information like "I/O waits are meaningful for my batch activity but not for my OLTP activity; if my goal is to improve response time for end users I have to consider that."
 
Best regards
Phil



Re: [Proposal] Add accumulated statistics for wait event

From
Michael Paquier
Date:
On Thu, Oct 04, 2018 at 09:32:37AM +0000, Phil Florent wrote:
> I am a DB beginner, so please tell me. It says that you can find
> events that are bottlenecks by sampling, but as you saw above, you
> cannot find events shorter than the sampling interval, right?

Yes, which is why it would be as simple as making the interval shorter,
still not so short that it bloats the amount of information fetched,
which needs to be stored and afterwards (perhaps) processed for analysis.
This gets rather close to signal processing.  A simple image is for
example, assuming that event A happens 100 times in an interval of 1s,
and event B only once in the same interval of 1s: if the snapshot
interval is only 1s, then in the worst case A would be treated as an
equal of B, which would be wrong.
--
Michael


RE: [Proposal] Add accumulated statistics for wait event

From
Phil Florent
Date:
Hi,

It's the same logic with any polling system. An integration calculation using the Monte Carlo method with only a few points won't be accurate enough and can even be completely wrong, etc.
Polling is OK to troubleshoot a problem on the fly, but 2 points are not enough. A few seconds are needed to obtain good enough data, e.g. 5-10 seconds of polling with a 0.1 => 0.01 s interval between 2 queries of the activity.
Polling for a few seconds while the user is waiting is normally enough to say whether a significant part of the waits are on the database. It's very important to know that. With 1 hour of accumulated statistics, a DBA will always see something to fix. But if the user waits 10 seconds on a particular screen and 1 second is spent on the database, it often won't directly help.
Polling gives great information with PostgreSQL 10, but it was already useful to catch top queries etc. in older versions.
I always check whether activity is adequately reported by my tool using known cases. I want to be sure it will report things adequately in real-world troubleshooting sessions. Sometimes there are bugs in my tool; once there was an issue with postgres (pgstat_report_activity() was not called by workers during parallel index creation).

Best regards
Phil


RE: [Proposal] Add accumulated statistics for wait event

From
"Yotsunaga, Naoki"
Date:

On Thu, Oct 4, 2018 at 0:54 AM, Phil Florent wrote:

 

Phil, Michael, I appreciate your polite comments.

I understand it as follows: we can find the events if we shorten the sampling interval, but then a lot of information comes out.
# The balance is important.
Also, it is not good unless we have enough samples.
And I have to do various other things besides that.
Is my understanding correct?

If my understanding is right, that seems difficult to me.
Is a DBA really able to solve bottlenecks with sampling?
# Since I am a beginner, I feel that way. Other people may not find it difficult.

What I would like to say is that if we have information on the number of wait events and the wait time (like other DBs), we can investigate more easily.
Of course, I understand that it also affects performance. So, I suggest a feature that can be switched on and off, with the default off.

 

-----

Naoki, Yotsunaga.

Re: [Proposal] Add accumulated statistics for wait event

From
legrand legrand
Date:
Bertrand DROUVOT wrote
> Hello Guys,
> 
> As you mentioned Oracle like active session history sampling in this
> thread, I just want to let you know that I am working on a brand new
> extension to provide this feature.
> 
> You can find the extension here: https://github.com/pgsentinel/pgsentinel
> 
> [...]
> 
> If you want to have a look, give your thoughts, you are welcome.
> 
> Bertrand


+1!

just a regret: ash sampling interval can not be smaller than a second ;o(

Cordialement
PAscal




--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


RE: [Proposal] Add accumulated statistics for wait event

From
"Yotsunaga, Naoki"
Date:

On Thu, Oct 4, 2018 at 8:22 PM, Yotsunaga Naoki wrote:

 

Hi, I understood your comment about statistics and thought about it once again. In the case of sampling, are there enough samples to investigate?
In the case of a long SQL statement, I think that it is possible to obtain a sufficient number of samples.

However, in the case of an SQL statement of about 1 minute, only 60 samples can be obtained at most.
# Because of legrand's comment:
https://www.postgresql.org/message-id/1539158356795-0.post%40n3.nabble.com

Does this sampling number of 60 give the information that I really want?
Might it not miss the real problem?

---------------------------------------

Naoki, Yotsunaga.

RE: [Proposal] Add accumulated statistics for wait event

From
legrand legrand
Date:
Hello,
You are right, sampling has to be "tuned" regarding the event(s) you want to
catch.

A sampling interval of 1 second is good for treatments that take hours, but not enough for an analysis at the minute or second scale.

May I invite you to try it, using PASH-viewer (github) with pgsentinel (github).

Changing the pgsentinel.c sampling from 1 second

rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
               ash_sampling_period * 1000L, PG_WAIT_EXTENSION);

to 1/10 second

rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
               ash_sampling_period * 100L, PG_WAIT_EXTENSION);

seems a good balance for me (for analysis periods from a few seconds to
minutes).

Regards
PAscal



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


RE: [Proposal] Add accumulated statistics for wait event

From
Phil Florent
Date:
Hi,

> Is a DBA really able to solve bottlenecks with sampling?
>
> What I would like to say is that if we have information on the number of wait events and the wait time (like other DBs), we can investigate more easily.

Yes, you will be able to solve bottlenecks with sampling. In interactive mode, a 1 s interval is probably too large; I use 0.1 s - 0.01 s with my tool and it is normally OK. In batch mode I use 1 s => 10 s. If you want to visualize the results it's easy to use a dedicated tool, and bottlenecks will clearly appear.
Since grafana is now able to connect directly to a postgresql source, I use it to display the information collected from pg_stat_activity and psutil (e.g. https://pgphil.ovh/traqueur_dashboard_02.php - the page is written in French but the panels are in English)

Other DBs have accumulated statistics, but you can notice that sampling is also their most modern method.
E.g. Oracle: 20 years ago you already had tools like "utlbstat/utlestat". Then you had "statspack". Those tools were based on accumulated statistics and the reports were based on differences between 2 points. It was useful to solve major problems but it was limited and not precise enough in many cases.

The preferred feature to identify bottlenecks in the Oracle world is now ASH (active session history). It can help with major problems, specific problems AND it can identify short blockages.
Too bad it is licensed as an option of their Enterprise Edition but similar tools exist and they are also based on sampling of the activity.

With the "official" ASH, sampling and archiving are done internally and you have a circular memory zone dedicated to the feature. Hence the overhead is lower but that's all.

The most advanced interactive tool is called "snapper" and it is also based on sampling.

Best regards
Phil




RE: [Proposal] Add accumulated statistics for wait event

From
"Yotsunaga, Naoki"
Date:

On Mon, Oct 29, 2018 at 1:52 AM, Phil Florent wrote:

 

Hi, thank you for the comments.

> Yes, you will be able to solve bottlenecks with sampling. In interactive mode, a 1 s interval is probably too large; I use 0.1 s - 0.01 s with my tool and it is normally OK.

With the tool you are using, can you sample at intervals shorter than 1 second?
If you can, you can get a sufficient number of samples and you can also catch short events.

> Since grafana is now able to connect directly to a postgresql source, I use it to display the information collected from pg_stat_activity and psutil (e.g. https://pgphil.ovh/traqueur_dashboard_02.php - the page is written in French but the panels are in English)

It is wonderful to visualize the data, especially for beginners like me.

> Other DBs have accumulated statistics, but you can notice that sampling is also their most modern method.
> E.g. Oracle: 20 years ago you already had tools like "utlbstat/utlestat". Then you had "statspack". Those tools were based on accumulated statistics and the reports were based on differences between 2 points. It was useful to solve major problems but it was limited and not precise enough in many cases.

> The preferred feature to identify bottlenecks in the Oracle world is now ASH (active session history). It can help with major problems, specific problems AND it can identify short blockages.
> Too bad it is licensed as an option of their Enterprise Edition but similar tools exist and they are also based on sampling of the activity.

> With the "official" ASH, sampling and archiving are done internally and you have a circular memory zone dedicated to the feature. Hence the overhead is lower but that's all.

> The most advanced interactive tool is called "snapper" and it is also based on sampling.

Thanks. I will check them.

With the current bottleneck investigation method, from sampling I can know the number (ratio) of wait events, and then investigate those with a high count (ratio) first.
Do you agree with this understanding?

 

---------------------------------------

Naoki, Yotsunaga.

RE: [Proposal] Add accumulated statistics

From
"Yotsunaga, Naoki"
Date:

On Sat, Nov 3, 2018 at 1:28 AM, Phil Florent wrote:

 

> 2) it consumes system resources

While the system is running, you are always sampling system information, aren't you? Like Oracle ASH.

If so, does sampling have no significant impact on performance, even if the interval is 0.01 s or more?

> The main interest of sampling is to discard negligible activity to allow the DBA to work on meaningful queries and events.

In other words, do you mean narrowing down the candidate problems?

 

---------------------------

Naoki Yotsunaga

 

RE: [Proposal] Add accumulated statistics

From
"Yotsunaga, Naoki"
Date:

On Mon, Nov 5, 2018 at 4:26 PM, Naoki Yotsunaga wrote:

 

>> 2) it consumes system resources

> While the system is running, you are always sampling system information, aren't you? Like Oracle ASH.

I don't understand well how sampling is used.
In which of the following scenarios do you use sampling? Or is it both? Or a different one?

A) Perform sampling to obtain information on the entire DB while the DB is in operation.

B) Roughly knowing which process is slow, perform sampling on that process when investigating the reason for the delay.

 

---------------------------

Naoki Yotsunaga

 

RE: [Proposal] Add accumulated statistics for wait event

From
"Yotsunaga, Naoki"
Date:
On Sun, Oct 28, 2018 at 6:39 PM, legrand legrand wrote:

Hi, thanks for the comments.
I had overlooked your reply.

> You are right, sampling has to be "tuned" regarding the event(s) you want to catch.
I see. So for tuning, you need to know the length of the processing you want to sample?

> May I invite you to try it, using PASH-viewer (github) with pgsentinel (github).
I'll check and try it.

------------------
Naoki Yotsunaga




Re: [Proposal] Add accumulated statistics

From
Bruce Momjian
Date:
On Tue, Nov  6, 2018 at 04:26:03AM +0000, Yotsunaga, Naoki wrote:
> On Sat, Nov 3, 2018 at 1:28 AM, Phil Florent wrote:
> 
>  
> 
> >2) it consumes system resources
> 
> While the system is running, you are always sampling system information, do not
> you? Like Oracle ASH.
> 
> If so, does sampling have no significant impact on performance? Even if the
> interval is 0.01 s or more.

I am replying late, but one of the reasons that sampling is used is that
decreasing the sampling interval increases the overhead of the sampling
process, but doesn't affect the running backends.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


RE: [Proposal] Add accumulated statistics

From
"Yotsunaga, Naoki"
Date:
On Wed, Nov 21, 2018 at 9:27 PM, Bruce Momjian wrote:

Hi, thank you for the information.
I understood that sampling is effective for investigating wait events.

By the way, you can see the number of wait events with "LWLOCK_STATS", right?
Was this feature implemented because it is necessary to know the number of wait events for an investigation?
If so, isn't the number of wait events useful information?
Right now I need to rebuild PostgreSQL to get this information, which is inconvenient.

So, how about exposing the number of wait events in a view?
Also, I think that it would be useful to know the wait time.
It is easy to investigate when it is clearly known how much of the processing time is occupied by waiting.

-- 
Naoki Yotsunaga



RE: [Proposal] Add accumulated statistics

From
"Tsunakawa, Takayuki"
Date:
From: Yotsunaga, Naoki [mailto:yotsunaga.naoki@jp.fujitsu.com]
> By the way, you can see the number of wait events with "LWLOCK_STATS", right?
> Is this function implemented because it is necessary to know the number
> of waiting events for investigation?
> If so, is not that the number of wait events is useful information?
> Now, I need to rebuild to get this information and I feel inconvenience.
> 
> So, how about checking the number of wait events in the view?
> Also, I think that it will be useful if you know the waiting time.
> I think that it is easy to investigate when it is clearly known how long
> waiting time is occupied with processing time.

That's interesting.  It should be convenient for PG developers to be able to see wait events of each session and the whole instance to improve PostgreSQL.  I developed a flight recorder feature for a proprietary database (not based on PostgreSQL), which recorded various events, including waits, in a ring buffer for each session.  It could dump the ring buffers into external files in CSV format, so that we can load them into a database or a spreadsheet to analyze the time model of SQL execution.  That helped us a lot to eliminate many bottlenecks.  MySQL says a similar thing about its Performance Schema:

https://dev.mysql.com/doc/refman/8.0/en/performance-schema-examples.html

"Tune the code (this applies to storage engine or server developers only)."



Regards
Takayuki Tsunakawa




RE: [Proposal] Add accumulated statistics

From
"Tsunakawa, Takayuki"
Date:
Hi all,

I think sampling like Oracle ASH should work for the DBA to find probable bottlenecks in many cases (so I hope PostgreSQL will incorporate it...).  On the other hand, it seems to have the following disadvantages, some of which others have already pointed out:

1. Doesn't provide precise data
Sampling could miss intermittent short waits, e.g., buffer content lock waits during checkpoints.  This might make it difficult or impossible to solve transient performance problems, such as infrequent 100 millisecond response times while the normal response time is a few milliseconds.

The proposed wait event collection doesn't miss anything.

2. Overuses resources
We may be able to shorten the sampling interval to 10 ms or even 1 ms to detect short periods of problems.  However, the sampled data of active sessions become voluminous in memory and storage.  It would take longer to analyze those samples.  Also, the background sampling process prevents the CPU core from becoming idle to save power, which bgwriter and walwriter try to avoid by hibernation.

The proposed wait event collection just records what actually happened.  No waste.  Would it use many resources if waits happen frequently?  That leads to our motivation to reduce waits.

3. Cannot determine the impact or severity of waits just by sampling or counting without time
As the following MySQL and Oracle manual articles describe, precise measurement of wait count and time helps to judge the impact and justify the remedy.  They can measure the whole SQL execution and its various processing steps (parse, plan, sort, etc.) as well as waits, so that the most significant areas can be determined.

Also, sampling cannot tell whether a single wait took long or the same waits occurred repeatedly in succession (at least not easily).  Do the sampled waits indicate an abnormal I/O (which took 2 ms while the normal time is 50 us)?


[MySQL]

Chapter 26 MySQL Performance Schema
https://dev.mysql.com/doc/refman/8.0/en/performance-schema.html
--------------------------------------------------
The Performance Schema monitors server events. An “event” is anything the server does that takes time and has been instrumented so that timing information can be collected. In general, an event could be a function call, a wait for the operating system, a stage of an SQL statement execution such as parsing or sorting, or an entire statement or group of statements. Event collection provides access to information about synchronization calls (such as for mutexes), file and table I/O, table locks, and so forth for the server and for several storage engines.

Current events are available, as well as event histories and summaries. This enables you to determine how many times instrumented activities were performed and how much time they took. Event information is available to show the activities of specific threads, or activity associated with particular objects such as a mutex or file.
--------------------------------------------------


[Oracle]


https://docs.oracle.com/en/database/oracle/oracle-database/18/tdppt/automatic-database-performance-monitoring.html#GUID-32E92AEC-AF1A-4602-B998-3250920CD3BE
--------------------------------------------------
The goal of database performance tuning is to reduce the DB time of the system for a given workload. By reducing DB time, the database can support more user requests by using the same or fewer resources. ADDM reports system resources that are using a significant portion of DB time as problem areas and sorts them in descending order by the amount of related DB time spent.
--------------------------------------------------


Instance Tuning Using Performance Views

https://docs.oracle.com/en/database/oracle/oracle-database/18/tgdba/instance-tuning-using-performance-views.html#GUID-07982549-507F-4465-8843-7F753BCF8F99
--------------------------------------------------
Wait event statistics include the number of times an event was waited for and the time waited for the event to complete. If the initialization parameter TIMED_STATISTICS is set to true, then you can also see how long each resource was waited for.

To minimize user response time, reduce the time spent by server processes waiting for event completion. Not all wait events have the same wait time. Therefore, it is more important to examine events with the most total time waited rather than wait events with a high number of occurrences. Usually, it is best to set the dynamic parameter TIMED_STATISTICS to true at least while monitoring performance.
 
--------------------------------------------------



https://docs.oracle.com/en/database/oracle/oracle-database/18/tgdba/measuring-database-performance.html#GUID-811E9E65-C64A-4028-A90E-102BBFF6E68F
5.2.3 Using Wait Events without Timed Statistics
--------------------------------------------------
If TIMED_STATISTICS is set to FALSE, then the amount of time spent waiting for an event is not available. Therefore, it is only possible to order wait events by the number of times each event was waited for. Although the events with the largest number of waits might indicate a potential bottleneck, they might not be the main bottleneck. This situation can happen when an event is waited for a large number of times, but the total time waited for that event is small. Conversely, an event with fewer waits might be a bigger bottleneck if the wait time accounts for a significant proportion of the total wait time. Without the wait times to use for comparison, it is difficult to determine whether a wait event is worth investigating.
 
--------------------------------------------------


10.2.2 Using Wait Event Statistics to Drill Down to Bottlenecks
--------------------------------------------------
The most effective way to use wait event data is to order the events by the wait time. This is only possible if TIMED_STATISTICS is set to true. Otherwise, the wait events can only be ranked by the number of times waited, which is often not the ordering that best represents the problem.

To get an indication of where time is spent, follow these steps:

1. Examine the data collection for V$SYSTEM_EVENT. The events of interest should be ranked by wait time.
Identify the wait events that have the most significant percentage of wait time. ...
Alternatively, look at the Top 5 Timed Events section at the beginning of the Automatic Workload Repository report. This section automatically orders the wait events (omitting idle events), and calculates the relative percentage:
 

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                % Total
Event                                         Waits    Time (s) Call Time
-------------------------------------- ------------ ----------- ---------
CPU time                                                    559     88.80
log file parallel write                       2,181          28      4.42
SQL*Net more data from client               516,611          27      4.24
db file parallel write                       13,383          13      2.04
db file sequential read                         563           2       .27

2. Look at the number of waits for these events, and the average wait time. For example, for I/O related events, the average time might help identify whether the I/O system is slow. The following example of this data is taken from the Wait Event section of the AWR report:
 

                                                             Avg
                                                Total Wait   wait     Waits
Event                           Waits  Timeouts   Time (s)   (ms)      /txn
--------------------------- --------- --------- ---------- ------ ---------
log file parallel write         2,181         0         28     13      41.2
SQL*Net more data from clie   516,611         0         27      0   9,747.4
db file parallel write         13,383         0         13      1     252.5

3. The top wait events identify the next places to investigate. A table of common wait events is listed in Table 10-1.
It is usually a good idea to also have a quick look at high-load SQL.

4. Examine the related data indicated by the wait events to see what other information this data provides. Determine
whether this information is consistent with the wait event data. In most situations, there is enough data to begin
developing a theory about the potential causes of the performance bottleneck.
 

5. To determine whether this theory is valid, cross-check data you have examined with other statistics available for
consistency. The appropriate statistics vary depending on the problem, but usually include load profile-related data in
V$SYSSTAT, operating system statistics, and so on. Perform cross-checks with other data to confirm or refute the
developing theory.
 
--------------------------------------------------



So, why don't we have the proposed wait event count/time data?  I hope we can nurture this to become a database
profiling tool like MySQL and Oracle.  This is the first step.  I think it would be useful to have both sampling and
precise statistics.  Oracle has both, and MySQL has the latter (I don't know why MySQL doesn't provide sampling, because
the Performance Schema should probably have been developed after Oracle's ASH.)
 

What would make us conservative about doing this?  Skimming the old thread, the remaining concern is the timer
overhead. As the following article suggests, some lightweight timers seem to be available.  We can turn the timing off
by default if they aren't light enough.
 


Performance Schema Timers
https://dev.mysql.com/doc/refman/8.0/en/performance-schema-timing.html


Regards
Takayuki Tsunakawa




Re: [Proposal] Add accumulated statistics

From
Adrien NAYRAT
Date:
On 1/7/19 6:34 AM, Tsunakawa, Takayuki wrote:
> 1. Doesn't provide precise data
> Sampling could miss intermittent short waits, e.g., buffer content lock waits during checkpoints.  This might make it
difficult or impossible to solve transient performance problems, such as infrequent 100 millisecond response times while
the normal response time is a few milliseconds.
 
> The proposed wait event collection doesn't miss anything.
> 
> 2. Overuses resources
> We may be able to shorten the sampling interval to 10 ms or even 1 ms to detect short periods of problems.  However,
the sampled data of active sessions become voluminous in memory and storage.  It would take longer to analyze those
samples.  Also, the background sampling process prevents the CPU core from becoming idle to save power, which bgwriter
and walwriter try to avoid by hibernation.
 
> The proposed wait event collection just records what actually happened.  No waste.  Would it use many resources if
waits happen frequently?  That leads to our motivation to reduce waits.
 

FIY, wait events have been added in PoWA by using pg_wait_sampling  
extension :  
https://rjuju.github.io/postgresql/2018/07/09/wait-events-support-for-powa.html

pg_wait_sampling sample the wait events in shared memory and PoWA store  
them.


RE: [Proposal] Add accumulated statistics

From
"Tsunakawa, Takayuki"
Date:
From: Adrien NAYRAT [mailto:adrien.nayrat@anayrat.info]
> FIY, wait events have been added in PoWA by using pg_wait_sampling
> extension :
> https://rjuju.github.io/postgresql/2018/07/09/wait-events-support-for-
> powa.html
> 
> pg_wait_sampling sample the wait events in shared memory and PoWA store
> them.

Great.  Also FYI, Amazon RDS/Aurora Performance Insights already provides a load profiling feature based on the wait
events collected from pg_stat_activity, which samples once a second.  An intuitive 5 minute video for introduction is
here:

Using Amazon RDS Performance Insights
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html

Although I cannot see the graphics as I'm almost blind, the explanation  sounds like it mimics Oracle Enterprise
Manager. Cool.
 

Aren't you thinking of incorporating your work into PostgreSQL as a contrib like pg_stat_statements?


Regards
Takayuki Tsunakawa





Re: [Proposal] Add accumulated statistics

From
Robert Haas
Date:
On Thu, Dec 20, 2018 at 8:48 PM Yotsunaga, Naoki
<yotsunaga.naoki@jp.fujitsu.com> wrote:
> If so, is not that the number of wait events is useful information?

My theory is that the number of wait events is NOT useful information,
or at least not nearly as useful the results of a sampling approach.
The data that LWLOCK_STATS produce are downright misleading -- they
lead you to think that the bottlenecks are in different places than
they really are, because the locks that produce the most waiting can
be 5th or 10th in terms of the number of wait events.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: [Proposal] Add accumulated statistics

From
"Yotsunaga, Naoki"
Date:
On Thu, Jan 10, 2019 at 8:42 PM, Robert Haas wrote:

Thanks for comments.

>or at least not nearly as useful the results of a sampling approach.
I agree with your opinion.
Because it can't be asserted that the wait event is a bottleneck just because the number of wait events is large.
The same thing is mentioned in the Oracle documentation.
It also suggests that it is important to acquire the waiting time for that.
----

https://docs.oracle.com/en/database/oracle/oracle-database/18/tgdba/measuring-database-performance.html#GUID-811E9E65-C64A-4028-A90E-102BBFF6E68F
5.2.3 Using Wait Events without Timed Statistics
----

>The data that LWLOCK_STATS produce are downright misleading 
Is that so?
I do not know the need for this function.

---
Naoki Yotsunaga

RE: [Proposal] Add accumulated statistics

From
"Tsunakawa, Takayuki"
Date:
From: Robert Haas [mailto:robertmhaas@gmail.com]
> My theory is that the number of wait events is NOT useful information,
> or at least not nearly as useful the results of a sampling approach.
> The data that LWLOCK_STATS produce are downright misleading -- they
> lead you to think that the bottlenecks are in different places than
> they really are, because the locks that produce the most waiting can
> be 5th or 10th in terms of the number of wait events.

I understood you're saying that the number of waits alone does not necessarily indicate the bottleneck, because a wait
with fewer counts but longer time can take a large portion of the entire SQL execution time.  So, wait time is also
useful. I think that's why Oracle describes and MySQL provides precise count and time without sampling.
 

Haven't LOCK_STATS been helpful for PG developers?  IIRC, it was used to pinpoint the bottleneck and evaluate the patch
to improve shared buffers, WAL buffers, ProcArray, etc.
 


Regards
Takayuki Tsunakawa





Re: [Proposal] Add accumulated statistics

From
Pavel Stehule
Date:


pá 11. 1. 2019 v 2:10 odesílatel Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> napsal:
From: Robert Haas [mailto:robertmhaas@gmail.com]
> My theory is that the number of wait events is NOT useful information,
> or at least not nearly as useful the results of a sampling approach.
> The data that LWLOCK_STATS produce are downright misleading -- they
> lead you to think that the bottlenecks are in different places than
> they really are, because the locks that produce the most waiting can
> be 5th or 10th in terms of the number of wait events.

I understood you're saying that the number of waits alone does not necessarily indicate the bottleneck, because a wait with fewer counts but longer time can take a large portion of the entire SQL execution time.  So, wait time is also useful.  I think that's why Oracle describes and MySQL provides precise count and time without sampling.

the cumulated lock statistics maybe doesn't help with debugging - but it is very good indicator of database (in production usage) health.

Regards

Pavel



RE: [Proposal] Add accumulated statistics

From
"Tsunakawa, Takayuki"
Date:
From: Pavel Stehule [mailto:pavel.stehule@gmail.com]
> the cumulated lock statistics maybe doesn't help with debugging - but it
> is very good indicator of database (in production usage) health.

I think it will help both.  But I don't think the sampling will be as helpful as the precise lock statistics
accumulation, because the sampling doesn't give us exactly how effective our improvements to PostgreSQL code are.  I
remember PG developers used LOCK_STATS to see how many (or ratio of) lwlock waits decreased by applying patches.
 


We can use the cumulated lock stats like:

1. SELECT * FROM pg_session_waits;
2. Run a benchmark.
3. SELECT * FROM pg_session_waits;
4. Calculate the difference between 1 and 3.

Or, reset the wait stats before the benchmark run and just use the stats as-is.
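For illustration, a rough sketch of that workflow in SQL (pg_session_waits is the view name used above, and the calls/times columns are only assumptions here):

  -- Sketch only: snapshot the cumulated wait stats around a benchmark run
  -- and compute the per-event delta (view and column names are assumed).
  CREATE TEMP TABLE waits_before AS SELECT * FROM pg_session_waits;
  -- ... run the benchmark ...
  CREATE TEMP TABLE waits_after  AS SELECT * FROM pg_session_waits;

  SELECT a.wait_event_type, a.wait_event,
         a.calls - b.calls AS calls_delta,
         a.times - b.times AS times_delta
  FROM waits_after a
  JOIN waits_before b USING (wait_event_type, wait_event)
  ORDER BY times_delta DESC;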

I'd like to know why you thought the cumulated wait stats isn't helpful for debugging.


Regards
Takayuki Tsunakawa



RE: [Proposal] Add accumulated statistics

From
"imai.yoshikazu@fujitsu.com"
Date:
Hi,

On Tue, Jan 15, 2019 at 1:14 AM, Tsunakawa, Takayuki wrote:

[ ... absent for a long time ]

I read the discussions of this thread.

If we want wait event info, we can currently do sampling from pg_stat_activity and get pseudo wait event total
duration.
(I understand wait event sampling does *not* represent wait event count but wait event duration.)
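(For example, a minimal sampling sketch against pg_stat_activity; the wait_samples table name is made up here, and the INSERT would be run periodically, say once per second:)

  -- Minimal sampling sketch: count backends per wait event at each sample.
  CREATE TABLE IF NOT EXISTS wait_samples
      (ts timestamptz, wait_event_type text, wait_event text, backends int);

  INSERT INTO wait_samples
  SELECT now(), wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
  WHERE wait_event IS NOT NULL
  GROUP BY 1, 2, 3;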

It is said that, in most cases, sampling is enough to find the cause of an issue. On the other hand, it is said that
sampling misses wait events that don't spend much time in the process. Can we say we don't need such wait events that
seem apparently not important?
 

Also, while we can know the pseudo total duration from the sampling, we can't know how many times wait events
occurred or how long each wait event took on average. I think it would be better to also know the wait event
average time to investigate the cause of the issue.
 

For example, the counts, total duration and average times of each wait event after executing pgbench, obtained from
a server with my patch applied, are like below.
 


transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100 
query mode: simple
number of clients: 8
number of threads: 8
duration: 60 s
number of transactions actually processed: 132259
latency average = 3.630 ms
tps = 2203.909727 (including connections establishing)
tps = 2204.248691 (excluding connections establishing)

 wait_event_type |      wait_event      | calls  |   times   | times/calls
-----------------+----------------------+--------+-----------+------------
 Client          | ClientRead           | 925732 | 115270950 |  124.52
 IO              | DataFileRead         | 220357 |   3219893 |   14.61
 LWLock          | WALWriteLock         | 195560 | 192684987 |  985.30
 IO              | DataFileWrite        | 156208 |   4440091 |   28.42
 IO              | WALWrite             |  51035 |   2925808 |   57.33
 IO              | WALSync              |  50812 |  49626240 |  976.66
 Lock            | transactionid        |   6179 |   9905727 | 1603.13
 LWLock          | buffer_content       |   5453 |    295123 |   54.12
 IO              | DataFileExtend       |   2367 |     90430 |   38.20
 LWLock          | wal_insert           |   2053 |     33620 |   16.38
 LWLock          | WALBufMappingLock    |   1301 |     23761 |   18.26
 LWLock          | CLogControlLock      |    514 |      3959 |    7.70
 LWLock          | buffer_mapping       |    162 |      1313 |    8.10
 LWLock          | XidGenLock           |    153 |      2399 |   15.68
 LWLock          | lock_manager         |    122 |      5051 |   41.40
 Lock            | tuple                |    105 |    114632 | 1091.73
 LWLock          | ProcArrayLock        |     99 |      1828 |   18.46
 LWLock          | CheckpointerCommLock |     33 |       854 |   25.88
 Lock            | extend               |     24 |     20493 |  853.88
 IO              | RelationMapRead      |     22 |        71 |    3.23
 LWLock          | buffer_io            |     11 |      7398 |  672.55
 IO              | SLRUWrite            |      4 |       249 |   62.25
 IO              | SLRUSync             |      4 |         0 |    0.00
 IPC             | ClogGroupUpdate      |      3 |       223 |   74.33
 LWLock          | SyncRepLock          |      1 |         1 |    1.00


We can see that WALWriteLock has the longest total duration (times), and ClientRead is the next one. We might get this
information from the sampling. We can also see that WALWriteLock's total duration is long because of its high average
time, while ClientRead's total duration is long because of its many calls. We can also see that transactionid, tuple,
WALWriteLock, WALSync and extend have a longer average time than the others. If we want to improve the performance even
for just a bit, such info is useful I think.
 

The patch which was proposed in this thread was not accepted because it adds overhead on the server/backend.
The functions that measure the wait event time, gettimeofday and clock_gettime, take time to execute, and since they
are on a frequently called path (pgstat_report_wait_start/end), a lot of overhead is expected to be introduced.
 

Even if we only count the wait events, there would be overhead, which is pointed out in another thread [1].

The overhead which is induced by getting wait event info was discussed from old times, but I couldn't find the actual
measuring results, so I want to measure its overhead.
 

[1] https://www.postgresql.org/message-id/CA%2BTgmobf1NJD%2B_DfQG5qccG5YFSnxk3CgC2mh0-UHabznCQtYA%40mail.gmail.com

--
Yoshikazu Imai 


RE: [Proposal] Add accumulated statistics

From
"imai.yoshikazu@fujitsu.com"
Date:
On Wed, Oct 30, 2019 at 5:51 AM, imai.yoshikazu@fujitsu.com wrote:
> The overhead which is induced by getting wait event info was discussed from old times, but I couldn't find the
actual
> measuring results, so I want to measure its overhead.

And here is the patch which counts the wait event and measuring the wait event time. It is currently like POC and has
several things to be improved.
 

You can get wait event info by executing "select * from pg_stat_waitaccum;" and reset its counters by "select
pg_stat_reset_shared('waitaccum');".
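As a sketch, the intended usage is a reset followed by a query after the workload, ordered by total time (the column names follow the example output below; the times appear to be in microseconds):

  -- Sketch of the intended workflow with the patch applied.
  SELECT pg_stat_reset_shared('waitaccum');
  -- ... run the workload ...
  SELECT wait_event_type, wait_event, calls, times,
         round(times::numeric / NULLIF(calls, 0), 2) AS times_per_call
  FROM pg_stat_waitaccum
  ORDER BY times DESC;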

I tried to reduce the overhead of counting the wait events. Since it is difficult to reduce the overhead of measuring
wait event times, I made time measurement configurable, like track_io_timing.
 

In another DB, they use a lighter function than gettimeofday or clock_gettime: the CPU cycle counter,
rdtsc.
So I also created the patch which uses rdtsc for measuring wait event times.
There were some investigations of rdtsc before [1]. If we want to use rdtsc, we need more investigation into which
platforms can use it, how to prevent time from going backwards with rdtsc, and so on.
 
Here, I wanted to see its overhead, so I didn't care such things in this patch.

I measured its performance with pgbench on a VM machine which has 4 core CPU and 8 GB Mem.
I tested against below four cases.

    master:      master (2fc2a88)

    only counts: 0001 patched, only counts wait events
                 (with track_wait_timing is off)
    counts/time: 0001 patched, counts wait event and measure its time by gettimeofday or clock_gettime
                 (with track_wait_timing is on)
    counts/time(rdtsc): 0001 + 0002 patched, counts wait event and measure its time by rdtsc
                 (with track_wait_timing is on)

I executed the below pgbench scripts, initializing database with scale 1 or 100 and executing pgbench with standard
mode or select-only mode.
 

    [standard mode]
    for i in `seq 1 15`
    do  
    pgbench -i -s (1 or 100) -q
    pgbench -c 8 -j 8 -n -T 60
    done

    [select only mode]
    pgbench -i -s (1 or 100) -q
    for i in `seq 1 10`
    do  
    pgbench -c 8 -j 8 -n -S -T 60
    done

The result was below.

[standard, scale 1]
      version      |   TPS  | diff(%)
            master | 813.82 |       0
       only counts | 797.48 |    2.01
       counts/time | 833.16 |   -2.38
counts/time(rdtsc) | 876.29 |   -7.68

[standard, scale 100]
      version      |   TPS   | diff(%)
            master | 2170.34 |       0
       only counts | 2125.37 |    2.07
       counts/time | 2147.8  |    1.04
counts/time(rdtsc) | 2187.37 |  -0.785

[select-only, scale 1]
      version      |   TPS   | diff(%)
            master | 28487.6 |       0
       only counts | 28481.1 |   0.023
       counts/time | 28364.7 |   0.431
counts/time(rdtsc) | 28462.6 |   0.088

[select-only, scale 100]
      version      |   TPS    | diff(%)
            master | 25767.89 |       0
       only counts | 26068.65 |  -1.167
       counts/time | 25567.69 |   0.777
counts/time(rdtsc) | 25525.26 |   0.942


An example of wait event info after executing pgbench was below.

[standard, scale 100]
number of transactions actually processed: 129844
latency average = 3.697 ms
tps = 2163.667427 (including connections establishing)
tps = 2163.918398 (excluding connections establishing)

 wait_event_type |      wait_event      | calls  |   times   
-----------------+----------------------+--------+-----------
 Client          | ClientRead           | 908807 | 114473878
 IO              | DataFileRead         | 216025 |   2867211
 LWLock          | WALWriteLock         | 191977 | 195192237
 IO              | DataFileWrite        | 154540 |   3406232
 IO              | WALWrite             |  49932 |   2728543
 IO              | WALSync              |  49737 |  49649308
 Lock            | transactionid        |   6209 |   8999545
 LWLock          | buffer_content       |   5337 |    288951
 IO              | DataFileExtend       |   2346 |     90375
 LWLock          | wal_insert           |   2013 |     25141
 LWLock          | WALBufMappingLock    |    630 |     14680
 LWLock          | CLogControlLock      |    454 |      2414
 LWLock          | buffer_mapping       |    170 |       852
 LWLock          | XidGenLock           |    146 |      3268
 LWLock          | lock_manager         |    141 |      5209
 Lock            | tuple                |    112 |    120163
 LWLock          | ProcArrayLock        |     97 |       495
 Lock            | extend               |     42 |     26875
 IO              | RelationMapRead      |     22 |        57
 LWLock          | CheckpointerCommLock |     18 |      1217
 IO              | DataFilePrefetch     |     18 |        24
 IPC             | ClogGroupUpdate      |      9 |      1867
 LWLock          | SyncRepLock          |      3 |       116
 IO              | DataFileTruncate     |      1 |       116


[select-only, scale 1]
number of transactions actually processed: 1682642
latency average = 0.285 ms
tps = 28043.407989 (including connections establishing)
tps = 28048.158085 (excluding connections establishing)
 wait_event_type |   wait_event    |  calls  |   times   
-----------------+-----------------+---------+-----------
 Client          | ClientRead      | 1682661 | 287999638
 IO              | RelationMapRead |      22 |        54
 LWLock          | lock_manager    |       2 |      2087


[select-only, scale 100]
number of transactions actually processed: 1513536
latency average = 0.317 ms
tps = 25223.558569 (including connections establishing)
tps = 25228.820644 (excluding connections establishing)
 wait_event_type |   wait_event    |  calls  |   times   
-----------------+-----------------+---------+-----------
 IO              | DataFileRead    | 2524682 |  14579531
 Client          | ClientRead      | 1513558 | 283968554
 LWLock          | buffer_mapping  |    1225 |      6392
 IO              | RelationMapRead |      22 |        46
 LWLock          | buffer_io       |      11 |       876
 LWLock          | lock_manager    |       6 |       507


I wanted to measure and compare the performance where wait events occur many times, but unfortunately, TPS was not
stable and differed between runs of the standard test (where wait events occur many times), so I couldn't get a
consistent performance tendency between the versions.
 

I need to find a more suitable test for clarifying its performance, so if there are any good tests, please let me know.
Also, any tests are very welcome.
 


--
Yoshikazu Imai


Attachment

Re: [Proposal] Add accumulated statistics

From
Pavel Stehule
Date:


út 15. 1. 2019 v 2:14 odesílatel Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> napsal:
From: Pavel Stehule [mailto:pavel.stehule@gmail.com]
> the cumulated lock statistics maybe doesn't help with debugging - but it
> is very good indicator of database (in production usage) health.

I think it will help both.  But I don't think the sampling will be as helpful as the precise lock statistics accumulation, because the sampling doesn't give us exactly how effective our improvements to PostgreSQL code are.  I remember PG developers used LOCK_STATS to see how many (or ratio of) lwlock waits decreased by applying patches.


We can use the cumulated lock stats like:

1. SELECT * FROM pg_session_waits;
2. Run a benchmark.
3. SELECT * FROM pg_session_waits;
4. Calculate the difference between 1 and 3.

Or, reset the wait stats before the benchmark run and just use the stats as-is.

I'd like to know why you thought the cumulated wait stats isn't helpful for debugging.

I don't remember my thoughts, and maybe I used wrong sentences - usually lock times are very small, and very unstable if you have too detailed a level. But if you use aggregated values per some longer time window, then these values can be stable and very interesting. More - usually lock time has a correlation with database (application) health.

Like you, I don't think sampled values are too helpful.

Regards

Pavel



Regards
Takayuki Tsunakawa


Re: [Proposal] Add accumulated statistics

From
Michael Paquier
Date:
On Wed, Oct 30, 2019 at 05:55:28AM +0000, imai.yoshikazu@fujitsu.com wrote:
> And here is the patch which counts the wait event and measuring the wait event time. It is currently like POC and has
several things to be improved.

Please note the patch tester complains about the latest patch:
pgstatfuncs.c: In function ‘pg_stat_get_waitaccum’:
pgstatfuncs.c:2018:3: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
Datum  values[PG_STAT_GET_WAITACCUM_COLS];

I am moving it to next CF, marking it as waiting on author.
--
Michael

Attachment

RE: [Proposal] Add accumulated statistics

From
"imai.yoshikazu@fujitsu.com"
Date:
On Sun, Dec 1, 2019 at 1:10 AM, Michael Paquier wrote:
> On Wed, Oct 30, 2019 at 05:55:28AM +0000, imai(dot)yoshikazu(at)fujitsu(dot)com wrote:
> > And here is the patch which counts the wait event and measuring the wait event time. It is currently like POC and
has several things to be improved.
 
> 
> Please note the patch tester complains about the latest patch:
> pgstatfuncs.c: In function ‘pg_stat_get_waitaccum’:
> pgstatfuncs.c:2018:3: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
> Datum  values[PG_STAT_GET_WAITACCUM_COLS]; 
> 
> I am moving it to next CF, marking it as waiting on author.

Sorry for late reply.

Unfortunately, I couldn't get your original mail even when I used "Resend email"
on the pgsql-hackers archive, so I reply to my old mail.

I attach the patch I solved the warning.

--
Yoshikazu Imai

Attachment

Re: [Proposal] Add accumulated statistics for wait event

From
Pavel Stehule
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            tested, passed

I like this patch, because I used similar functionality some years ago very successfully. The implementation is almost
simple, and the result should be valid by the method used.
 

The potential problem is performance impact. Very early tests show an impact of cca 3% in the worst case, but I'll try
to repeat these tests.
 

There are some ending whitespaces and useless tabs.

The new status of this patch is: Waiting on Author

Re: [Proposal] Add accumulated statistics for wait event

From
Imai Yoshikazu
Date:
On 2020/01/13 4:11, Pavel Stehule wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           not tested
> Documentation:            tested, passed
> 
> I like this patch, because I used similar functionality some years ago very successfully. The implementation is
almost simple, and the result should be valid by the method used.
 

Thanks for your review!

> The potential problem is performance impact. Very early test show impact cca 3% worst case, but I'll try to repeat
these tests.
 

Yes, performance impact is the main concern. I want to know how it 
affects performance in various test cases or on various environments.

> There are some ending whitespaces and useless tabs.
> 
> The new status of this patch is: Waiting on Author
I attach v4 patches removing those extra whitespaces of the end of lines 
and useless tabs.

--
Yoshikazu Imai

Attachment

Re: [Proposal] Add accumulated statistics for wait event

From
Pavel Stehule
Date:
Hi

st 15. 1. 2020 v 14:15 odesílatel Imai Yoshikazu <yoshikazu_i443@live.jp> napsal:
On 2020/01/13 4:11, Pavel Stehule wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           not tested
> Documentation:            tested, passed
>
> I like this patch, because I used similar functionality some years ago very successfully. The implementation is almost simple, and the result should be valid by used method.

Thanks for your review!

> The potential problem is performance impact. Very early test show impact cca 3% worst case, but I'll try to repeat these tests.

Yes, performance impact is the main concern. I want to know how it
affects performance in various test cases or on various environments.

> There are some ending whitespaces and useless tabs.
>
> The new status of this patch is: Waiting on Author
I attach v4 patches removing those extra whitespaces of the end of lines
and useless tabs.

today I ran 120 5-minute pgbench tests to measure the impact of this patch. The result is attached.

test

PSQL="/usr/local/pgsql/bin/psql"
PGBENCH="/usr/local/pgsql/bin/pgbench"
export PGDATABASE=postgres

echo "******* START *******" > ~/result.txt

for i in 1 5 10 50 100
do
  echo "scale factor $i" >> ~/result.txt

  $PSQL -c "create database bench$i"
  $PGBENCH -i -s $i "bench$i"

  for c in 1 5 10 50
  do
    $PGBENCH -c $c -T 300 "bench$i" >> ~/result.txt
  done

  $PSQL -c "vacuum full" "bench$i"
  $PSQL -c "vacuum analyze" "bench$i"

  for c in 1 5 10 50
  do
    $PGBENCH -S -c $c -T 300 "bench$i" >> ~/result.txt
  done

  $PSQL -c "drop database bench$i"
done

Tested on computer with 4CPU, 8GB RAM - configuration: shared buffers 2GB, work mem 20MB

The result is interesting - when I ran pgbench in R/W mode, I got +/- 1% changes in performance. It isn't important whether tracking time is active or not (tested on Linux). In this mode the new code is not on the critical path.

More interesting results are from the read only tests (some higher differences are visible there)

for scale 5 and 50 users - the tracking time increases performance by about 12% (same result I got for scale/users 10/50); in the other direction, the patched build decreases performance by about 10% for 50/50 (with and without tracking time) and 100/5

It looks like for scales higher than 5 and with 50 users the results are not very stable (for read only tests - I repeated the tests) and the overhead there is about 10%, from 60K tps to 55K tps - maybe I hit hw limits (it is running with 4 CPUs)

Thanks to Tomas Vondra and 2ndq for hw for testing

Regards

Pavel

 wait_event_type |      wait_event       |   calls    |    times    
-----------------+-----------------------+------------+--------------
 Client          | ClientRead            | 1489681408 | 221616362961
 Lock            | transactionid         |  103113369 |  71918794185
 LWLock          | WALWriteLock          |  104781468 |  20865855903
 Lock            | tuple                 |   21323744 |  15800875242
 IO              | WALSync               |   50862170 |   8666988491
 LWLock          | lock_manager          |   18415423 |    575308266
 IO              | WALWrite              |   51482764 |    205775892
 LWLock          | buffer_content        |   15385387 |    168446128
 LWLock          | wal_insert            |    1502092 |     90019731
 IPC             | ProcArrayGroupUpdate  |     178238 |     46527364
 LWLock          | ProcArrayLock         |     587356 |     13298246
 IO              | DataFileExtend        |    2715557 |     11615216
 IPC             | ClogGroupUpdate       |      54319 |     10622013
 IO              | DataFileRead          |    5805298 |      9596545
 IO              | SLRURead              |    9518930 |      7166217
 LWLock          | CLogControlLock       |     746759 |      6783602






--
Yoshikazu Imai
Attachment

Re: [Proposal] Add accumulated statistics for wait event

From
Tomas Vondra
Date:
This patch was in WoA, but that was wrong I think - we got a patch on
January 15, followed by a benchmark by Pavel Stehule, so I think it
should still be in "needs review". So I've updated it and moved it to
the next CF.


regards


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [Proposal] Add accumulated statistics for wait event

From
Pavel Stehule
Date:


so 1. 2. 2020 v 12:34 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> napsal:
This patch was in WoA, but that was wrong I think - we got a patch on
January 15, followed by a benchmark by Pavel Stehule, so I think it
should still be in "needs review". So I've updated it and moved it to
the next CF.

currently this patch needs a rebase

Pavel



regards


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

RE: [Proposal] Add accumulated statistics for wait event

From
"imai.yoshikazu@fujitsu.com"
Date:
On Sat, Feb 1, 2020 at 5:50 AM, Pavel Stehule wrote:
> today I run 120 5minutes pgbench tests to measure impact of this patch. Result is attached.
...
> Thanks to Tomas Vondra and 2ndq for hw for testing

Thank you for doing a lot of these benchmarks!

> The result is interesting - when I run pgbench in R/W mode, then I got +/- 1% changes in performance. Isn't important
if
> tracking time is active or not (tested on Linux). In this mode the new code is not on critical path.

It seems performance difference is big in case of read only tests. The reason is that write time is relatively longer
than the processing time of the logic I added in the patch.

> Looks so for higher scale than 5 and higher number of users 50 the results are not too much stable (for read only
tests - I
 
> repeated tests) and there overhead is about 10% from 60K tps to 55Ktps - maybe I hit a hw limits (it running with
4CPU)

Yes, I suspect some other bottlenecks may have happened and caused the unstable results. However, it may be better to
investigate what actually happened and why performance increased/decreased by over 10%. I will inspect it.


Also I attach v5 patches which corresponds to other committed patches.

--
Yoshikazu Imai

Attachment

Re: [Proposal] Add accumulated statistics for wait event

From
Craig Ringer
Date:
On Wed, 12 Feb 2020 at 12:36, imai.yoshikazu@fujitsu.com
<imai.yoshikazu@fujitsu.com> wrote:

> It seems performance difference is big in case of read only tests. The reason is that write time is relatively longer
than the
 
> processing time of the logic I added in the patch.

That's going to be a pretty difficult performance hit to justify.

Can we buffer collected wait events locally and spit the buffer to the
stats collector at convenient moments? We can use a limited buffer
size with an overflow flag, so we degrade the results rather than
falling over or forcing excessive stats reporting at inappropriate
times.

I suggest that this is also a good opportunity to add some more
tracepoints to PostgreSQL. The wait events facilities are not very
traceable right now. Exposing some better TRACE_POSTGRESQL_
tracepoints (SDTs) via probes.d would help users collect better
information using external tools like perf, bpftrace and systemtap.
That way we have a zero-overhead-when-unused option that can also be
used to aggregate the information per-query, per-user, etc.

(I really need to add a bunch more tracepoints to make this easier...)

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise



RE: [Proposal] Add accumulated statistics for wait event

From
"imai.yoshikazu@fujitsu.com"
Date:
On Wed, Feb 12, 2020 at 5:42 AM, Craig Ringer wrote:
> > It seems performance difference is big in case of read only tests. The reason is that write time is relatively
longer than the
 
> > processing time of the logic I added in the patch.
> 
> That's going to be a pretty difficult performance hit to justify.
> 
> Can we buffer collected wait events locally and spit the buffer to the
> stats collector at convenient moments? We can use a limited buffer
> size with an overflow flag, so we degrade the results rather than
> falling over or forcing excessive stats reporting at inappropriate
times.

IIUC, currently each backend collects wait events locally. When each
backend goes to idle (telling the frontend that it is ready-for-query), it
reports wait event statistics to the stats collector. The interval of each
report is over than PGSTAT_STAT_INTERVAL(default 500ms). Also when
each backend exits, it also does report.

So if we do the read only test with 50 clients, each of the 50 backends reports
wait event statistics to the stats collector almost every 500ms. If
that causes performance degradation, we can improve performance by
letting backends report their statistics, for example, only at the
backend's exit.

(I think I can easily test this by building postgres with setting
PGSTAT_STAT_INTERVAL to a large value >> 500ms.)

> I suggest that this is also a good opportunity to add some more
> tracepoints to PostgreSQL. The wait events facilities are not very
> traceable right now.

Does that mean we will add TRACE_POSTGRESQL_ to every before/after
pgstat_report_wait_start?

> That way we have a zero-overhead-when-unused option that can also be
> used to aggregate the information per-query, per-user, etc.

I see. In that way, we can accomplish no overhead when DTrace is not
enabled and what we can measure is more customizable.

It is also curious what the overhead would be if we implemented wait event
statistics with DTrace scripts, though I can't imagine how it would be because
I haven't used DTrace.


--
Yoshikazu Imai


RE: [Proposal] Add accumulated statistics for wait event

From
"imai.yoshikazu@fujitsu.com"
Date:
On Fri, Feb 14, 2020 at 11:59 AM, 王胜利 wrote: 
>        I am glad to know you are working on PG accumulated statistics feature, and I am interested on it.
>        I see these two patch file you made, can you let me know which branch of PG code based?
>
>        when I use this:  https://github.com/postgres/postgres/commits/master, and apply these patches, report some
error.

Thanks for Wang's mail, I noticed my 0002 patch was wrong from v3.

Here, I attach correct patches.

Also I will begin to do some benchmarks with higher scale and a higher number of
users and try to change the stats reporting implementation to not affect
performance, which I couldn't start because of other tasks.

--
Yoshikazu Imai

Attachment

Re: [Proposal] Add accumulated statistics for wait event

From
王胜利
Date:
Thank you for this update! I will try it.

Best regards,
Victor wang

(Signature customized by NetEase Mail Master)

On 02/25/2020 15:53, imai.yoshikazu@fujitsu.com wrote:
On Fri, Feb 14, 2020 at 11:59 AM, 王胜利 wrote:
>        I am glad to know you are working on PG accumulated statistics feature, and I am interested on it.
>        I see these two patch file you made, can you let me know which branch of PG code based?
>
>        when I use this:  https://github.com/postgres/postgres/commits/master, and apply these patches, report some error.

Thanks for Wang's mail, I noticed my 0002 patch was wrong from v3.

Here, I attach correct patches.

Also I will begin to do some benchmarks with higher scale and a higher number of
users and try to change the stats reporting implementation to not affect
performance, which I couldn't start because of other tasks.

--
Yoshikazu Imai

Re: [Proposal] Add accumulated statistics for wait event

From
Kyotaro Horiguchi
Date:
Hello.  I had a brief look on this and have some comments on this.

At Tue, 25 Feb 2020 07:53:26 +0000, "imai.yoshikazu@fujitsu.com" <imai.yoshikazu@fujitsu.com> wrote in 
> Thanks for Wang's mail, I noticed my 0002 patch was wrong from v3.
> 
> Here, I attach correct patches.
> 
> Also I will begin to do some benchmark with higher scale and higher number of
> users and try to change stats reporting implementation to not affect
> performance, which I couldn't have not started because of another tasks.

It uses its own hash implement. Aside from the appropriateness of
having another implement of existing tool, in the first place hash
works well for wide, sparse and uncertain set of keys. But they are in
rather a dense and narrow set with certain and fixed members. It
registers more than 200 entries but bucket size is 461. It would be
needed to avoid collisions, but seems a bit too wasteful.

It seems trying to avoid doing needless work when the feature is not
activated by checking "if (wa_hash==NULL)", but the hash is created
being filled with possible entries regardless of whether
track_wait_timing is on or off. As the result
pgstat_report_waitaccum_end calls pgstat_get_wa_entry tremendously
frequently. This should be the reason for the recent benchmark result.
I'm not sure such frequency of hash-searching is acceptable even if
the feature is turned on.

I think we need a smarter mapping scheme of events to entries.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: [Proposal] Add accumulated statistics for wait event

From
"imai.yoshikazu@fujitsu.com"
Date:
On Wed, Feb 26, 2020 at 1:39 AM, Kyotaro Horiguchi wrote:
> Hello.  I had a brief look on this and have some comments on this.

Hi, Horiguchi-san. Thank you for looking at this!

> It uses its own hash implement. Aside from the appropriateness of
> having another implement of existing tool, in the first place hash
> works well for wide, sparse and uncertain set of keys. But they are in
> rather a dense and narrow set with certain and fixed members. It
> registers more than 200 entries but bucket size is 461. It would be
> needed to avoid collisions, but seems a bit too wasteful.

Yes, wait events are grouped and wait events ID are defined as a sequential
number, starting from specified number for each group(like 0x01000000U), thus
keys are in a dense and narrow set.

=====
#define PG_WAIT_LWLOCK              0x01000000U
#define PG_WAIT_LOCK                0x03000000U
#define PG_WAIT_BUFFER_PIN          0x04000000U
#define PG_WAIT_ACTIVITY            0x05000000U
...

typedef enum
{
    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
    WAIT_EVENT_AUTOVACUUM_MAIN,
    ...
    WAIT_EVENT_WAL_WRITER_MAIN
} WaitEventActivity;
=====

The number 461 is the lowest number which avoids collisions in the hash for
current wait events set. As you pointed out, there are a bit too much unused
entries.


If we prepare arrays for each wait classes with appropriate size which just
fits the number of each wait event class, we can store wait event statistics
into those arrays with no more wastes.
Also we calculate hash number by "(wait event ID) % (bucket size)" in hash
case, while we calculate index of arrays by "(wait event ID) - (wait event class id)"
in array case. The latter one's cost is lower than the former one.

Currently I implement wait event statistics store by hash because of its
easiness of implementation, but I think it is good to implement it by array
for above reasons. One concern is that we put restrictions on wait events
that it must be defined as it is sequenced in its wait class so that there
are no unused entries in the array.


> It seems trying to avoid doing needless work when the feature is not
> activated by checking "if (wa_hash==NULL)", but the hash is created
> being filled with possible entries regardless of whether
> track_wait_timing is on or off.

This might be bad implementation but there are the cases "wa_hash = NULL"
where pgstat_init() is not called like when executing "initdb". I insert
that check for avoiding unexpected crash in those cases.

I also noticed debug codes exist around that code...I will modify it.

> As the result
> pgstat_report_waitaccum_end calls pgstat_get_wa_entry tremendously
> frequently. This should be the reason for the recent benchmark result.
> I'm not sure such frequency of hash-searching is acceptable even if
> the feature is turned on.

track_wait_timing parameter determines whether we collect wait time.
Regardless of that parameter, we always collect wait count every wait event
happens. I think calling pgstat_get_wa_entry frequently is not critical to
performance. From pavel's benchmark results, if track_wait_timing parameter
is off, there are +/-1.0% performance difference between patched and unpatched
and it is just considered as a measurement error.


Thanks
--
Yoshikazu Imai



Re: [Proposal] Add accumulated statistics for wait event

From
Atsushi Torikoshi
Date:
Hi Imai-san,

I feel your 'pg_stat_waitaccum' will help us investigate the bottleneck.
So I'd like to do some benchmarks but unfortunately, the latest v6 patch
couldn't be applied to HEAD anymore.

Is it possible to share the latest patches?
If not, I'll make v6 applicable to the HEAD.

Regards,

--
Atsushi Torikoshi

On Fri, Feb 28, 2020 at 5:17 PM imai.yoshikazu@fujitsu.com <imai.yoshikazu@fujitsu.com> wrote:
On Wed, Feb 26, 2020 at 1:39 AM, Kyotaro Horiguchi wrote:
> Hello.  I had a brief look on this and have some comments on this.

Hi, Horiguchi-san. Thank you for looking at this!

> It uses its own hash implement. Aside from the appropriateness of
> having another implement of existing tool, in the first place hash
> works well for wide, sparse and uncertain set of keys. But they are in
> rather a dense and narrow set with certain and fixed members. It
> registers more than 200 entries but bucket size is 461. It would be
> needed to avoid collisions, but seems a bit too wasteful.

Yes, wait events are grouped and wait events ID are defined as a sequential
number, starting from specified number for each group(like 0x01000000U), thus
keys are in a dense and narrow set.

=====
#define PG_WAIT_LWLOCK              0x01000000U
#define PG_WAIT_LOCK                0x03000000U
#define PG_WAIT_BUFFER_PIN          0x04000000U
#define PG_WAIT_ACTIVITY            0x05000000U
...

typedef enum
{
    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
    WAIT_EVENT_AUTOVACUUM_MAIN,
    ...
    WAIT_EVENT_WAL_WRITER_MAIN
} WaitEventActivity;
=====

The number 461 is the lowest number which avoids collisions in the hash for
current wait events set. As you pointed out, there are a bit too much unused
entries.


If we prepare arrays for each wait classes with appropriate size which just
fits the number of each wait event class, we can store wait event statistics
into those arrays with no more wastes.
Also we calculate hash number by "(wait event ID) % (bucket size)" in hash
case, while we calculate index of arrays by "(wait event ID) - (wait event class id)"
in array case. The latter one's cost is lower than the former one.

Currently I implement wait event statistics store by hash because of its
easiness of implementation, but I think it is good to implement it by array
for above reasons. One concern is that we put restrictions on wait events
that it must be defined as it is sequenced in its wait class so that there
are no unused entries in the array.


> It seems trying to avoid doing needless work when the feature is not
> activated by checking "if (wa_hash==NULL)", but the hash is created
> being filled with possible entries regardless of whether
> track_wait_timing is on or off.

This might be bad implementation but there are the cases "wa_hash = NULL"
where pgstat_init() is not called like when executing "initdb". I insert
that check for avoiding unexpected crash in those cases.

I also noticed debug codes exist around that code...I will modify it.

> As the result
> pgstat_report_waitaccum_end calls pgstat_get_wa_entry tremendously
> frequently. This should be the reason for the recent benchmark result.
> I'm not sure such frequency of hash-searching is acceptable even if
> the feature is turned on.

track_wait_timing parameter determines whether we collect wait time.
Regardless of that parameter, we always collect wait count every wait event
happens. I think calling pgstat_get_wa_entry frequently is not critical to
performance. From pavel's benchmark results, if track_wait_timing parameter
is off, there are +/-1.0% performance difference between patched and unpatched
and it is just considered as a measurement error.


Thanks
--
Yoshikazu Imai


Re: [Proposal] Add accumulated statistics for wait event

From
Daniel Gustafsson
Date:
Hi,

This patch fails to apply to HEAD, please submit a rebased version.  I've
marked this as as Waiting on Author.

cheers ./daniel



RE: [Proposal] Add accumulated statistics for wait event

From
"imai.yoshikazu@fujitsu.com"
Date:
> This patch fails to apply to HEAD, please submit a rebased version.  I've
> marked this as as Waiting on Author.

Sorry for my absence. Unfortunately I couldn't have time to work on this patch in this cf.
I believe I will be back in next cf, work on this patch and also review other patches.

---
Yoshikazu IMAI



Re: [Proposal] Add accumulated statistics for wait event

From
Daniel Gustafsson
Date:
> On 31 Jul 2020, at 07:23, imai.yoshikazu@fujitsu.com wrote:
>
>> This patch fails to apply to HEAD, please submit a rebased version.  I've
>> marked this as as Waiting on Author.
>
> Sorry for my absence. Unfortunately I couldn't have time to work on this patch in this cf.
> I believe I will be back in next cf, work on this patch and also review other patches.

No worries, it happens.  Since the thread has stalled and there is no updated
patch I've marked this entry Returned with Feedback.  Please feel free to
re-open a new CF entry if you return to this patch.

cheers ./daniel


Re: [Proposal] Add accumulated statistics for wait event

From
Jehan-Guillaume de Rorthais
Date:
Hi All,

I faced a few times a situation where a long running query is actually
including the time the backend is waiting for the frontend to fetch all the
rows (see [1] for details). See a sample code fe-time.c and its comments in
attachment to reproduce this behavior.

There's no simple way today to pinpoint the problem in production without
advance interactive auditing, and/or using system tools. After studying the
problem, I believe it boil down to track the wait event ClientWrite, so I ended
up on this thread.

You might catch some mismatching times thanks to auto_explain as well. Using
the fe-time.c demo code with the following command:

  PGDATABASE=postgres PGHOST=::1 time ./fe-time 100

The frontend time is 10s, the query time reported is 3228.631ms, but last row
has been produced after 20.672ms:

  LOG:  duration: 3228.631 ms  plan:
    Query Text: SELECT * FROM pgbench_accounts
    Seq Scan on pgbench_accounts (time=0.005..20.672 rows=100000 loops=1)

(Note that in contrast with localhost, through the unix socket, the backend
reported query time is always really close to 10s).

I re-based the existing patch (see in attachment), to look at the ClientWrite
for this exact query:

  # SELECT wait_event, calls, times
    FROM pg_stat_get_waitaccum(NULL)
    WHERE wait_event = 'ClientWrite';

   wait_event  | calls |  times
  -------------+-------+---------
   ClientWrite |     4 | 3132266

The "time" is expressed as µs in the patch, so 3132.266ms of the total
3228.631ms is spent sending the result to the frontend. I'm not sure where
the missing 75ms are.
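For what it's worth, the same function can also express that as a share of the 3228.631 ms reported by auto_explain (just a convenience query over the output shown above; times are in µs):

  SELECT wait_event, calls, times,
         round((times / 1000.0)::numeric, 3)          AS times_ms,
         round((100.0 * times / 3228631)::numeric, 1) AS pct_of_query
  FROM pg_stat_get_waitaccum(NULL)
  WHERE wait_event = 'ClientWrite';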

The pg_wait_sampling extension might help but it requires a production restart
to install, then enable it. Whether the solution is sampling or cumulative,
an in-core and hot-switchable solution would be much more convenient. But
anyway, looking at pg_wait_sampling, we have a clear suspect as well for the
later query run:

  # SELECT event, count
    FROM pg_wait_sampling_profile
    WHERE queryid=4045741516911800313;

      event    | count
  -------------+-------
   ClientWrite |   309

The default profile period of pg_wait_sampling being 10ms, we can roughly
estimate the ClientWrite around 3090ms. Note that this is close enough because
we know 3132266µs has been accumulated among only 4 large wait events.
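A quick way to turn that profile into the rough time estimate above (each sample accounts for about one 10 ms profile period):

  SELECT event, count, count * 10 AS approx_wait_ms
  FROM pg_wait_sampling_profile
  WHERE queryid = 4045741516911800313;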

Finishing below.

On Mon, 3 Aug 2020 00:00:40 +0200
Daniel Gustafsson <daniel@yesql.se> wrote:

> > On 31 Jul 2020, at 07:23, imai.yoshikazu@fujitsu.com wrote:
> >
> >> This patch fails to apply to HEAD, please submit a rebased version.  I've
> >> marked this as as Waiting on Author.

Please, find in attachment a rebase of both patches. I did some small editing
on the way. I didn't bench them.

I'm not sure this patch is the best approach though. Receive it as a
motivation to keep up with this discussion. As I wrote, whether the
solution is sampling or cumulative, an in-core and hot-switchable solution
would be much more convenient. The fact is that this patch was already
available and ready to keep up with a discussion.

Collecting and summing all wait events from all backends in the same place
forbids tracking wait events precisely for a specific backend, especially on a
busy system where numbers can quickly be buried by all the other activity around.

I wonder if wait events should only be accumulated on the backend side, making it
possible to enable/disable them on the fly and to collect some reports, e.g. in
logs or to some output. Most of the code from these patches could be recycled in a
simpler patch implementing this.

Thoughts?

> > Sorry for my absence. Unfortunately I couldn't have time to work on this
> > patch in this cf. I believe I will be back in next cf, work on this patch
> > and also review other patches.
>
> No worries, it happens.  Since the thread has stalled and there is no updated
> patch I've marked this entry Returned with Feedback.  Please feel free to
> re-open a new CF entry if you return to this patch.

I volunteer to be a reviewer on this patch.

Imai-san, do you agree to add it as new CF entry?

Regards,

[1]:  The last time I had such a situation was a few weeks ago. A customer was
reporting a query being randomly slow, running below 100ms most of the time
and sometimes hitting 28s. Long story short, the number of rows was the same
(10-15k), but the result set size was 9x bigger (1MB vs 9MB). As the same query
was running fine from psql, I suspected the frontend was somehow saturated.
Tcpdump helped me to compute that the throughput fell to 256kB/s after the
first 2MB of data transfer with a very narrow TCP window. I explained to the
customer that their app probably doesn't pull the rows fast enough and that
some buffers were probably saturated on the frontend side, waiting for
the app and slowing down the whole transfer.
Devels fixed the problem by moving two field transformations (unaccent) away
from their loop fetching the rows.


Attachment

Re: [Proposal] Add accumulated statistics for wait event

From
Andres Freund
Date:
Hi,

On 2021-06-05 00:53:44 +0200, Jehan-Guillaume de Rorthais wrote:
> From 88c2779679c5c9625ca5348eec0543daab5ccab4 Mon Sep 17 00:00:00 2001
> From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
> Date: Tue, 1 Jun 2021 13:25:57 +0200
> Subject: [PATCH 1/2] Add pg_stat_waitaccum view.
> 
> pg_stat_waitaccum shows counts and duration of each wait events.
> Each backend/backgrounds counts and measures the time of wait event
> in every pgstat_report_wait_start and pgstat_report_wait_end. They
> store those info into their local variables and send to Statistics
> Collector. We can get those info via Statistics Collector.
> 
> For reducing overhead, I implemented statistic hash instead of
> dynamic hash. I also implemented track_wait_timing which
> determines wait event duration is collected or not.

I object to adding this overhead. The whole selling point for wait
events was that they are low overhead. I since spent time reducing the
overhead further, because even just the branches for checking if
track_activity is enabled are measurable (225a22b19ed).


> From ddb1adc5cd9acc9bc9de16d0cf057124b09fe1e3 Mon Sep 17 00:00:00 2001
> From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
> Date: Fri, 4 Jun 2021 18:14:51 +0200
> Subject: [PATCH 2/2] [POC] Change measuring method of wait event time from
>  INSTR_TIME to rdtsc.
> 
> This patch changes measuring method of wait event time from INSTR_TIME (which
> uses gettimeofday or clock_gettime) to rdtsc. This might reduce the overhead
> of measuring overhead.
> 
> Any supports like changing clock cycle to actual time or error handling are
> not currently implemented.

rdtsc is a serializing (*) instruction - that's the expensive part. On linux
clock_gettime() doesn't actually need a syscall. While the vdso call
implies a bit of overhead over a raw rdtsc, it's a relatively small part
of the overhead.  See
https://www.postgresql.org/message-id/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de

Greetings,

Andres Freund

(*) it's not fully serializing, iirc it allows later instructions to be
started, but it does wait for all earlier in-flight instructions to
finish.



Re: [Proposal] Add accumulated statistics for wait event

From
Jehan-Guillaume de Rorthais
Date:
Hi Andres, Hi all,

First, thank you for your feedback!

Please find in attachment a patch implementing accumulated wait event stats
only from the backend point of view. As I wrote when I reviewed and rebased the
existing patch, I was uncomfortable with the global approach. I still volunteer
to work/review the original approach is required.

See bellow for comments and some more explanations about what I think might be
improvements over the previous patch.

On Fri, 11 Jun 2021 12:18:07 -0700
Andres Freund <andres@anarazel.de> wrote:

> On 2021-06-05 00:53:44 +0200, Jehan-Guillaume de Rorthais wrote:
> > From 88c2779679c5c9625ca5348eec0543daab5ccab4 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
> > Date: Tue, 1 Jun 2021 13:25:57 +0200
> > Subject: [PATCH 1/2] Add pg_stat_waitaccum view.
> >
> > pg_stat_waitaccum shows counts and duration of each wait events.
> > Each backend/backgrounds counts and measures the time of wait event
> > in every pgstat_report_wait_start and pgstat_report_wait_end. They
> > store those info into their local variables and send to Statistics
> > Collector. We can get those info via Statistics Collector.
> >
> > For reducing overhead, I implemented statistic hash instead of
> > dynamic hash. I also implemented track_wait_timing which
> > determines wait event duration is collected or not.
>
> I object to adding this overhead. The whole selling point for wait
> events was that they are low overhead. I since spent time reducing the
> overhead further, because even just the branches for checking if
> track_activity is enabled are measurable (225a22b19ed).

Agree. The previous patch I rebased was to review it and reopen this discussion,
I even added a small FIXME in pgstat_report_wait_end and
pgstat_report_wait_start about your work:

  //FIXME: recent patch to speed up this call.

In the patch in attachment, I tried to fix this by using kind of an internal
hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows
instrumenting wait events only when required, on the fly, dynamically.

Moreover, I replaced the hash structure with a simple static array for faster
access.

> > From ddb1adc5cd9acc9bc9de16d0cf057124b09fe1e3 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
> > Date: Fri, 4 Jun 2021 18:14:51 +0200
> > Subject: [PATCH 2/2] [POC] Change measuring method of wait event time from
> >  INSTR_TIME to rdtsc.
> >
> > This patch changes the measuring method of wait event time from INSTR_TIME
> > (which uses gettimeofday or clock_gettime) to rdtsc. This might reduce the
> > overhead of the measurement itself.
> >
> > Support for things like converting clock cycles to actual time, or for
> > error handling, is not currently implemented.
>
> rdtsc is a serializing (*) instruction - that's the expensive part. On linux
> clock_gettime() doesn't actually need a syscall. While the vdso call
> implies a bit of overhead over a raw rdtsc, it's a relatively small part
> of the overhead.  See
> https://www.postgresql.org/message-id/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de

I chose to remove the rdtsc part from my patch, as it wasn't clear how much
faster it was compared to the simpler vDSO functions, nor how to accurately
convert the result to human-readable time.


Regarding my take on $subject: for the sake of simplicity of this PoC, I
added instrumentation to log_statement_stats. Despite the per-query context
of the reported log, these really are accumulated stats.

The patch also updates pg_stat_get_waitaccum() so that the accumulated wait
events can be reported from an interactive or batch session.

So using my previous fe-time demo client, you can test it using:

  PGOPTIONS="--log_statement_stats=on" ./fe-time 100

From the logs, I now get (notice the last line):

  LOG:  duration: 3837.194 ms  statement: SELECT * FROM pgbench_accounts
  LOG:  QUERY STATISTICS
  DETAIL:  ! system usage stats:
           !       0.087444 s user, 0.002106 s system, 3.837202 s elapsed
           !       [0.087444 s user, 0.003974 s system total]
           !       25860 kB max resident size
           !       0/0 [0/0] filesystem blocks in/out
           !       0/303 [0/697] page faults/reclaims, 0 [0] swaps
           !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
           !       4/18 [5/18] voluntary/involuntary context switches
           !       Client/ClientWrite 4 calls, 3747102 us elapsed


Using pgbench scale factor 10, the copy query for pgbench_accounts looks like:

  LOG:  duration: 2388.081 ms  statement: copy pgbench_accounts from stdin
  LOG:  QUERY STATISTICS
  DETAIL:  ! system usage stats:
           !       1.373756 s user, 0.252860 s system, 2.388100 s elapsed
           !       [1.397015 s user, 0.264951 s system total]
           !       37788 kB max resident size
           !       0/641584 [0/642056] filesystem blocks in/out
           !       194/4147 [195/4728] page faults/reclaims, 0 [0] swaps
           !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
           !       3263/92 [3390/102] voluntary/involuntary context switches
           !       LWLock/WALBufMapping 3 calls, 654 us elapsed
           !       LWLock/WALWrite 3 calls, 60680 us elapsed
           !       LWLock/CheckpointerComm 3 calls, 318 us elapsed
           !       Client/ClientRead 1 calls, 151 us elapsed
           !       IO/DataFileExtend 16397 calls, 94000 us elapsed
           !       IO/DataFileWrite 14346 calls, 45911 us elapsed
           !       IO/WALInitSync 6 calls, 334840 us elapsed
           !       IO/WALInitWrite 6 calls, 48040 us elapsed
           !       IO/WALSync 17 calls, 353334 us elapsed
           !       IO/WALWrite 8362 calls, 38401 us elapsed

Now, I'm on the fence about the user-facing interaction. I'll keep thinking
about this and report back this week. In the meantime, any feedback on the
backbone of the current implementation is welcome.

Thanks!

Regards,

Attachment

Re: [Proposal] Add accumulated statistics for wait event

From
Andres Freund
Date:
Hi,

On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:
> In the patch in attachment, I tried to fix this by using kind of an internal
> hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to
> "instrument" wait events only when required, on the fly, dynamically.

That's *far worse*. You're adding an indirect function call, which requires
loading a global variable and then a far call to a different function. You're
changing a path that's ~2 instructions with minimal dependencies (and no
branches, i.e. fully out-of-order executable) to something on the order of
~15 instructions with plenty of dependencies and at least two branches
(call, ret).
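
Schematically, and with the hook name invented here, that is the difference
between the two shapes below: the first mirrors (roughly) the current inlined
fast path, the second the hooked variant under discussion:

  /* hypothetical global hook pointer, NULL unless instrumentation is on */
  extern void (*wait_accum_start_hook) (uint32 wait_event_info);

  /* Current fast path (roughly): a single inlined four-byte store. */
  static inline void
  pgstat_report_wait_start(uint32 wait_event_info)
  {
      *(volatile uint32 *) my_wait_event_info = wait_event_info;
  }

  /* Hooked variant (hypothetical name): extra load, branch and far call. */
  static inline void
  pgstat_report_wait_start_hooked(uint32 wait_event_info)
  {
      *(volatile uint32 *) my_wait_event_info = wait_event_info;

      if (wait_accum_start_hook != NULL)          /* load global + branch */
          wait_accum_start_hook(wait_event_info); /* indirect, far call   */
  }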

I doubt there's a path towards this feature without adding the necessary
infrastructure to hot-patch the code - which is obviously quite a
substantial project.

Greetings,

Andres Freund



Re: [Proposal] Add accumulated statistics for wait event

From
Andres Freund
Date:
Hi,

On 2021-06-14 11:27:21 -0700, Andres Freund wrote:
> On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:
> > In the patch in attachment, I tried to fix this by using kind of an internal
> > hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to
> > "instrument" wait events only when required, on the fly, dynamically.
> 
> That's *far worse*. You're adding an indirect function call. Which requires
> loading a global variable and then a far call to a different function. You're
> changing a path that's ~2 instructions with minimal dependencies (and no
> branches (i.e. fully out of order executable) to something on the order of ~15
> instructions with plenty dependencies and at least two branches (call, ret).

In the case at hand it might even be worse, because the external function
call will require registers to be spilled around it. Right now wait events
"use" two registers (one for the wait event, one for my_wait_event_info), but
otherwise don't cause additional spilling. With your change you'd see
register spills/reloads around both wait start and end.

Greetings,

Andres Freund



Re: [Proposal] Add accumulated statistics for wait event

From
Jehan-Guillaume de Rorthais
Date:
Hi,

On Mon, 14 Jun 2021 11:27:21 -0700
Andres Freund <andres@anarazel.de> wrote:

> On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:
> > In the patch in attachment, I tried to fix this by using kind of an internal
> > hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to
> > "instrument" wait events only when required, on the fly, dynamically.  
> 
> That's *far worse*. You're adding an indirect function call. Which requires
> loading a global variable and then a far call to a different function. You're
> changing a path that's ~2 instructions with minimal dependencies (and no
> branches (i.e. fully out of order executable) to something on the order of ~15
> instructions with plenty dependencies and at least two branches (call, ret).

Oh, I didn't realize it would affect all queries, even when
log_statement_stats is off. Thank you for your explanation.

> I doubt there's a path towards this feature without adding the necessary
> infrastructure to hot-patch the code - which is obviously quite a
> substantial project.

Right. Sadly, this kind of project is far beyond what I can do, so I suppose
it's a dead end for me.

I'll study whether and how the sampling approach can be done dynamically.

Thank you,



Re: [Proposal] Add accumulated statistics for wait event

From
Andres Freund
Date:
Hi,

On 2021-06-14 23:20:47 +0200, Jehan-Guillaume de Rorthais wrote:
> > On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:
> > > In the patch in attachment, I tried to fix this by using kind of an internal
> > > hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to
> > > "instrument" wait events only when required, on the fly, dynamically.  
> > 
> > That's *far worse*. You're adding an indirect function call. Which requires
> > loading a global variable and then a far call to a different function. You're
> > changing a path that's ~2 instructions with minimal dependencies (and no
> > branches (i.e. fully out of order executable) to something on the order of ~15
> > instructions with plenty dependencies and at least two branches (call, ret).
> 
> Oh, I didn't realized it would affect all queries, even when log_statement_stats
> was off. Thank you for your explanation.

Maybe I'm just misunderstanding what you were doing? As far as I can tell,
your patch changed pgstat_report_wait_start() to be an indirect function
call - right? Then yes, this adds overhead to everything.

You *could* add a pgstat_report_wait_(start|end)_with_time() or such and
only use that in places that won't have a high frequency. But I just
don't quite see the use-case for that.

Greetings,

Andres Freund



Re: [Proposal] Add accumulated statistics for wait event

From
Jehan-Guillaume de Rorthais
Date:
Hi Andres,

On Mon, 14 Jun 2021 15:01:14 -0700
Andres Freund <andres@anarazel.de> wrote:

> On 2021-06-14 23:20:47 +0200, Jehan-Guillaume de Rorthais wrote:
> > > On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote:  
> > > > In the patch in attachment, I tried to fix this by using kind of an
> > > > internal hook for pgstat_report_wait_start and pgstat_report_wait_end.
> > > > This allows to "instrument" wait events only when required, on the fly,
> > > > dynamically.    
> > > 
> > > That's *far worse*. You're adding an indirect function call. Which
> > > requires loading a global variable and then a far call to a different
> > > function. You're changing a path that's ~2 instructions with minimal
> > > dependencies (and no branches (i.e. fully out of order executable) to
> > > something on the order of ~15 instructions with plenty dependencies and
> > > at least two branches (call, ret).  
> > 
> > Oh, I didn't realized it would affect all queries, even when
> > log_statement_stats was off. Thank you for your explanation.  
> 
> Maybe I just am misunderstanding what you were doing? As far as I can
> tell your patch changed pgstat_report_wait_start() to be an indirect
> function call - right?

Exactly.

I didn't realize this indirection would be so costly on every single call, on
top of the variable assignment itself.

> Then yes, this adds overhead to everything.

I understand now, thank you for the explanation.
For my own curiosity and study, I'll remove this indirection and bench my patch
anyway.

> You *could* add a pgstat_report_wait_(start|end)_with_time() or such and
> only use that in places that won't have a high frequency. But I just
> don't quite see the use-case for that.

Well, it could be useful if we decide to track only a subset of wait events.
In my scenario, I originally wanted to track only ClientWrite, but then I
realized this might be too specific and tried to generalize.
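
For instance, such a timed variant could look like the sketch below
(hypothetical names, not an existing API; waitaccum_record() stands in for
whatever accumulation code ends up being used), kept out of the hot inline
path and called only from low-frequency sites such as the ClientWrite wait:

  /*
   * Hypothetical sketch; not an existing API.  The plain
   * pgstat_report_wait_start()/_end() stay untouched for hot call sites.
   */
  static instr_time wait_start_ts;

  static inline void
  pgstat_report_wait_start_with_time(uint32 wait_event_info)
  {
      pgstat_report_wait_start(wait_event_info);
      INSTR_TIME_SET_CURRENT(wait_start_ts);
  }

  static inline void
  pgstat_report_wait_end_with_time(void)
  {
      instr_time  elapsed;

      INSTR_TIME_SET_CURRENT(elapsed);
      INSTR_TIME_SUBTRACT(elapsed, wait_start_ts);

      /* waitaccum_record() is assumed: add to the per-backend counters */
      waitaccum_record(*my_wait_event_info, INSTR_TIME_GET_MICROSEC(elapsed));

      pgstat_report_wait_end();
  }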

There are probably other ways to deal with this issue, e.g.:

* do NOT include the time lost waiting for the frontend side in the query
  execution time
* expose the frontend part of the query time in log_min_duration_statement,
  auto_explain and pg_stat_statements, in the same fashion we currently do
  with planning and execution time
* have some wait-event sampling mechanism, either in core or one as easy and
  hot-loadable as auto_explain

Thoughts?

Thanks again!

Regards,