Thread: Re: [Proposal] Add accumulated statistics for wait event
On Mon, Jul 23, 2018 at 04:04:42PM +0900, 임명규 wrote: > This proposal is about recording additional statistics of wait events. You should avoid sending things in html format, text format being recommended on those mailing lists... The patch applies after using patch -p0 by the way. I would recommend that you generate your patches using "git format-patch". Here are general guidelines on the matter: https://wiki.postgresql.org/wiki/Submitting_a_Patch Please study those guidelines, those are helpful if you want to get yourself familiar with community process. I have comments about your patch. First, I don't think that you need to count precisely the number of wait events triggered as usually when it comes to analyzing a workload's bottleneck what counts is a periodic *sampling* of events, patterns which can be fetched already from pg_stat_activity and stored say in a different place. This can be achieved easily by using a cron job with an INSERT SELECT query which adds data on a relation storing the event counts. I took the time to look at your patch, and here is some feedback. This does not need a configure switch. I assume that what you did is good to learn the internals of ./configure though. There is no documentation. What does the patch do? What is it useful for? + case PG_WAIT_IPC: + arrayIndex = NUM_WAIT_LWLOCK + + NUM_WAIT_LOCK + NUM_WAIT_BUFFER_PIN + + NUM_WAIT_ACTIVITY + NUM_WAIT_CLIENT + + NUM_WAIT_EXTENSION + eventId; + break; This is ugly and unmaintainable style. You could perhaps have considered an enum instead. pg_stat_get_wait_events should be a set-returning function, which you could filter at SQL level using a PID, so no need of it as an argument. What's the performance penalty? I am pretty sure that this is measurable as wait events are stored for a backend for each I/O operation as well, and you are calling a C routine within an inlined function which is designed to be light-weight, doing only a four-byte atomic operation. -- Michael
Attachment
Hi, that will be a great feature. On 23.07.2018 10:53, Michael Paquier wrote: > I have comments about your patch. First, I don't think that you need to > count precisely the number of wait events triggered as usually when it > comes to analyzing a workload's bottleneck what counts is a periodic > *sampling* of events If I get it right, this patch is not about sampling. It gathers cumulative statistics, which is regrettably missing in PostgreSQL. That's why it should not only count exact number of wait events, but also min time and stddev would be helpful (as in pg_stat_statements). -- Egor Rogov Postgres Professional http://www.postgrespro.com
On Mon, Jul 23, 2018 at 10:53 AM Michael Paquier <michael@paquier.xyz> wrote: > What's the performance penalty? I am pretty sure that this is > measurable as wait events are stored for a backend for each I/O > operation as well, and you are calling a C routine within an inlined > function which is designed to be light-weight, doing only a four-byte > atomic operation. Yes, the question is overhead of measuring durations of individual wait events. It has been proposed before, and there been heated debates about that (see threads [1-3]). It doesn't seem to be a conclusion about this feature. The thing to be said for sure: performance penalty heavily depends on OS/hardware/workload. In some cases overhead is negligible, but in other cases it appears to be huge. 1. https://www.postgresql.org/message-id/flat/559D4729.9080704%40postgrespro.ru 2. https://www.postgresql.org/message-id/flat/CA%2BTgmoYd3GTz2_mJfUHF%2BRPe-bCy75ytJeKVv9x-o%2BSonCGApw%40mail.gmail.com 3. https://www.postgresql.org/message-id/flat/CAG95seUAQVj09KzLwU%2Bz1B-GqdMqerzEkPFR3hn0q88XzMq-PA%40mail.gmail.com ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Michael Paquier <michael@paquier.xyz> writes: > This does not need a configure switch. It probably is there because the OP realizes that most people wouldn't accept having this code compiled in. > What's the performance penalty? I am pretty sure that this is > measurable as wait events are stored for a backend for each I/O > operation as well, and you are calling a C routine within an inlined > function which is designed to be light-weight, doing only a four-byte > atomic operation. On machines with slow gettimeofday(), I suspect the cost of this patch would be staggering. Even with relatively fast gettimeofday, it doesn't look acceptable for calls in hot code paths (for instance, lwlock.c). A bigger problem is that it breaks stuff. There are countless calls to pgstat_report_wait_start/pgstat_report_wait_end that assume they have no side-effects (for example, on errno) and can never fail. I wouldn't trust GetCurrentTimestamp() for either. If the report_wait calls can't be dropped into code with *complete* certainty that they're safe, that's a big cost. Why exactly is this insisting on logging timestamps and not, say, just incrementing a counter? I think doing it like this is almost certain to end in rejection. regards, tom lane
2018-07-23 16:53 (GMT+9), Michael Paquier wrote: > On Mon, Jul 23, 2018 at 04:04:42PM +0900, 임명규 wrote: >> This proposal is about recording additional statistics of wait events. > I have comments about your patch. First, I don't think that you need to > count precisely the number of wait events triggered as usually when it > comes to analyzing a workload's bottleneck what counts is a periodic > *sampling* of events, patterns which can be fetched already from > pg_stat_activity and stored say in a different place. Thanks for your feedback. This proposal is not about *sampling*. Accumulated statistics of wait events information is useful for solving issue. It can measure accurate data. Some case, sampling of events can not find the cause of issue. It lose detail data. For example, some throughput issue occur(ex : disk io), but each wait point occurs only a few milliseconds. In this case, it is highly likely that will not find the cause. > This is ugly and unmaintainable style. I'm sorry. You're right. Think as the PoC. > What's the performance penalty? I have same worries. I just tried pgbench several times. Let me know what some good performance check method. Best regards, MyungKyu, Lim
> On Mon, Jul 23, 2018 at 10:53 AM Michael Paquier <michael@paquier.xyz> wrote: >> What's the performance penalty? I am pretty sure that this is >> measurable as wait events are stored for a backend for each I/O >> operation as well, and you are calling a C routine within an inlined >> function which is designed to be light-weight, doing only a four-byte >> atomic operation. > Yes, the question is overhead of measuring durations of individual wait events. It has been proposed before, and therebeen heated debates about that (see threads [1-3]). It doesn't seem > to be a conclusion about this feature. The thing to be said for sure: > performance penalty heavily depends on OS/hardware/workload. In some cases overhead is negligible, but in other casesit appears to be huge. Thanks for good information. I agree. Performance penalty is exist. But wait stats are demandable and useful. In some cases, it is worth sacrificing performance and using it. So, what do you think about developing as extension? I have another concept proposal. 2. This feature can be implemented as extension if some hooks were provided in following functions, - pgstat_report_wait_start - pgstat_report_wait_end This feature can be turned on/off by on-line config when necessary. Best regards, MyungKyu, Lim
Hi,
I am skeptical about accumulated statistics.
pg_stat_activity now gives necessary information about wait events. It can be easily be used with a polling system that sleeps most of the time to limit the overhead. Measuring the duration of individual wait events is not necessary to know the repartition of the charge.
You can aggregate the results of the polling by application, query, wait events or whatever you want.
I wrote a script for that that can be used interactively or in batch mode to produce reports but many solutions exist .
Best regards Phil
Envoyé : mardi 24 juillet 2018 12:10
À : Alexander Korotkov; pgsql-hackers@postgresql.org
Cc : Woosung Sohn; DoHyung HONG
Objet : RE: Re: [Proposal] Add accumulated statistics for wait event
>> What's the performance penalty? I am pretty sure that this is
>> measurable as wait events are stored for a backend for each I/O
>> operation as well, and you are calling a C routine within an inlined
>> function which is designed to be light-weight, doing only a four-byte
>> atomic operation.
> Yes, the question is overhead of measuring durations of individual wait events. It has been proposed before, and there been heated debates about that (see threads [1-3]). It doesn't seem
> to be a conclusion about this feature. The thing to be said for sure:
> performance penalty heavily depends on OS/hardware/workload. In some cases overhead is negligible, but in other cases it appears to be huge.
Thanks for good information.
I agree. Performance penalty is exist.
But wait stats are demandable and useful. In some cases, it is worth sacrificing performance and using it.
So, what do you think about developing as extension? I have another concept proposal.
2. This feature can be implemented as extension if some hooks were provided in following functions,
- pgstat_report_wait_start
- pgstat_report_wait_end
This feature can be turned on/off by on-line config when necessary.
Best regards,
MyungKyu, Lim
On 07/23/2018 03:57 PM, Tom Lane wrote: > Michael Paquier <michael@paquier.xyz> writes: >> This does not need a configure switch. > > It probably is there because the OP realizes that most people wouldn't > accept having this code compiled in. > >> What's the performance penalty? I am pretty sure that this is >> measurable as wait events are stored for a backend for each I/O >> operation as well, and you are calling a C routine within an inlined >> function which is designed to be light-weight, doing only a four-byte >> atomic operation. > > On machines with slow gettimeofday(), I suspect the cost of this > patch would be staggering. Even with relatively fast gettimeofday, > it doesn't look acceptable for calls in hot code paths (for instance, > lwlock.c). > Yeah. I wonder if we could measure the time for a small fraction of the wait events, and estimate the actual duration from that. > A bigger problem is that it breaks stuff. There are countless > calls to pgstat_report_wait_start/pgstat_report_wait_end that > assume they have no side-effects (for example, on errno) and > can never fail. I wouldn't trust GetCurrentTimestamp() for either. > If the report_wait calls can't be dropped into code with *complete* > certainty that they're safe, that's a big cost. > > Why exactly is this insisting on logging timestamps and not, > say, just incrementing a counter? I think doing it like this > is almost certain to end in rejection. > Because the number of times you hit wait event may not correlate with the time you spent waiting on it. So a simple counter is not the most useful thing. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 07/24/2018 12:06 PM, MyungKyu LIM wrote: > 2018-07-23 16:53 (GMT+9), Michael Paquier wrote: >> On Mon, Jul 23, 2018 at 04:04:42PM +0900, 임명규 wrote: >>> This proposal is about recording additional statistics of wait events. > >> I have comments about your patch. First, I don't think that you need to >> count precisely the number of wait events triggered as usually when it >> comes to analyzing a workload's bottleneck what counts is a periodic >> *sampling* of events, patterns which can be fetched already from >> pg_stat_activity and stored say in a different place. > > Thanks for your feedback. > > This proposal is not about *sampling*. > Accumulated statistics of wait events information is useful for solving > issue. It can measure accurate data. > > Some case, sampling of events can not find the cause of issue. It lose detail data. > For example, some throughput issue occur(ex : disk io), but each wait point > occurs only a few milliseconds. > In this case, it is highly likely that will not find the cause. > I think it's highly likely that it will find the cause. The idea of sampling is that while you don't measure the timing directly, you can infer it from the frequency of the wait events in the samples. So if you see the backend reports a particular wait event in 75% of samples, it probably spent 75% time waiting on it. I'm not saying sampling is perfect and it certainly is less convenient than what you propose. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
> Some case, sampling of events can not find the cause of issue. It lose detail data.
> For example, some throughput issue occur(ex : disk io), but each wait point
> occurs only a few milliseconds.
It loses non meaningful details and it's in fact a good point. In this example, sampling will definitely find the cause and won't cost resources.
Being as precise as possible to define a wait event is very useful but knowing precisely the duration of each event is less useful in terms of tuning.
Example of sampling + group by/order by percentage of activity :
traqueur 2.05.00 - performance tool for PostgreSQL 9.3 => 11
INFORMATION, no connection parameters provided, connecting to dedicated database ...
INFORMATION, connected to dedicated database traqueur
INFORMATION, PostgreSQL version : 110000
INFORMATION, sql preparation ...
INFORMATION, sql execution ...
busy_pc | distinct_exe | application_name | wait_event_type
---------+--------------+------------------+-----------------
206 | 8 / 103 | mperf |
62 | 2 / 31 | mperf | LWLock
20 | 3 / 10 | mperf | IO
12 | 1 / 6 | mperf | Client
(4 rows)
busy_pc | distinct_exe | application_name | wait_event | wait_event_type
---------+--------------+------------------+-----------------------+-----------------
206 | 8 / 103 | mperf | |
62 | 2 / 31 | mperf | WALWriteLock | LWLock
14 | 1 / 7 | mperf | DataFileImmediateSync | IO
12 | 1 / 6 | mperf | ClientRead | Client
2 | 1 / 1 | mperf | DataFileWrite | IO
2 | 1 / 1 | mperf | DataFileRead | IO
2 | 1 / 1 | mperf | WALInitWrite | IO
No need to know the exact duration of each event to identify the bottleneck(s)...
Best regards
Phil
Envoyé : mardi 24 juillet 2018 17:45
À : pgsql-hackers@lists.postgresql.org
Objet : Re: [Proposal] Add accumulated statistics for wait event
On 07/24/2018 12:06 PM, MyungKyu LIM wrote:
> 2018-07-23 16:53 (GMT+9), Michael Paquier wrote:
>> On Mon, Jul 23, 2018 at 04:04:42PM +0900, 임명규 wrote:
>>> This proposal is about recording additional statistics of wait events.
>
>> I have comments about your patch. First, I don't think that you need to
>> count precisely the number of wait events triggered as usually when it
>> comes to analyzing a workload's bottleneck what counts is a periodic
>> *sampling* of events, patterns which can be fetched already from
>> pg_stat_activity and stored say in a different place.
>
> Thanks for your feedback.
>
> This proposal is not about *sampling*.
> Accumulated statistics of wait events information is useful for solving
> issue. It can measure accurate data.
>
> Some case, sampling of events can not find the cause of issue. It lose detail data.
> For example, some throughput issue occur(ex : disk io), but each wait point
> occurs only a few milliseconds.
> In this case, it is highly likely that will not find the cause.
>
I think it's highly likely that it will find the cause. The idea of
sampling is that while you don't measure the timing directly, you can
infer it from the frequency of the wait events in the samples. So if you
see the backend reports a particular wait event in 75% of samples, it
probably spent 75% time waiting on it.
I'm not saying sampling is perfect and it certainly is less convenient
than what you propose.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 24, 2018 at 04:23:03PM +0000, Phil Florent wrote: > It loses non meaningful details and it's in fact a good point. In this > example, sampling will definitely find the cause and won't cost > resources. The higher the sampling frequency, the more details you get, with the most load on the instance. So if you are able to take an infinity of samples, where registering multiple times the same event for the same backend also matters because its overall weight gets higher and it shows up higher in profiles, then you would be able converge to the set of results that this patch adds. Sampling method, especially its frequency, is something controlled by the client and not the server. Approaches like the one proposed here push the load on the server-side, unconditionally, for *all* backends, and this has its cost. Even if you have spiky workloads, sampling may miss those, but even with adding counters for each event you would need to query the table holding the counters at an insane frequency to be able to perhaps get something out of it as you need to do sampling of the counters as well to extract deltas. As Tomas has mentioned up-thread, sampling is light-weight, as-is the current design for wait events. Even if it is not perfect because it cannot give exact numbers, it would find bottlenecks in really most cases, and that's what matters. If not, increasing the sampling frequency makes things easier to detect as well. What would be the point of taking only one sample every checkpoint for example? There may be a benefit in having counters, I don't know the answer to that, though the point would be to make sure that there is a specific set of workloads where it makes sense, still my gut feeling is that sampling would be able to detect those anyway. (I am not a computer scientist by default but a physicist, think fluid dynamics and turbulence, and I had my load of random events and signal analysis as well. All that is just statistics with attempts to approach reality, where sampling is a life-saver over exactitude of measurements.) Adding hooks is not acceptable to me either, those have a cost, and it is not clear what's the benefit we can get into putting hooks in such a place for cases other than sampling and counters... -- Michael
Attachment
On Tue, Jul 24, 2018 at 04:23:03PM +0000, Phil Florent wrote:
> It loses non meaningful details and it's in fact a good point. In this
> example, sampling will definitely find the cause and won't cost
> resources.
The higher the sampling frequency, the more details you get, with the
most load on the instance. So if you are able to take an infinity of
samples, where registering multiple times the same event for the same
backend also matters because its overall weight gets higher and it shows
up higher in profiles, then you would be able converge to the set of
results that this patch adds. Sampling method, especially its
frequency, is something controlled by the client and not the server.
Approaches like the one proposed here push the load on the server-side,
unconditionally, for *all* backends, and this has its cost.
Even if you have spiky workloads, sampling may miss those, but even with
adding counters for each event you would need to query the table holding
the counters at an insane frequency to be able to perhaps get something
out of it as you need to do sampling of the counters as well to extract
deltas.
As Tomas has mentioned up-thread, sampling is light-weight, as-is the
current design for wait events. Even if it is not perfect because it
cannot give exact numbers, it would find bottlenecks in really most
cases, and that's what matters. If not, increasing the sampling
frequency makes things easier to detect as well. What would be the
point of taking only one sample every checkpoint for example?
There may be a benefit in having counters, I don't know the answer to
that, though the point would be to make sure that there is a specific
set of workloads where it makes sense, still my gut feeling is that
sampling would be able to detect those anyway.
(I am not a computer scientist by default but a physicist, think fluid
dynamics and turbulence, and I had my load of random events and signal
analysis as well. All that is just statistics with attempts to approach
reality, where sampling is a life-saver over exactitude of
measurements.)
Adding hooks is not acceptable to me either, those have a cost, and it
is not clear what's the benefit we can get into putting hooks in such a
place for cases other than sampling and counters...
--
Michael
Hi,
I agree with that. PostgreSQL 10 is really great, tuning tools based on sampling of pg_stat_activity became accurate without any modification.
Best regards
Phil
Envoyé : jeudi 26 juillet 2018 03:24
À : Phil Florent
Cc : Tomas Vondra; pgsql-hackers@lists.postgresql.org
Objet : Re: [Proposal] Add accumulated statistics for wait event
> It loses non meaningful details and it's in fact a good point. In this
> example, sampling will definitely find the cause and won't cost
> resources.
The higher the sampling frequency, the more details you get, with the
most load on the instance. So if you are able to take an infinity of
samples, where registering multiple times the same event for the same
backend also matters because its overall weight gets higher and it shows
up higher in profiles, then you would be able converge to the set of
results that this patch adds. Sampling method, especially its
frequency, is something controlled by the client and not the server.
Approaches like the one proposed here push the load on the server-side,
unconditionally, for *all* backends, and this has its cost.
Even if you have spiky workloads, sampling may miss those, but even with
adding counters for each event you would need to query the table holding
the counters at an insane frequency to be able to perhaps get something
out of it as you need to do sampling of the counters as well to extract
deltas.
As Tomas has mentioned up-thread, sampling is light-weight, as-is the
current design for wait events. Even if it is not perfect because it
cannot give exact numbers, it would find bottlenecks in really most
cases, and that's what matters. If not, increasing the sampling
frequency makes things easier to detect as well. What would be the
point of taking only one sample every checkpoint for example?
There may be a benefit in having counters, I don't know the answer to
that, though the point would be to make sure that there is a specific
set of workloads where it makes sense, still my gut feeling is that
sampling would be able to detect those anyway.
(I am not a computer scientist by default but a physicist, think fluid
dynamics and turbulence, and I had my load of random events and signal
analysis as well. All that is just statistics with attempts to approach
reality, where sampling is a life-saver over exactitude of
measurements.)
Adding hooks is not acceptable to me either, those have a cost, and it
is not clear what's the benefit we can get into putting hooks in such a
place for cases other than sampling and counters...
--
Michael
On Thu, July 26, 2018 at 1:25 AM, Michael Paquier wrote: > Even if you have spiky workloads, sampling may miss those, but even with adding counters for each event > you would need to query the table holding the counters at an insane frequency to be able to perhaps get > something out of it as you need to do sampling of the counters as well to extract deltas. Hi, I was wondering why PostgreSQL did not have the number of wait events and wait time that other databases such as Oraclehad as a function, and when I was looking for related threads, I got to this thread. I am a DB beginner, so please tell me. It says that you can find events that are bottlenecks in sampling, but as you sawabove, you can not find events shorter than the sampling interval, right? If this short event has occurred quite a number of times and it was a considerable amount of time in total, can you solvethis problem with sampling? # I have asked, but I could not understand much of the discussion above and I do not know if such a case can exist. Also, I think that it can be solved by higher the sampling frequency, but the load will be high, right? I think this methodis not very practical. Moreover, I think that it is not implemented due to the reason that sampling is good as described above and because it affectsperformance. How about switching the function on / off and implementing with default off? Do you care about performance degradation during bottleneck investigation? When investigating the bottleneck, I think that it is better to find the cause even at the expense of performance. # If you can detect with high frequency sampling, I think that it depends on whether the sampling or the function(the numberof wait events and wait time) is high load. Since I am a DB beginner, I think that it is saying strange things. I am glad if you tell me. ----- Naoki Yotsunaga
- Sample the sessions and consider the active ones. You need to know if they are waiting (PostgreSQL now provides detailed wait events) or if they are on the CPU
- Optionally collect information on the system context at the time of sampling (CPU, memory...), it can be provided by many tools like psutil python library for example
Envoyé : jeudi 4 octobre 2018 10:31
À : 'Michael Paquier'; Phil Florent
Cc : Tomas Vondra; pgsql-hackers@lists.postgresql.org
Objet : RE: [Proposal] Add accumulated statistics for wait event
> Even if you have spiky workloads, sampling may miss those, but even with adding counters for each event
> you would need to query the table holding the counters at an insane frequency to be able to perhaps get
> something out of it as you need to do sampling of the counters as well to extract deltas.
Hi, I was wondering why PostgreSQL did not have the number of wait events and wait time that other databases such as Oracle had as a function, and when I was looking for related threads, I got to this thread.
I am a DB beginner, so please tell me. It says that you can find events that are bottlenecks in sampling, but as you saw above, you can not find events shorter than the sampling interval, right?
If this short event has occurred quite a number of times and it was a considerable amount of time in total, can you solve this problem with sampling?
# I have asked, but I could not understand much of the discussion above and I do not know if such a case can exist.
Also, I think that it can be solved by higher the sampling frequency, but the load will be high, right? I think this method is not very practical.
Moreover, I think that it is not implemented due to the reason that sampling is good as described above and because it affects performance.
How about switching the function on / off and implementing with default off?
Do you care about performance degradation during bottleneck investigation?
When investigating the bottleneck, I think that it is better to find the cause even at the expense of performance.
# If you can detect with high frequency sampling, I think that it depends on whether the sampling or the function(the number of wait events and wait time) is high load.
Since I am a DB beginner, I think that it is saying strange things.
I am glad if you tell me.
-----
Naoki Yotsunaga
On Thu, Oct 04, 2018 at 09:32:37AM +0000, Phil Florent wrote: > I am a DB beginner, so please tell me. It says that you can find > events that are bottlenecks in sampling, but as you saw above, you can > not find events shorter than the sampling interval, right? Yes, which is why it would be as simple as making the interval shorter, still not too short so as it bloats the amount of information fetched which needs to be stored and afterwards (perhaps) treated for analysis. This gets rather close to signal processing. A simple image is for example, assuming that event A happens 100 times in an interval of 1s, and event B only once in the same interval of 1s, then if the snapshot interval is only 1s, then in the worst case A would be treated an equal of B, which would be wrong. -- Michael
Attachment
À : Phil Florent
Cc : Yotsunaga, Naoki; Tomas Vondra; pgsql-hackers@lists.postgresql.org
Objet : Re: [Proposal] Add accumulated statistics for wait event
> I am a DB beginner, so please tell me. It says that you can find
> events that are bottlenecks in sampling, but as you saw above, you can
> not find events shorter than the sampling interval, right?
Yes, which is why it would be as simple as making the interval shorter,
still not too short so as it bloats the amount of information fetched
which needs to be stored and afterwards (perhaps) treated for analysis.
This gets rather close to signal processing. A simple image is for
example, assuming that event A happens 100 times in an interval of 1s,
and event B only once in the same interval of 1s, then if the snapshot
interval is only 1s, then in the worst case A would be treated an equal
of B, which would be wrong.
--
Michael
On Thu, Oct 4, 2018 at 0:54 AM, Phil Florent wrote:
Phil, Michael, I appreciate your polite comments.
I understand as follows.
We can find it if we shorten the sampling interval, but a lot of information comes out.
# The balance is important.
Also, it is not good unless we have enough samples.
And I have to do various other things.
Is my understand correct?
It seems to me that it is difficult for me if my understanding is right.
Is DBA really able to solve bottlenecks with sampling?
# Since I am a beginner, I feel that way. And other people may not feel it difficult.
What I would like to say is that if we have information on the number of wait events and the wait time(like other DB), we can investigate more easily.
Of course, I understand that it also affects performance. So, I suggest a way that it can switch on and off, defaults is off.
-----
Naoki, Yotsunaga.
Bertrand DROUVOT wrote > Hello Guys, > > As you mentioned Oracle like active session history sampling in this > thread, I just want to let you know that I am working on a brand new > extension to provide this feature. > > You can find the extension here: https://github.com/pgsentinel/pgsentinel > > [...] > > If you want to have a look, give your thoughts, you are welcome. > > Bertrand +1! just a regret: ash sampling interval can not be smaller than a second ;o( Cordialement PAscal -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
On Thu, Oct 4, 2018 at 8:22 PM, Yotsunaga Naoki wrote:
Hi, I understood and thought of your statistic comment once again. In the case of sampling, is there enough statistic to investigate?
In the case of a long SQL, I think that it is possible to obtain a sufficient sampling number.
However, in the case of about 1 minute of SQL, only 60 samples can be obtained at most.
#Because legard’s comment.
https://www.postgresql.org/message-id/1539158356795-0.post%40n3.nabble.com
Does this sampling number of 60 give information that I really want?
Perhaps it is not to miss the real problem part?
---------------------------------------
Naoki, Yotsunaga.
Hello, You are right, sampling has to be "tuned" regarding the event(s) you want to catch. Sampling of 1 second interval is good with treatments that take hours, and not enough for a minute or a second analysis. May I invite you to try it, using PASH-viewer (github) with pgsentinel (github). Changing pgsentiel.c sampling from 1 second rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, ash_sampling_period * 1000L,PG_WAIT_EXTENSION); to 1/10 second rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, ash_sampling_period * 100L,PG_WAIT_EXTENSION); seems the good balance for me (for analysis periods from a few seconds to minutes). Regards PAscal -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
Is DBA really able to solve bottlenecks with sampling?
What I would like to say is that if we have information on the number of wait events and the wait time(like other DB), we can investigate more easily.
Envoyé : lundi 29 octobre 2018 02:20
À : 'Phil Florent'; 'Michael Paquier'
Cc : 'Tomas Vondra'; 'pgsql-hackers@lists.postgresql.org'
Objet : RE: [Proposal] Add accumulated statistics for wait event
On Thu, Oct 4, 2018 at 8:22 PM, Yotsunaga Naoki wrote:
Hi, I understood and thought of your statistic comment once again. In the case of sampling, is there enough statistic to investigate?
In the case of a long SQL, I think that it is possible to obtain a sufficient sampling number.
However, in the case of about 1 minute of SQL, only 60 samples can be obtained at most.
#Because legard’s comment.
https://www.postgresql.org/message-id/1539158356795-0.post%40n3.nabble.com
Does this sampling number of 60 give information that I really want?
Perhaps it is not to miss the real problem part?
---------------------------------------
Naoki, Yotsunaga.
On Mon, Oct 29, 2018 at 1:52 AM, Phil Florent wrote:
Hi, thank you for comments.
>Yes you will be able to solve bottlenecks with sampling. In interactive mode, a 1s interval is probably too large. I use 0s1 - 0s01 with my tool and it is normally OK.
With the tool you are using, can you sample at intervals shorter than 1 second?
If you can, you can get enough sampling number and you can also acquire short events.
>Since grafana is now able to connect directly to a postgresql source, I use it to display the information collected from pg_stat_activity and psutil ( e.g https://pgphil.ovh/traqueur_dashboard_02.php page is written in french but panels are in english)
It is wonderful to visualize.
Especially for beginners like me.
>Other DB have accumulated statistics but you can notice that sampling is also their most modern method.
>E.g Oracle DB : 20 years ago you already had tools like "utlbstat/utlestat" . Then you had "statspack". Those tools were based on accumulated statistics and the reports were based on differences between 2 points. It was useful to solve major problems but it was limited and not precise enough in many cases.
>The preferred feature to identify bottlenecks in the Oracle world is now ASH (active session history). It can help with major problems, specific problems AND it can identify short blockages.
>Too bad it is licensed as an option of their Enterprise Edition but similar tools exist and they are also based on sampling of the activity.
>With the "official" ASH, sampling and archiving are done internally and you have a circular memory zone dedicated to the feature. Hence the overhead is lower but that's all.
>The most advanced interactive tool is called "snapper" and it is also based on sampling.
Thanks. I will check it.
The current bottleneck survey method, from sampling, I can know the number (ratio) of waiting events.
Then, investigate from those with a high number of times (ratio).
Do you agree with this recognition?
---------------------------------------
Naoki, Yotsunaga.
On Sat, Nov 3, 2018 at 1:28 AM, Phil Florent wrote:
>2) it consumes system resources
While the system is running, you are always sampling system information, do not you? Like Oracle ASH.
If so, does sampling have no significant impact on performance? Even if the interval is 0.01 s or more.
>The main interest of sampling is to discard negligible activity to allow the DBA to work on meaningful queries and events.
In other words, do you mean narrowing down candidate of problems?
---------------------------
Naoki Yotsunaga
On Mon, Nov 5, 2018 at 4:26 PM, Naoki Yotsunaga wrote:
>>2) it consumes system resources
>While the system is running, you are always sampling system information, do not you? Like Oracle ASH.
I don’t understand well how sampling is used.
In which scene do you use the sampling? Or is it both scenes? Or is it a different scene?
A) Perform sampling in order to obtain information on the entire DB while DB is in operation.
B) I know roughly the process to be late. When investigating the reason for delaying the processing, sampling is performed on the processing.
---------------------------
Naoki Yotsunaga
On Sun, Oct 28, 2018 at 6:39 PM, legrand legrand wrote: Hi, Thanks for comments. I had overlooked the reply from you. >You are right, sampling has to be "tuned" regarding the event(s) you want to catch. I see. For tuning, you need to know the length of processing you want to sample? > May I invite you to try it, using PASH-viewer (github) with pgsentinel (github). I'll check and try it. ------------------ Naoki Yotsunaga
On Tue, Nov 6, 2018 at 04:26:03AM +0000, Yotsunaga, Naoki wrote: > On Sat, Nov 3, 2018 at 1:28 AM, Phil Florent wrote: > > > > >2) it consumes system resources > > While the system is running, you are always sampling system information, do not > you? Like Oracle ASH. > > If so, does sampling have no significant impact on performance? Even if the > interval is 0.01 s or more. I am replying late, but one of the reasons that sampling is used is that decreasing the sampling interval increases to overhead of the sampling process, but doesn't affect the running backends. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Wed, Nov 21, 2018 at 9:27 PM, Bruce Momjian wrote: Hi, thank you for the information. I understood that sampling is effective for investigation of waiting events. By the way, you can see the number of wait events with "LWLOCK_STATS", right? Is this function implemented because it is necessary to know the number of waiting events for investigation? If so, is not that the number of wait events is useful information? Now, I need to rebuild to get this information and I feel inconvenience. So, how about checking the number of wait events in the view? Also, I think that it will be useful if you know the waiting time. I think that it is easy to investigate when it is clearly known how long waiting time is occupied with processing time. -- Naoki Yotsunaga
From: Yotsunaga, Naoki [mailto:yotsunaga.naoki@jp.fujitsu.com] > By the way, you can see the number of wait events with "LWLOCK_STATS", right? > Is this function implemented because it is necessary to know the number > of waiting events for investigation? > If so, is not that the number of wait events is useful information? > Now, I need to rebuild to get this information and I feel inconvenience. > > So, how about checking the number of wait events in the view? > Also, I think that it will be useful if you know the waiting time. > I think that it is easy to investigate when it is clearly known how long > waiting time is occupied with processing time. That's interesting. It should be convenient for PG developers to be able to see wait events of each session and the wholeinstance to improve PostgreSQL. I developed a flight recorder feature for a proprietary database (not based on PostgreSQL),which recorded various events, including waits, in the ring buffer for each session. It could dump the ringbuffers into external files in CSV format, so that we can load them into a database or a spreadsheet to analyze the timemodel of SQL execution. That helped us a lot to eliminate many bottlenecks. MySQL says a similar thing about its PerformanceSchema: https://dev.mysql.com/doc/refman/8.0/en/performance-schema-examples.html "Tune the code (this applies to storage engine or server developers only)." Regards Takayuki Tsunakawa
Hi all, I think sampling like Oracle ASH should work for the DBA to find probable bottlenecks in many cases (, so I hope PostgreSQLwill incorporate it...) On the other hand, it seems to have the following disadvantages, some of which othershave already pointed out: 1. Doesn't provide precise data Sampling could miss intermittent short waits, e.g., buffer content lock waits during checkpoints. This might make it difficultor impossible to solve transient performance problems, such as infrequent 100 millisecond response times while thenormal response time is a few milliseconds. The proposed wait event collection doesn't miss anything. 2. Overuses resources We may be able to shorten the sampling interval to 10 ms or even 1 ms to detect short periods of problems. However, thesampled data of active sessions become voluminous in memory and storage. It would take longer to analyze those samples. Also, the background sampling process prevents the CPU core from becoming idle to save power, which bgwriter andwalwriter tries to avoid by hibernation. The proposed wait event collection just records what actually happened. No waste. Would it use many resources if waitshappen frequently? That leads to our motivation to reduce waits. 3. Cannot determine the impact or illness of waits just by sampling or counting without time As the following MySQL and Oracle manual articles describe, precise measurement of wait count and time helps to judge theimpact and justify the remedy. They can measure the whole SQL execution and its various processing steps (parse, plan,sort, etc.) as well as waits, so that the most significant areas can be determined. Also, sampling cannot tell if a single wait took long or the same waits occurred repeatedly in succession (at least easily.) Do the sampled waits indicate an abnormal I/O (which took 2 ms while the normal time is 50 us)? [MySQL] Chapter 26 MySQL Performance Schema https://dev.mysql.com/doc/refman/8.0/en/performance-schema.html -------------------------------------------------- The Performance Schema monitors server events. An “event” is anything the server does that takes time and has been instrumentedso that timing information can be collected. In general, an event could be a function call, a wait for the operatingsystem, a stage of an SQL statement execution such as parsing or sorting, or an entire statement or group of statements.Event collection provides access to information about synchronization calls (such as for mutexes) file and tableI/O, table locks, and so forth for the server and for several storage engines. Current events are available, as well as event histories and summaries. This enables you to determine how many times instrumentedactivities were performed and how much time they took. Event information is available to show the activitiesof specific threads, or activity associated with particular objects such as a mutex or file. -------------------------------------------------- [Oracle] https://docs.oracle.com/en/database/oracle/oracle-database/18/tdppt/automatic-database-performance-monitoring.html#GUID-32E92AEC-AF1A-4602-B998-3250920CD3BE -------------------------------------------------- The goal of database performance tuning is to reduce the DB time of the system for a given workload. By reducing DB time,the database can support more user requests by using the same or fewer resources. ADDM reports system resources thatare using a significant portion of DB time as problem areas and sorts them in descending order by the amount of relatedDB time spent. -------------------------------------------------- Instance Tuning Using Performance Views https://docs.oracle.com/en/database/oracle/oracle-database/18/tgdba/instance-tuning-using-performance-views.html#GUID-07982549-507F-4465-8843-7F753BCF8F99 -------------------------------------------------- Wait event statistics include the number of times an event was waited for and the time waited for the event to complete.If the initialization parameter TIMED_STATISTICS is set to true, then you can also see how long each resource waswaited for. To minimize user response time, reduce the time spent by server processes waiting for event completion. Not all wait eventshave the same wait time. Therefore, it is more important to examine events with the most total time waited rather thanwait events with a high number of occurrences. Usually, it is best to set the dynamic parameter TIMED_STATISTICS to trueat least while monitoring performance. -------------------------------------------------- https://docs.oracle.com/en/database/oracle/oracle-database/18/tgdba/measuring-database-performance.html#GUID-811E9E65-C64A-4028-A90E-102BBFF6E68F 5.2.3 Using Wait Events without Timed Statistics -------------------------------------------------- If TIMED_STATISTICS is set to FALSE, then the amount of time spent waiting for an event is not available. Therefore, it isonly possible to order wait events by the number of times each event was waited for. Although the events with the largestnumber of waits might indicate a potential bottleneck, they might not be the main bottleneck. This situation can happenwhen an event is waited for a large number of times, but the total time waited for that event is small. Conversely,an event with fewer waits might be a bigger bottleneck if the wait time accounts for a significant proportionof the total wait time. Without the wait times to use for comparison, it is difficult to determine whether a waitevent is worth investigating. -------------------------------------------------- 10.2.2 Using Wait Event Statistics to Drill Down to Bottlenecks -------------------------------------------------- The most effective way to use wait event data is to order the events by the wait time. This is only possible if TIMED_STATISTICSis set to true. Otherwise, the wait events can only be ranked by the number of times waited, which is oftennot the ordering that best represents the problem. To get an indication of where time is spent, follow these steps: 1. Examine the data collection for V$SYSTEM_EVENT. The events of interest should be ranked by wait time. Identify the wait events that have the most significant percentage of wait time. ... Alternatively, look at the Top 5 Timed Events section at the beginning of the Automatic Workload Repository report. Thissection automatically orders the wait events (omitting idle events), and calculates the relative percentage: Top 5 Timed Events ~~~~~~~~~~~~~~~~~~ % Total Event Waits Time (s) Call Time -------------------------------------- ------------ ----------- --------- CPU time 559 88.80 log file parallel write 2,181 28 4.42 SQL*Net more data from client 516,611 27 4.24 db file parallel write 13,383 13 2.04 db file sequential read 563 2 .27 2. Look at the number of waits for these events, and the average wait time. For example, for I/O related events, the averagetime might help identify whether the I/O system is slow. The following example of this data is taken from the WaitEvent section of the AWR report: Avg Total Wait wait Waits Event Waits Timeouts Time (s) (ms) /txn --------------------------- --------- --------- ---------- ------ --------- log file parallel write 2,181 0 28 13 41.2 SQL*Net more data from clie 516,611 0 27 0 9,747.4 db file parallel write 13,383 0 13 1 252.5 3. The top wait events identify the next places to investigate. A table of common wait events is listed in Table 10-1. Itis usually a good idea to also have quick look at high-load SQL. 4. Examine the related data indicated by the wait events to see what other information this data provides. Determine whetherthis information is consistent with the wait event data. In most situations, there is enough data to begin developinga theory about the potential causes of the performance bottleneck. 5. To determine whether this theory is valid, cross-check data you have examined with other statistics available for consistency.The appropriate statistics vary depending on the problem, but usually include load profile-related data in V$SYSSTAT,operating system statistics, and so on. Perform cross-checks with other data to confirm or refute the developingtheory. -------------------------------------------------- So, why don't we have the proposed wait event count/time data? I hope we can nurture this to become a database profilingtool like MySQL and Oracle. This is the first step. I think it would be useful to have both sampling and precisestatistics. Oracle has both, and MySQL has the latter (I don't know why MySQL doesn't provide sampling, because thePerformance Schema should probably have been developed after Oracle's ASH.) What would make us conservative about doing this? Skimming the old thread, the remaining concern is the timer overhead. As the following article suggests, some lightweight timers seem to be available. We can turn the timing off bydefault if they aren't light enough. Performance Schema Timers https://dev.mysql.com/doc/refman/8.0/en/performance-schema-timing.html Regards Takayuki Tsunakawa
On 1/7/19 6:34 AM, Tsunakawa, Takayuki wrote: > 1. Doesn't provide precise data > Sampling could miss intermittent short waits, e.g., buffer content lock waits during checkpoints. This might make it difficultor impossible to solve transient performance problems, such as infrequent 100 millisecond response times while thenormal response time is a few milliseconds. > The proposed wait event collection doesn't miss anything. > > 2. Overuses resources > We may be able to shorten the sampling interval to 10 ms or even 1 ms to detect short periods of problems. However, thesampled data of active sessions become voluminous in memory and storage. It would take longer to analyze those samples. Also, the background sampling process prevents the CPU core from becoming idle to save power, which bgwriter andwalwriter tries to avoid by hibernation. > The proposed wait event collection just records what actually happened. No waste. Would it use many resources if waitshappen frequently? That leads to our motivation to reduce waits. FIY, wait events have been added in PoWA by using pg_wait_sampling extension : https://rjuju.github.io/postgresql/2018/07/09/wait-events-support-for-powa.html pg_wait_sampling sample the wait events in shared memory and PoWA store them.
From: Adrien NAYRAT [mailto:adrien.nayrat@anayrat.info] > FIY, wait events have been added in PoWA by using pg_wait_sampling > extension : > https://rjuju.github.io/postgresql/2018/07/09/wait-events-support-for- > powa.html > > pg_wait_sampling sample the wait events in shared memory and PoWA store > them. Great. Also FYI, Amazon RDS/Aurora Performance Insights already provides a load profiling feature based on the wait eventscollected from pg_stat_activity, which samples once a second. An intuitive 5 minute video for introduction is here: Using Amazon RDS Performance Insights https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html Although I cannot see the graphics as I'm almost blind, the explanation sounds like it mimics Oracle Enterprise Manager. Cool. Aren't you thinking of incorporating your work into PostgreSQL as a contrib like pg_stat_statements? Regards Takayuki Tsunakawa
On Thu, Dec 20, 2018 at 8:48 PM Yotsunaga, Naoki <yotsunaga.naoki@jp.fujitsu.com> wrote: > If so, is not that the number of wait events is useful information? My theory is that the number of wait events is NOT useful information, or at least not nearly as useful the results of a sampling approach. The data that LWLOCK_STATS produce are downright misleading -- they lead you to think that the bottlenecks are in different places than they really are, because the locks that produce the most waiting can be 5th or 10th in terms of the number of wait events. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 10, 2019 at 8:42 PM, Robert Hass wrote: Thanks for comments. >or at least not nearly as useful the results of a sampling approach. I agree with your opinion. Because it can't be asserted that the wait event is a bottleneck just because the number of wait event is large. The same thing is mentioned in Oracle. It also suggests that it is important to acquire waiting time for that. ---- https://docs.oracle.com/en/database/oracle/oracle-database/18/tgdba/measuring-database-performance.html#GUID-811E9E65-C64A-4028-A90E-102BBFF6E68F 5.2.3 Using Wait Events without Timed Statistics ---- >The data that LWLOCK_STATS produce are downright misleading Is that so? I do not know the need for this function. --- Naoki Yotsunaga
From: Robert Haas [mailto:robertmhaas@gmail.com] > My theory is that the number of wait events is NOT useful information, > or at least not nearly as useful the results of a sampling approach. > The data that LWLOCK_STATS produce are downright misleading -- they > lead you to think that the bottlenecks are in different places than > they really are, because the locks that produce the most waiting can > be 5th or 10th in terms of the number of wait events. I understood you're saying that the number of waits alone does not necessarily indicate the bottleneck, because a wait withfewer counts but longer time can take a large portion of the entire SQL execution time. So, wait time is also useful. I think that's why Oracle describes and MySQL provides precise count and time without sampling. Haven't LOCK_STATS been helpful for PG developers? IIRC, it was used to pinpoint the bottleneck and evaluate the patch toimprove shared buffers, WAL buffers, ProcArray, etc. Regards Takayuki Tsunakawa
From: Robert Haas [mailto:robertmhaas@gmail.com]
> My theory is that the number of wait events is NOT useful information,
> or at least not nearly as useful the results of a sampling approach.
> The data that LWLOCK_STATS produce are downright misleading -- they
> lead you to think that the bottlenecks are in different places than
> they really are, because the locks that produce the most waiting can
> be 5th or 10th in terms of the number of wait events.
I understood you're saying that the number of waits alone does not necessarily indicate the bottleneck, because a wait with fewer counts but longer time can take a large portion of the entire SQL execution time. So, wait time is also useful. I think that's why Oracle describes and MySQL provides precise count and time without sampling.
From: Pavel Stehule [mailto:pavel.stehule@gmail.com] > the cumulated lock statistics maybe doesn't help with debugging - but it > is very good indicator of database (in production usage) health. I think it will help both. But I don't think the sampling won't be as helpful as the precise lock statistics accumulation,because the sampling doesn't give us exactly how effective our improvements to PostgreSQL code are. I rememberPG developers used LOCK_STATS to see how many (or ratio of) lwlock waits decreased by applying patches. We can use the cumulated lock stats like: 1. SELECT * FROM pg_session_waits; 2. Run a benchmark. 3. SELECT * FROM pg_session_waits; 4. Calculate the difference between 1 and 3. Or, reset the wait stats before the benchmark run and just use the stats as-is. I'd like to know why you thought the cumulated wait stats isn't helpful for debugging. Regards Takayuki Tsunakawa
Hi, On Tue, Jan 15, 2019 at 1:14 AM, Tsunakawa, Takayuki wrote: [ ... absent for a long time ] I read the discussions of this thread. If we want wait event info, we can currently do sampling from pg_stat_activity and get pseudo wait event total duration. (I understand wait event sampling does *not* represent wait event count but wait event duration.) It is said that, in most cases, sampling is enough to find the cause of issue. On the other hand, it is said that samplingmissed wait events that don't spent much time in the process. Can we say we don't need such wait events that seemsapparently not important? Also we can know the pseudo total duration from the sampling though, we can't know the how many times wait events occurredand how long wait events take time in the average each time. I think it would be better to also know wait event averagetime to investigate the cause of the issue. For example, counts, total duration and average times of each wait events after executing pgbench, which is gotten from theserver applied my patch, is like below. transaction type: <builtin: TPC-B (sort of)> scaling factor: 100 query mode: simple number of clients: 8 number of threads: 8 duration: 60 s number of transactions actually processed: 132259 latency average = 3.630 ms tps = 2203.909727 (including connections establishing) tps = 2204.248691 (excluding connections establishing) wait_event_type | wait_event | calls | times | times/calls -----------------+----------------------+--------+-----------+------------ Client | ClientRead | 925732 | 115270950 | 124.52 IO | DataFileRead | 220357 | 3219893 | 14.61 LWLock | WALWriteLock | 195560 | 192684987 | 985.30 IO | DataFileWrite | 156208 | 4440091 | 28.42 IO | WALWrite | 51035 | 2925808 | 57.33 IO | WALSync | 50812 | 49626240 | 976.66 Lock | transactionid | 6179 | 9905727 | 1603.13 LWLock | buffer_content | 5453 | 295123 | 54.12 IO | DataFileExtend | 2367 | 90430 | 38.20 LWLock | wal_insert | 2053 | 33620 | 16.38 LWLock | WALBufMappingLock | 1301 | 23761 | 18.26 LWLock | CLogControlLock | 514 | 3959 | 7.70 LWLock | buffer_mapping | 162 | 1313 | 8.10 LWLock | XidGenLock | 153 | 2399 | 15.68 LWLock | lock_manager | 122 | 5051 | 41.40 Lock | tuple | 105 | 114632 | 1091.73 LWLock | ProcArrayLock | 99 | 1828 | 18.46 LWLock | CheckpointerCommLock | 33 | 854 | 25.88 Lock | extend | 24 | 20493 | 853.88 IO | RelationMapRead | 22 | 71 | 3.23 LWLock | buffer_io | 11 | 7398 | 672.55 IO | SLRUWrite | 4 | 249 | 62.25 IO | SLRUSync | 4 | 0 | 0.00 IPC | ClogGroupUpdate | 3 | 223 | 74.33 LWLock | SyncRepLock | 1 | 1 | 1.00 We can see that WALWriteLock has the longest total duration(times), and ClientRead is the next one. We might get this informationfrom the sampling. We can also see that WALWriteLock longest total duration is because of its high average time,while ClientRead total duration is because of its many counts. We can also know that transactionid, tuple, WALWriteLock,WALSync, extend have the longer average time than the others. If we want to improve the performance even forjust a bit, such info is useful I think. The patch which is proposed in this thread was not accepted because it loads overhead on the server/backend. The function that measures the wait event time, gettimeofday and clock_gettime, takes time to execute and it is on the frequentlycalled path(pgstat_report_wait_start/end), it is supposed a lot of overhead will be introduced. Even if we only count the wait event, there would be overhead, which is pointed out in the another thread.[1] The overhead which is induced by getting wait event info was discussed from old times, but I couldn't find the actual measuringresults, so I want to measure its overhead. [1] https://www.postgresql.org/message-id/CA%2BTgmobf1NJD%2B_DfQG5qccG5YFSnxk3CgC2mh0-UHabznCQtYA%40mail.gmail.com -- Yoshikazu Imai
On Wed, Oct 30, 2019 at 5:51 AM, imai.yoshikazu@fujitsu.com wrote: > The overhead which is induced by getting wait event info was discussed from old times, but I couldn't find the actual > measuring results, so I want to measure its overhead. And here is the patch which counts the wait event and measuring the wait event time. It is currently like POC and has severalthings to be improved. You can get wait event info by executing "select * from pg_stat_waitaccum;" and reset its counters by "select pg_stat_reset_shared('waitaccum');". I tried to reduce the overhead of counting the wait event. It is difficult to reduce the overhead of measuring wait eventstimes, I made measuring time is configurable like track_io_timing. In the other DB, they use more light function than gettimeofday or clock_gettime and it is the CPU cycle counter, rdtsc. So I also created the patch which uses rdtsc for measuring wait events times. There are some investigations for rdtsc before[1]. If we want to use rdtsc, we need more investigation for which platformcan use it or how to prevent time go backwards in the rdtsc and so on. Here, I wanted to see its overhead, so I didn't care such things in this patch. I measured its performance with pgbench on a VM machine which has 4 core CPU and 8 GB Mem. I tested against below four cases. master: master (2fc2a88) only counts: 0001 patched, only counts wait events (with track_wait_timing is off) counts/time: 0001 patched, counts wait event and measure its time by gettimeofday or clock_gettime (with track_wait_timing is on) counts/time(rdtsc): 0001 + 0002 patched, counts wait event and measure its time by rdtsc (with track_wait_timing is on) I executed the below pgbench scripts, initializing database with scale 1 or 100 and executing pgbench with standard modeor select-only mode. [standard mode] for i in `seq 1 15` do pgbench -i -s (1 or 100) -q pgbench -c 8 -j 8 -n -T 60 done [select only mode] pgbench -i -s (1 or 100) -q for i in `seq 1 10` do pgbench -c 8 -j 8 -n -S -T 60 done The result was below. [standard, scale 1] version | TPS | diff(%) master | 813.82 | 0 only counts | 797.48 | 2.01 counts/time | 833.16 | -2.38 counts/time(rdtsc) | 876.29 | -7.68 [standard, scale 100] version | TPS | diff(%) master | 2170.34 | 0 only counts | 2125.37 | 2.07 counts/time | 2147.8 | 1.04 counts/time(rdtsc) | 2187.37 | -0.785 [select-only, scale 1] version | TPS | diff(%) master | 28487.6 | 0 only counts | 28481.1 | 0.023 counts/time | 28364.7 | 0.431 counts/time(rdtsc) | 28462.6 | 0.088 [select-only, scale 100] version | TPS | diff(%) master | 25767.89 | 0 only counts | 26068.65 | -1.167 counts/time | 25567.69 | 0.777 counts/time(rdtsc) | 25525.26 | 0.942 An example of wait event info after executing pgbench was below. [standard, scale 100] number of transactions actually processed: 129844 latency average = 3.697 ms tps = 2163.667427 (including connections establishing) tps = 2163.918398 (excluding connections establishing) wait_event_type | wait_event | calls | times -----------------+----------------------+--------+----------- Client | ClientRead | 908807 | 114473878 IO | DataFileRead | 216025 | 2867211 LWLock | WALWriteLock | 191977 | 195192237 IO | DataFileWrite | 154540 | 3406232 IO | WALWrite | 49932 | 2728543 IO | WALSync | 49737 | 49649308 Lock | transactionid | 6209 | 8999545 LWLock | buffer_content | 5337 | 288951 IO | DataFileExtend | 2346 | 90375 LWLock | wal_insert | 2013 | 25141 LWLock | WALBufMappingLock | 630 | 14680 LWLock | CLogControlLock | 454 | 2414 LWLock | buffer_mapping | 170 | 852 LWLock | XidGenLock | 146 | 3268 LWLock | lock_manager | 141 | 5209 Lock | tuple | 112 | 120163 LWLock | ProcArrayLock | 97 | 495 Lock | extend | 42 | 26875 IO | RelationMapRead | 22 | 57 LWLock | CheckpointerCommLock | 18 | 1217 IO | DataFilePrefetch | 18 | 24 IPC | ClogGroupUpdate | 9 | 1867 LWLock | SyncRepLock | 3 | 116 IO | DataFileTruncate | 1 | 116 [select-only, scale 1] number of transactions actually processed: 1682642 latency average = 0.285 ms tps = 28043.407989 (including connections establishing) tps = 28048.158085 (excluding connections establishing) wait_event_type | wait_event | calls | times -----------------+-----------------+---------+----------- Client | ClientRead | 1682661 | 287999638 IO | RelationMapRead | 22 | 54 LWLock | lock_manager | 2 | 2087 [select-only, scale 100] number of transactions actually processed: 1513536 latency average = 0.317 ms tps = 25223.558569 (including connections establishing) tps = 25228.820644 (excluding connections establishing) wait_event_type | wait_event | calls | times -----------------+-----------------+---------+----------- IO | DataFileRead | 2524682 | 14579531 Client | ClientRead | 1513558 | 283968554 LWLock | buffer_mapping | 1225 | 6392 IO | RelationMapRead | 22 | 46 LWLock | buffer_io | 11 | 876 LWLock | lock_manager | 6 | 507 I wanted to measure and compare the performance where wait_event occurs many times, but unfortunately, TPS was not stableand differed in each standard test that wait_event occurs many times that I couldn't get consistent performance tendencybetween each version. I need to find more suitable test for clarifying its performance or if there are any good tests, please let me know. Also,any tests are very welcome. -- Yoshikazu Imai
Attachment
From: Pavel Stehule [mailto:pavel.stehule@gmail.com]
> the cumulated lock statistics maybe doesn't help with debugging - but it
> is very good indicator of database (in production usage) health.
I think it will help both. But I don't think the sampling won't be as helpful as the precise lock statistics accumulation, because the sampling doesn't give us exactly how effective our improvements to PostgreSQL code are. I remember PG developers used LOCK_STATS to see how many (or ratio of) lwlock waits decreased by applying patches.
We can use the cumulated lock stats like:
1. SELECT * FROM pg_session_waits;
2. Run a benchmark.
3. SELECT * FROM pg_session_waits;
4. Calculate the difference between 1 and 3.
Or, reset the wait stats before the benchmark run and just use the stats as-is.
I'd like to know why you thought the cumulated wait stats isn't helpful for debugging.
Regards
Takayuki Tsunakawa
On Wed, Oct 30, 2019 at 05:55:28AM +0000, imai.yoshikazu@fujitsu.com wrote: > And here is the patch which counts the wait event and measuring the wait event time. It is currently like POC and has severalthings to be improved. Please note the patch tester complains about the latest patch: pgstatfuncs.c: In function ‘pg_stat_get_waitaccum’: pgstatfuncs.c:2018:3: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement] Datum values[PG_STAT_GET_WAITACCUM_COLS]; I am moving it to next CF, marking it as waiting on author. -- Michael
Attachment
On Sun, Dec 1, 2019 at 1:10 AM, Michael Paquier wrote: > On Wed, Oct 30, 2019 at 05:55:28AM +0000, imai(dot)yoshikazu(at)fujitsu(dot)com wrote: > > And here is the patch which counts the wait event and measuring the wait event time. It is currently like POC and hasseveral things to be improved. > > Please note the patch tester complains about the latest patch: > pgstatfuncs.c: In function ‘pg_stat_get_waitaccum’: > pgstatfuncs.c:2018:3: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement] > Datum values[PG_STAT_GET_WAITACCUM_COLS]; > > I am moving it to next CF, marking it as waiting on author. Sorry for late reply. Unfortunately, I couldn't get your original mail even if I used "Resend email" on pgsql-hackers archive, that I reply to my old mail. I attach the patch I solved the warning. -- Yoshikazu Imai
Attachment
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation: tested, passed I like this patch, because I used similar functionality some years ago very successfully. The implementation is almost simple,and the result should be valid by used method. The potential problem is performance impact. Very early test show impact cca 3% worst case, but I'll try to repeat thesetests. There are some ending whitespaces and useless tabs. The new status of this patch is: Waiting on Author
On 2020/01/13 4:11, Pavel Stehule wrote: > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: not tested > Documentation: tested, passed > > I like this patch, because I used similar functionality some years ago very successfully. The implementation is almostsimple, and the result should be valid by used method. Thanks for your review! > The potential problem is performance impact. Very early test show impact cca 3% worst case, but I'll try to repeat thesetests. Yes, performance impact is the main concern. I want to know how it affects performance in various test cases or on various environments. > There are some ending whitespaces and useless tabs. > > The new status of this patch is: Waiting on Author I attach v4 patches removing those extra whitespaces of the end of lines and useless tabs. -- Yoshikazu Imai
Attachment
On 2020/01/13 4:11, Pavel Stehule wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world: tested, passed
> Implements feature: tested, passed
> Spec compliant: not tested
> Documentation: tested, passed
>
> I like this patch, because I used similar functionality some years ago very successfully. The implementation is almost simple, and the result should be valid by used method.
Thanks for your review!
> The potential problem is performance impact. Very early test show impact cca 3% worst case, but I'll try to repeat these tests.
Yes, performance impact is the main concern. I want to know how it
affects performance in various test cases or on various environments.
> There are some ending whitespaces and useless tabs.
>
> The new status of this patch is: Waiting on Author
I attach v4 patches removing those extra whitespaces of the end of lines
and useless tabs.
PGBENCH="/usr/local/pgsql/bin/pgbench"
export PGDATABASE=postgres
echo "******* START *******" > ~/result.txt
for i in 1 5 10 50 100
do
echo "scale factor $i" >> ~/result.txt
$PSQL -c "create database bench$i"
$PGBENCH -i -s $i "bench$i"
for c in 1 5 10 50
do
$PGBENCH -c $c -T 300 "bench$i" >> ~/result.txt
done
$PSQL -c "vacuum full" "bench$i"
$PSQL -c "vacuum analyze" "bench$i"
for c in 1 5 10 50
do
$PGBENCH -S -c $c -T 300 "bench$i" >> ~/result.txt
done
$PSQL -c "drop database bench$i"
done
-----------------+-----------------------+------------+--------------
Client | ClientRead | 1489681408 | 221616362961
Lock | transactionid | 103113369 | 71918794185
LWLock | WALWriteLock | 104781468 | 20865855903
Lock | tuple | 21323744 | 15800875242
IO | WALSync | 50862170 | 8666988491
LWLock | lock_manager | 18415423 | 575308266
IO | WALWrite | 51482764 | 205775892
LWLock | buffer_content | 15385387 | 168446128
LWLock | wal_insert | 1502092 | 90019731
IPC | ProcArrayGroupUpdate | 178238 | 46527364
LWLock | ProcArrayLock | 587356 | 13298246
IO | DataFileExtend | 2715557 | 11615216
IPC | ClogGroupUpdate | 54319 | 10622013
IO | DataFileRead | 5805298 | 9596545
IO | SLRURead | 9518930 | 7166217
LWLock | CLogControlLock | 746759 | 6783602
--
Yoshikazu Imai
Attachment
This patch was in WoA, but that was wrong I think - we got a patch on January 15, followed by a benchmark by Pavel Stehule, so I think it should still be in "needs review". So I've updated it and moved it to the next CF. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
This patch was in WoA, but that was wrong I think - we got a patch on
January 15, followed by a benchmark by Pavel Stehule, so I think it
should still be in "needs review". So I've updated it and moved it to
the next CF.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Feb 1, 2020 at 5:50 AM, Pavel Stehule wrote: > today I run 120 5minutes pgbench tests to measure impact of this patch. Result is attached. ... > Thanks to Tomas Vondra and 2ndq for hw for testing Thank you for doing a lot of these benchmarks! > The result is interesting - when I run pgbench in R/W mode, then I got +/- 1% changes in performance. Isn't important if > tracking time is active or not (tested on Linux). In this mode the new code is not on critical path. It seems performance difference is big in case of read only tests. The reason is that write time is relatively longer thanthe processing time of the logic I added in the patch. > Looks so for higher scale than 5 and higher number of users 50 the results are not too much stable (for read only tests- I > repeated tests) and there overhead is about 10% from 60K tps to 55Ktps - maybe I hit a hw limits (it running with 4CPU) Yes, I suspect some other bottlenecks may be happened and it causes the results unstable. However, it may be better to investigate what is actually happened and why performance is increased/decreased for over 10%. I will inspect it. Also I attach v5 patches which corresponds to other committed patches. -- Yoshikazu Imai
Attachment
On Wed, 12 Feb 2020 at 12:36, imai.yoshikazu@fujitsu.com <imai.yoshikazu@fujitsu.com> wrote: > It seems performance difference is big in case of read only tests. The reason is that write time is relatively longer thanthe > processing time of the logic I added in the patch. That's going to be a pretty difficult performance hit to justify. Can we buffer collected wait events locally and spit the buffer to the stats collector at convenient moments? We can use a limited buffer size with an overflow flag, so we degrade the results rather than falling over or forcing excessive stats reporting at inappropriate times. I suggest that this is also a good opportunity to add some more tracepoints to PostgreSQL. The wait events facilities are not very traceable right now. Exposing some better TRACE_POSTGRESQL_ tracepoints (SDTs) via probes.d would help users collect better information using external tools like perf, bpftrace and systemtap. That way we have a zero-overhead-when-unused option that can also be used to aggregate the information per-query, per-user, etc. (I really need to add a bunch more tracepoints to make this easier...) -- Craig Ringer http://www.2ndQuadrant.com/ 2ndQuadrant - PostgreSQL Solutions for the Enterprise
On Wed, Feb 12, 2020 at 5:42 AM, Craig Ringer wrote: > > It seems performance difference is big in case of read only tests. The reason is that write time is relatively longerthan the > > processing time of the logic I added in the patch. > > That's going to be a pretty difficult performance hit to justify. > > Can we buffer collected wait events locally and spit the buffer to the > stats collector at convenient moments? We can use a limited buffer > size with an overflow flag, so we degrade the results rather than > falling over or forcing excessive stats reporting at inappropriate times. IIUC, currently each backend collects wait events locally. When each backend goes to idle (telling the frontend that it is ready-for-query), it reports wait event statistics to the stats collector. The interval of each report is over than PGSTAT_STAT_INTERVAL(default 500ms). Also when each backend exits, it also does report. So if we do the read only test with 50 clients, each 50 backend reports wait events statistics to the stats collector for almost every 500ms. If that causes performance degradation, we can improve performance by letting backends to report its statistics, for example, only at the backend's exit. (I think I can easily test this by building postgres with setting PGSTAT_STAT_INTERVAL to a large value >> 500ms.) > I suggest that this is also a good opportunity to add some more > tracepoints to PostgreSQL. The wait events facilities are not very > traceable right now. Does that mean we will add TRACE_POSTGRESQL_ to every before/after pgstat_report_wait_start? > That way we have a zero-overhead-when-unused option that can also be > used to aggregate the information per-query, per-user, etc. I see. In that way, we can accomplish no overhead when DTrace is not enabled and what we can measure is more customizable. It is also curious how will overhead be if we implement wait events statistics on DTrace scripts though I can't imagine how it will be because I haven't used DTrace. -- Yoshikazu Imai
On Fri, Feb 14, 2020 at 11:59 AM, 王胜利 wrote: > I am glad to know you are working on PG accumulated statistics feature, and I am interested on it. > I see these two patch file you made, can you let me know which branch of PG code based? > > when I use this: https://github.com/postgres/postgres/commits/master, and apply these patches, report some error. Thanks for Wang's mail, I noticed my 0002 patch was wrong from v3. Here, I attach correct patches. Also I will begin to do some benchmark with higher scale and higher number of users and try to change stats reporting implementation to not affect performance, which I couldn't have not started because of another tasks. -- Yoshikazu Imai
Attachment
Best regards,
Victor wang
签名由 网易邮箱大师 定制
On Fri, Feb 14, 2020 at 11:59 AM, 王胜利 wrote:On 02/25/2020 15:53, imai.yoshikazu@fujitsu.com wrote:
> I am glad to know you are working on PG accumulated statistics feature, and I am interested on it.
> I see these two patch file you made, can you let me know which branch of PG code based?
>
> when I use this: https://github.com/postgres/postgres/commits/master, and apply these patches, report some error.
Thanks for Wang's mail, I noticed my 0002 patch was wrong from v3.
Here, I attach correct patches.
Also I will begin to do some benchmark with higher scale and higher number of
users and try to change stats reporting implementation to not affect
performance, which I couldn't have not started because of another tasks.
--
Yoshikazu Imai
Hello. I had a brief look on this and have some comments on this. At Tue, 25 Feb 2020 07:53:26 +0000, "imai.yoshikazu@fujitsu.com" <imai.yoshikazu@fujitsu.com> wrote in > Thanks for Wang's mail, I noticed my 0002 patch was wrong from v3. > > Here, I attach correct patches. > > Also I will begin to do some benchmark with higher scale and higher number of > users and try to change stats reporting implementation to not affect > performance, which I couldn't have not started because of another tasks. It uses its own hash implement. Aside from the appropriateness of having another implement of existing tool, in the first place hash works well for wide, sparse and uncertain set of keys. But they are in rather a dense and narrow set with certain and fixed members. It registers more than 200 entries but bucket size is 461. It would be needed to avoid colliisions, but seems a bit too wasting. It seems trying to avoid doing needless work when the feature is not activated by checking "if (wa_hash==NULL)", but the hash is created being filled with possible entries regardless of whether track_wait_timing is on or off. As the result pgstat_report_waitaccum_end calls pgstat_get_wa_entry tremendously frequently. This should be the reason for the recent benchmark result. I'm not sure such frequency of hash-searching is acceptable even if the feature is turned on. I think we need a smarter mapping scheme of events to entries. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Feb 26, 2020 at 1:39 AM, Kyotaro Horiguchi wrote: > Hello. I had a brief look on this and have some comments on this. Hi, Horiguchi-san. Thank you for looking at this! > It uses its own hash implement. Aside from the appropriateness of > having another implement of existing tool, in the first place hash > works well for wide, sparse and uncertain set of keys. But they are in > rather a dense and narrow set with certain and fixed members. It > registers more than 200 entries but bucket size is 461. It would be > needed to avoid colliisions, but seems a bit too wasting. Yes, wait events are grouped and wait events ID are defined as a sequential number, starting from specified number for each group(like 0x01000000U), thus keys are in a dense and narrow set. ===== #define PG_WAIT_LWLOCK 0x01000000U #define PG_WAIT_LOCK 0x03000000U #define PG_WAIT_BUFFER_PIN 0x04000000U #define PG_WAIT_ACTIVITY 0x05000000U ... typedef enum { WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY, WAIT_EVENT_AUTOVACUUM_MAIN, ... WAIT_EVENT_WAL_WRITER_MAIN } WaitEventActivity; ===== The number 461 is the lowest number which avoids collisions in the hash for current wait events set. As you pointed out, there are a bit too much unused entries. If we prepare arrays for each wait classes with appropriate size which just fits the number of each wait event class, we can store wait event statistics into those arrays with no more wastes. Also we calculate hash number by "(wait event ID) % (bucket size)" in hash case, while we calculate index of arrays by "(wait event ID) - (wait event class id)" in array case. The latter one's cost is lower than the former one. Currently I implement wait event statistics store by hash because of its easiness of implementation, but I think it is good to implement it by array for above reasons. One concern is that we put restrictions on wait events that it must be defined as it is sequenced in its wait class so that there are no unused entries in the array. > It seems trying to avoid doing needless work when the feature is not > activated by checking "if (wa_hash==NULL)", but the hash is created > being filled with possible entries regardless of whether > track_wait_timing is on or off. This might be bad implementation but there are the cases "wa_hash = NULL" where pgstat_init() is not called like when executing "initdb". I insert that check for avoiding unexpected crash in those cases. I also noticed debug codes exist around that code...I will modify it. > As the result > pgstat_report_waitaccum_end calls pgstat_get_wa_entry tremendously > frequently. This should be the reason for the recent benchmark result. > I'm not sure such frequency of hash-searching is acceptable even if > the feature is turned on. track_wait_timing parameter determines whether we collect wait time. Regardless of that parameter, we always collect wait count every wait event happens. I think calling pgstat_get_wa_entry frequently is not critical to performance. From pavel's benchmark results, if track_wait_timing parameter is off, there are +/-1.0% performance difference between patched and unpatched and it is just considered as a measurement error. Thanks -- Yoshikazu Imai
I feel your 'pg_stat_waitaccum' will help us investigate the bottleneck.
So I'd like to do some benchmarks but unfortunately, the latest v6 patch
Is it possible to share the latest patches?
If not, I'll make v6 applicable to the HEAD.
Regards,
--
Atsushi Torikoshi
On Wed, Feb 26, 2020 at 1:39 AM, Kyotaro Horiguchi wrote:
> Hello. I had a brief look on this and have some comments on this.
Hi, Horiguchi-san. Thank you for looking at this!
> It uses its own hash implement. Aside from the appropriateness of
> having another implement of existing tool, in the first place hash
> works well for wide, sparse and uncertain set of keys. But they are in
> rather a dense and narrow set with certain and fixed members. It
> registers more than 200 entries but bucket size is 461. It would be
> needed to avoid colliisions, but seems a bit too wasting.
Yes, wait events are grouped and wait events ID are defined as a sequential
number, starting from specified number for each group(like 0x01000000U), thus
keys are in a dense and narrow set.
=====
#define PG_WAIT_LWLOCK 0x01000000U
#define PG_WAIT_LOCK 0x03000000U
#define PG_WAIT_BUFFER_PIN 0x04000000U
#define PG_WAIT_ACTIVITY 0x05000000U
...
typedef enum
{
WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
WAIT_EVENT_AUTOVACUUM_MAIN,
...
WAIT_EVENT_WAL_WRITER_MAIN
} WaitEventActivity;
=====
The number 461 is the lowest number which avoids collisions in the hash for
current wait events set. As you pointed out, there are a bit too much unused
entries.
If we prepare arrays for each wait classes with appropriate size which just
fits the number of each wait event class, we can store wait event statistics
into those arrays with no more wastes.
Also we calculate hash number by "(wait event ID) % (bucket size)" in hash
case, while we calculate index of arrays by "(wait event ID) - (wait event class id)"
in array case. The latter one's cost is lower than the former one.
Currently I implement wait event statistics store by hash because of its
easiness of implementation, but I think it is good to implement it by array
for above reasons. One concern is that we put restrictions on wait events
that it must be defined as it is sequenced in its wait class so that there
are no unused entries in the array.
> It seems trying to avoid doing needless work when the feature is not
> activated by checking "if (wa_hash==NULL)", but the hash is created
> being filled with possible entries regardless of whether
> track_wait_timing is on or off.
This might be bad implementation but there are the cases "wa_hash = NULL"
where pgstat_init() is not called like when executing "initdb". I insert
that check for avoiding unexpected crash in those cases.
I also noticed debug codes exist around that code...I will modify it.
> As the result
> pgstat_report_waitaccum_end calls pgstat_get_wa_entry tremendously
> frequently. This should be the reason for the recent benchmark result.
> I'm not sure such frequency of hash-searching is acceptable even if
> the feature is turned on.
track_wait_timing parameter determines whether we collect wait time.
Regardless of that parameter, we always collect wait count every wait event
happens. I think calling pgstat_get_wa_entry frequently is not critical to
performance. From pavel's benchmark results, if track_wait_timing parameter
is off, there are +/-1.0% performance difference between patched and unpatched
and it is just considered as a measurement error.
Thanks
--
Yoshikazu Imai
Hi, This patch fails to apply to HEAD, please submit a rebased version. I've marked this as as Waiting on Author. cheers ./daniel
> This patch fails to apply to HEAD, please submit a rebased version. I've > marked this as as Waiting on Author. Sorry for my absence. Unfortunately I couldn't have time to work on this patch in this cf. I believe I will be back in next cf, work on this patch and also review other patches. --- Yoshikazu IMAI
> On 31 Jul 2020, at 07:23, imai.yoshikazu@fujitsu.com wrote: > >> This patch fails to apply to HEAD, please submit a rebased version. I've >> marked this as as Waiting on Author. > > Sorry for my absence. Unfortunately I couldn't have time to work on this patch in this cf. > I believe I will be back in next cf, work on this patch and also review other patches. No worries, it happens. Since the thread has stalled and there is no updated patch I've marked this entry Returned with Feedback. Please feel free to re-open a new CF entry if you return to this patch. cheers ./daniel
Hi All, I faced a few times a situation where a long running query is actually including the time the backend is waiting for the frontend to fetch all the rows (see [1] for details). See a sample code fe-time.c and its comments in attachment to reproduce this behavior. There's no simple way today to pinpoint the problem in production without advance interactive auditing, and/or using system tools. After studying the problem, I believe it boil down to track the wait event ClientWrite, so I ended up on this thread. You might catch some mismatching times thanks to auto_explain as well. Using the fe-time.c demo code with the following command: PGDATABASE=postgres PGHOST=::1 time ./fe-time 100 The frontend time is 10s, the query time reported is 3228.631ms, but last row has been produced after 20.672ms: LOG: duration: 3228.631 ms plan: Query Text: SELECT * FROM pgbench_accounts Seq Scan on pgbench_accounts (time=0.005..20.672 rows=100000 loops=1) (Note that in contrast with localhost, through the unix socket, the backend reported query time is always really close to 10s). I re-based the existing patch (see in attachment), to look at the ClientWrite for this exact query: # SELECT wait_event, calls, times FROM pg_stat_get_waitaccum(NULL) WHERE wait_event = 'ClientWrite'; wait_event | calls | times -------------+-------+--------- ClientWrite | 4 | 3132266 The "time" is expressed as µs in the patch, so 3132.266ms of the total 3228.631ms is spent sending the result to the frontend. I'm not sure where are the missing 75ms. The pg_wait_sampling extension might help but it requires a production restart to install, then enable it. Whatever if the solution is sampling or cumulative, an in-core and hot-switchable solution would be much more convenient. But anyway, looking at pg_wait_sampling, we have a clear suspect as well for the later query run: # SELECT event, count FROM pg_wait_sampling_profile WHERE queryid=4045741516911800313; event | count -------------+------- ClientWrite | 309 The default profil period of pg_wait_sampling being 10ms, we can roughly estimate the ClientWrite around 3090ms. Note that this is close enough because we know 3132266µs has been accumulated among only 4 large wait events. Finishing bellow. On Mon, 3 Aug 2020 00:00:40 +0200 Daniel Gustafsson <daniel@yesql.se> wrote: > > On 31 Jul 2020, at 07:23, imai.yoshikazu@fujitsu.com wrote: > > > >> This patch fails to apply to HEAD, please submit a rebased version. I've > >> marked this as as Waiting on Author. Please, find in attachment a rebase of both patches. I did some small editing on the way. I didn't bench them. I'm not sure this patch is the best approach though. Receive it as a motivation to keep up with this discussion. As I wrote, whatever if the solution is sampling or cumulative, an in-core and hot-switchable solution would be much more convenient. The fact is that this patch was already available and ready to keep up with a discussion. Collecting and summing all wait events from all backends in the same place forbid to track precisely wait events from a specific backends. Especially on a busy system where numbers can quickly be buried by all other activities around. I wonder if wait events should only be accumulated on backend side, making possible to enable/disable them on the fly and to collect some reports eg. in logs or to output. Most of the code from these patch could be recycled in a simpler patch implementing this. Thoughts? > > Sorry for my absence. Unfortunately I couldn't have time to work on this > > patch in this cf. I believe I will be back in next cf, work on this patch > > and also review other patches. > > No worries, it happens. Since the thread has stalled and there is no updated > patch I've marked this entry Returned with Feedback. Please feel free to > re-open a new CF entry if you return to this patch. I volunteer to be a reviewer on this patch. Imai-san, do you agree to add it as new CF entry? Regards, [1]: Last time I had such situation was few weeks ago. A customer was reporting a query being randomly slow, running bellow 100ms most of the time and sometime hitting 28s. Long story short, the number of row was the same (10-15k), but the result set size was 9x bigger (1MB vs 9MB). As the same query was running fine from psql, I suspected the frontend was somehow saturated. Tcpdump helped me to compute that the throughput fall to 256kB/s after the first 2MB of data transfert with a very narrow TCP window. I explained to the customer their app probably doesn't pull the rows fast enough and that some buffers were probably saturated on the frontend side, waiting for the app and slowing down the whole transfert. Devels fixed the problem by moving away two fields transformations (unaccent) from their loop fetching the rows.
Attachment
HHi, On 2021-06-05 00:53:44 +0200, Jehan-Guillaume de Rorthais wrote: > From 88c2779679c5c9625ca5348eec0543daab5ccab4 Mon Sep 17 00:00:00 2001 > From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com> > Date: Tue, 1 Jun 2021 13:25:57 +0200 > Subject: [PATCH 1/2] Add pg_stat_waitaccum view. > > pg_stat_waitaccum shows counts and duration of each wait events. > Each backend/backgrounds counts and measures the time of wait event > in every pgstat_report_wait_start and pgstat_report_wait_end. They > store those info into their local variables and send to Statistics > Collector. We can get those info via Statistics Collector. > > For reducing overhead, I implemented statistic hash instead of > dynamic hash. I also implemented track_wait_timing which > determines wait event duration is collected or not. I object to adding this overhead. The whole selling point for wait events was that they are low overhead. I since spent time reducing the overhead further, because even just the branches for checking if track_activity is enabled are measurable (225a22b19ed). > From ddb1adc5cd9acc9bc9de16d0cf057124b09fe1e3 Mon Sep 17 00:00:00 2001 > From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com> > Date: Fri, 4 Jun 2021 18:14:51 +0200 > Subject: [PATCH 2/2] [POC] Change measuring method of wait event time from > INSTR_TIME to rdtsc. > > This patch changes measuring method of wait event time from INSTR_TIME (which > uses gettimeofday or clock_gettime) to rdtsc. This might reduce the overhead > of measuring overhead. > > Any supports like changing clock cycle to actual time or error handling are > not currently implemented. rdtsc is a serializing (*) instruction - that's the expensive part. On linux clock_gettime() doesn't actually need a syscall. While the vdso call implies a bit of overhead over a raw rdtsc, it's a relatively small part of the overhead. See https://www.postgresql.org/message-id/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de Greetings, Andres Freund (*) it's not fully serializing, iirc it allows later instructions to be started, but it does wait for all earlier in-flight instructions to finish.
Hi Andres, Hi all, First, thank you for your feedback! Please find in attachment a patch implementing accumulated wait event stats only from the backend point of view. As I wrote when I reviewed and rebased the existing patch, I was uncomfortable with the global approach. I still volunteer to work/review the original approach is required. See bellow for comments and some more explanations about what I think might be improvements over the previous patch. On Fri, 11 Jun 2021 12:18:07 -0700 Andres Freund <andres@anarazel.de> wrote: > On 2021-06-05 00:53:44 +0200, Jehan-Guillaume de Rorthais wrote: > > From 88c2779679c5c9625ca5348eec0543daab5ccab4 Mon Sep 17 00:00:00 2001 > > From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com> > > Date: Tue, 1 Jun 2021 13:25:57 +0200 > > Subject: [PATCH 1/2] Add pg_stat_waitaccum view. > > > > pg_stat_waitaccum shows counts and duration of each wait events. > > Each backend/backgrounds counts and measures the time of wait event > > in every pgstat_report_wait_start and pgstat_report_wait_end. They > > store those info into their local variables and send to Statistics > > Collector. We can get those info via Statistics Collector. > > > > For reducing overhead, I implemented statistic hash instead of > > dynamic hash. I also implemented track_wait_timing which > > determines wait event duration is collected or not. > > I object to adding this overhead. The whole selling point for wait > events was that they are low overhead. I since spent time reducing the > overhead further, because even just the branches for checking if > track_activity is enabled are measurable (225a22b19ed). Agree. The previous patch I rebased was to review it and reopen this discussion, I even added a small FIXME in pgstat_report_wait_end and pgstat_report_wait_start about your work: //FIXME: recent patch to speed up this call. In the patch in attachment, I tried to fix this by using kind of an internal hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to "instrument" wait events only when required, on the fly, dynamically. Moreover, I removed the hash structure for a simple static array for faster access. > > From ddb1adc5cd9acc9bc9de16d0cf057124b09fe1e3 Mon Sep 17 00:00:00 2001 > > From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com> > > Date: Fri, 4 Jun 2021 18:14:51 +0200 > > Subject: [PATCH 2/2] [POC] Change measuring method of wait event time from > > INSTR_TIME to rdtsc. > > > > This patch changes measuring method of wait event time from INSTR_TIME > > (which uses gettimeofday or clock_gettime) to rdtsc. This might reduce the > > overhead of measuring overhead. > > > > Any supports like changing clock cycle to actual time or error handling are > > not currently implemented. > > rdtsc is a serializing (*) instruction - that's the expensive part. On linux > clock_gettime() doesn't actually need a syscall. While the vdso call > implies a bit of overhead over a raw rdtsc, it's a relatively small part > of the overhead. See > https://www.postgresql.org/message-id/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de I choose to remove all this rdtsc part from my patch as this wasn't clear how much faster it was compare to simpler vdso functions and how to accurately extract a human time. About my take on $subject, for the sake of simplicity of this PoC, I added instrumentation to log_statement_stats. Despite the query context of the reported log, they are really accumulated stats. The patch updated pg_stat_get_waitaccum() as well to be able to report the accumulated wait events from your interactive or batch session. So using my previous fe-time demo client, you can test it using: PGOPTIONS="--log_statement_stats=on" ./fe-time 100 From logs, I now have (notice the last line): LOG: duration: 3837.194 ms statement: SELECT * FROM pgbench_accounts LOG: QUERY STATISTICS DETAIL: ! system usage stats: ! 0.087444 s user, 0.002106 s system, 3.837202 s elapsed ! [0.087444 s user, 0.003974 s system total] ! 25860 kB max resident size ! 0/0 [0/0] filesystem blocks in/out ! 0/303 [0/697] page faults/reclaims, 0 [0] swaps ! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent ! 4/18 [5/18] voluntary/involuntary context switches ! Client/ClientWrite 4 calls, 3747102 us elapsed Using pgbench scale factor 10, the copy query for pgbench_accounts looks like: LOG: duration: 2388.081 ms statement: copy pgbench_accounts from stdin LOG: QUERY STATISTICS DETAIL: ! system usage stats: ! 1.373756 s user, 0.252860 s system, 2.388100 s elapsed ! [1.397015 s user, 0.264951 s system total] ! 37788 kB max resident size ! 0/641584 [0/642056] filesystem blocks in/out ! 194/4147 [195/4728] page faults/reclaims, 0 [0] swaps ! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent ! 3263/92 [3390/102] voluntary/involuntary context switches ! LWLock/WALBufMapping 3 calls, 654 us elapsed ! LWLock/WALWrite 3 calls, 60680 us elapsed ! LWLock/CheckpointerComm 3 calls, 318 us elapsed ! Client/ClientRead 1 calls, 151 us elapsed ! IO/DataFileExtend 16397 calls, 94000 us elapsed ! IO/DataFileWrite 14346 calls, 45911 us elapsed ! IO/WALInitSync 6 calls, 334840 us elapsed ! IO/WALInitWrite 6 calls, 48040 us elapsed ! IO/WALSync 17 calls, 353334 us elapsed ! IO/WALWrite 8362 calls, 38401 us elapsed Now, I'm on a fence about the user facing interaction. I'll keep thinking about this and report this week. In the meantime, any feedback about the current implementation backbone is welcome. Thanks! Regards,
Attachment
Hi, On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote: > In the patch in attachment, I tried to fix this by using kind of an internal > hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to > "instrument" wait events only when required, on the fly, dynamically. That's *far worse*. You're adding an indirect function call. Which requires loading a global variable and then a far call to a different function. You're changing a path that's ~2 instructions with minimal dependencies (and no branches (i.e. fully out of order executable) to something on the order of ~15 instructions with plenty dependencies and at least two branches (call, ret). I doubt there's a path towards this feature without adding the necessary infrastructure to hot-patch the code - which is obviously quite a substantial project. Greetings, Andres Freund
Hi, On 2021-06-14 11:27:21 -0700, Andres Freund wrote: > On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote: > > In the patch in attachment, I tried to fix this by using kind of an internal > > hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to > > "instrument" wait events only when required, on the fly, dynamically. > > That's *far worse*. You're adding an indirect function call. Which requires > loading a global variable and then a far call to a different function. You're > changing a path that's ~2 instructions with minimal dependencies (and no > branches (i.e. fully out of order executable) to something on the order of ~15 > instructions with plenty dependencies and at least two branches (call, ret). In the case at hand it might even be worse, because the external function call will require registers to be spilled for the function call. Right now wait events "use" two register (one for the wait event, one for my_wait_event_info), but otherwise don't cause additional spilling. With your change you'd see register spill/reload around both wait start and end. Greetings, Andres Freund
Hi, On Mon, 14 Jun 2021 11:27:21 -0700 Andres Freund <andres@anarazel.de> wrote: > On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote: > > In the patch in attachment, I tried to fix this by using kind of an internal > > hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to > > "instrument" wait events only when required, on the fly, dynamically. > > That's *far worse*. You're adding an indirect function call. Which requires > loading a global variable and then a far call to a different function. You're > changing a path that's ~2 instructions with minimal dependencies (and no > branches (i.e. fully out of order executable) to something on the order of ~15 > instructions with plenty dependencies and at least two branches (call, ret). Oh, I didn't realized it would affect all queries, even when log_statement_stats was off. Thank you for your explanation. > I doubt there's a path towards this feature without adding the necessary > infrastructure to hot-patch the code - which is obviously quite a > substantial project. Right. Sadly, this kind of project is far above what I can do. So I suppose it's a dead end for me. I'll study if/how the sampling approach can be done dynamically. Thank you,
Hi, On 2021-06-14 23:20:47 +0200, Jehan-Guillaume de Rorthais wrote: > > On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote: > > > In the patch in attachment, I tried to fix this by using kind of an internal > > > hook for pgstat_report_wait_start and pgstat_report_wait_end. This allows to > > > "instrument" wait events only when required, on the fly, dynamically. > > > > That's *far worse*. You're adding an indirect function call. Which requires > > loading a global variable and then a far call to a different function. You're > > changing a path that's ~2 instructions with minimal dependencies (and no > > branches (i.e. fully out of order executable) to something on the order of ~15 > > instructions with plenty dependencies and at least two branches (call, ret). > > Oh, I didn't realized it would affect all queries, even when log_statement_stats > was off. Thank you for your explanation. Maybe I just am misunderstanding what you were doing? As far as I can tell your patch changed pgstat_report_wait_start() to be an indirect function call - right? Then yes, this adds overhead to everything. You *could* add a pgstat_report_wait_(start|end)_with_time() or such and only use that in places that won't have a high frequency. But I just don't quite see the use-case for that. Greetings, Andres Freund
Hi Andres, On Mon, 14 Jun 2021 15:01:14 -0700 Andres Freund <andres@anarazel.de> wrote: > On 2021-06-14 23:20:47 +0200, Jehan-Guillaume de Rorthais wrote: > > > On 2021-06-14 16:10:32 +0200, Jehan-Guillaume de Rorthais wrote: > > > > In the patch in attachment, I tried to fix this by using kind of an > > > > internal hook for pgstat_report_wait_start and pgstat_report_wait_end. > > > > This allows to "instrument" wait events only when required, on the fly, > > > > dynamically. > > > > > > That's *far worse*. You're adding an indirect function call. Which > > > requires loading a global variable and then a far call to a different > > > function. You're changing a path that's ~2 instructions with minimal > > > dependencies (and no branches (i.e. fully out of order executable) to > > > something on the order of ~15 instructions with plenty dependencies and > > > at least two branches (call, ret). > > > > Oh, I didn't realized it would affect all queries, even when > > log_statement_stats was off. Thank you for your explanation. > > Maybe I just am misunderstanding what you were doing? As far as I can > tell your patch changed pgstat_report_wait_start() to be an indirect > function call - right? Exact. I didn't realized this indirection would be so costy on every single calls, after the variable assignation itself. > Then yes, this adds overhead to everything. I understand now, thank you for the explanation. For my own curiosity and study, I'll remove this indirection and bench my patch anyway. > You *could* add a pgstat_report_wait_(start|end)_with_time() or such and > only use that in places that won't have a high frequency. But I just > don't quite see the use-case for that. Well, it could be useful if we decide to only track a subset of wait event. In my scenario, I originally wanted to only track ClientWrite, but then I realized this might be too specific and tried to generalize. There are probably some other way to deal with this issue. Eg.: * do NOT include the time lost waiting for the frontend side in the query execution time * expose the frontend part of the query time in log_min_duration_stmt, auto_explain, pg_stat_statements, in the same fashion we currently do with planning and execution time * having some wait-even sampling mechanism in core or as easy and hot-loadable than auto_explain Thoughts? Thanks again! Regards,