Thread: RFC: replace pg_stat_activity.waiting with something more descriptive
When a PostgreSQL system wedges, or when it becomes dreadfully slow for some reason, I often find myself relying on tools like strace, gdb, or perf to figure out what is happening. This doesn't tend to instill customers with confidence; they would like (quite understandably) a process that doesn't require installing developer tools on their production systems, doesn't require a developer to interpret the results, and perhaps even something that they could connect up to PEM or Nagios or whatever alerting system they are using.

There are obviously many ways that we might think about improving things here, but what I'd like to do is try to put some better information in pg_stat_activity, so that when a process is not running, users can get some better information about *why* it's not running. The basic idea is that pg_stat_activity.waiting would be replaced by a new column pg_stat_activity.wait_event, which would display the reason why that backend is waiting. This wouldn't be a free-form text field, because that would be too expensive to populate. Instead it would contain a "reason code" chosen from a list of reason codes and translated to text for display. Internally, pgstat_report_waiting() would be changed to take an integer argument rather than a Boolean (possibly uint8 would be enough, certainly uint16 would be), and called from more places. It would continue to use an ordinary store into shared memory, with no atomic ops or locking.

Currently, the only time we report a process as waiting is when it is waiting for a heavyweight lock. I'd like to make that somewhat more fine-grained, by reporting the type of heavyweight lock it's awaiting (relation, relation extension, transaction, etc.). Also, I'd like to report when we're waiting for an lwlock, and report either the specific fixed lwlock for which we are waiting, or else the type of lock (lock manager lock, buffer content lock, etc.) for locks of which there is more than one. I'm less sure about this next part, but I think we might also want to report ourselves as waiting when we are doing an OS read or an OS write, because it's pretty common for people to think that a PostgreSQL bug is to blame when in fact it's the operating system that isn't servicing our I/O requests very quickly. We could also invent codes for things like "I'm doing a pg_usleep because I've exceeded max_spins_per_delay" and "I'm waiting for a cleanup lock on a buffer", and maybe a few others.

I realize that in many cases these states will be quite transient and you won't see them in pg_stat_activity for very long before they vanish; whether you can catch them at all is quite uncertain. It's not my goal here to create some kind of performance counter system, even though that would be valuable and could possibly be based on the same infrastructure, but rather just to create a very simple system that lets people know, without any developer tools, what is causing a backend that has accepted a query and not yet returned a result to be off-CPU rather than on-CPU. In cases where there are many backends, you may be able to see non-NULL results often enough to get a sense of where the problem is; or where there is one backend that is persistently stuck, you will hopefully be able to tell where it's stuck.

Comments?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
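[A minimal, self-contained C sketch of the reporting primitive described above. The enum values, the PgBackendStatus stand-in, and the st_wait_event field name are illustrative assumptions for this sketch only, not code from any posted patch; the real PgBackendStatus has many more fields, and pgstat's uint8 typedef is rendered here as uint8_t to keep the example compilable on its own.]

#include <stdint.h>
#include <stddef.h>

typedef enum WaitEventCode
{
    WAIT_EVENT_NONE = 0,
    WAIT_EVENT_LOCK_RELATION,         /* heavyweight lock: relation */
    WAIT_EVENT_LOCK_RELATION_EXTEND,  /* heavyweight lock: relation extension */
    WAIT_EVENT_LWLOCK_PROCARRAY,      /* fixed lwlock: ProcArrayLock */
    WAIT_EVENT_BUFFER_CLEANUP,        /* cleanup lock on a buffer */
    WAIT_EVENT_SPIN_DELAY,            /* pg_usleep after max_spins_per_delay */
    WAIT_EVENT_OS_READ,               /* blocked in an OS read */
    WAIT_EVENT_OS_WRITE               /* blocked in an OS write */
} WaitEventCode;

typedef struct PgBackendStatus
{
    uint8_t st_wait_event;  /* would replace the current boolean st_waiting */
} PgBackendStatus;

static PgBackendStatus *MyBEEntry;

/* An ordinary store into shared memory: no atomic ops, no locking. */
static inline void
pgstat_report_waiting(uint8_t wait_event)
{
    volatile PgBackendStatus *beentry = MyBEEntry;

    if (beentry == NULL)
        return;
    beentry->st_wait_event = wait_event;
}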
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: "Joshua D. Drake"
On 06/22/2015 10:37 AM, Robert Haas wrote:
> I'm less sure about this next part, but I think we might also want to
> report ourselves as waiting when we are doing an OS read or an OS
> write, because it's pretty common for people to think that a
> PostgreSQL bug is to blame when in fact it's the operating system
> that isn't servicing our I/O requests very quickly. We could also
> invent codes for things like "I'm doing a pg_usleep because I've
> exceeded max_spins_per_delay" and "I'm waiting for a cleanup lock on
> a buffer" and maybe a few others.

This would be a great improvement. Many, many times the problem really has nothing to do with PostgreSQL: it is a relation falling out of cache, swapping, or a process waiting on IO to be allocated to it. If it is possible to have a view within PostgreSQL that allows us to see that, it would be absolutely awesome.

It would be great if we could somehow monitor what the postgresql processes are doing within PostgreSQL. Imagine if we had pgsar...

Sincerely,

jD

--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: "David G. Johnston"
> and doesn't require a developer to interpret the results,

[...]

> We could also invent codes for things like "I'm doing a pg_usleep
> because I've exceeded max_spins_per_delay" and "I'm waiting for a
> cleanup lock on a buffer" and maybe a few others.
In addition to the codes themselves, I think it would aid less-experienced operators if we provided a meta-data categorization of the codes. Something like: I/O Sub-System; Storage Maintenance; Concurrency; etc.

There could be a section in the documentation with these topics as section headings, with a listing and explanation of each of the possible codes described within.

The meta-information is already embedded within the codes/descriptions, but explicitly pulling it out would be, IMO, more user-friendly and would likely also aid in triage and speed-of-recognition when reading the corresponding code/description.
David J.
On Mon, Jun 22, 2015 at 1:59 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
> In addition to the codes themselves I think it would aid less-experienced
> operators if we would provide a meta-data categorization of the codes.
> Something like, I/O Sub-System; Storage Maintenance; Concurrency, etc..
>
> There could be a section in the documentation with these topics as section
> headings and a listing and explanation of each of the possible code would
> then be described within.
>
> The meta-information is already embedded within the code/descriptions but
> explicitly pulling them out would be, IMO, more user-friendly and likely
> also aid in triage and speed-of-recognition when reading the corresponding
> code/description.

I was thinking that the codes would probably be fairly straightforward renderings of the underlying C identifiers, e.g.:

Lock (Relation)
Lock (Relation Extension)
Lock (Page)
...
LWLock (ShmemIndexLock)
LWLock (OidGenLock)
LWLock (XidGenLock)
LWLock (ProcArrayLock)
...
Spin Lock Delay
Buffer Cleanup Lock

We'd then have to figure out how to document that stuff.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: "David G. Johnston"
On Mon, Jun 22, 2015 at 1:59 PM, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> In addition to the codes themselves I think it would aid less-experienced
> operators if we would provide a meta-data categorization of the codes.
> Something like, I/O Sub-System; Storage Maintenance; Concurrency, etc..
>
> There could be a section in the documentation with these topics as section
> headings and a listing and explanation of each of the possible code would
> then be described within.
>
> The meta-information is already embedded within the code/descriptions but
> explicitly pulling them out would be, IMO, more user-friendly and likely
> also aid in triage and speed-of-recognition when reading the corresponding
> code/description.
I was thinking that the codes would probably be fairly straightforward
renderings of the underlying C identifiers, e.g.:
Lock (Relation)
Lock (Relation Extension)
Lock (Page)
...
...
LWLock (ShmemIndexLock)
LWLock (OidGenLock)
LWLock (XidGenLock)
LWLock (ProcArrayLock)
...
...
Spin Lock Delay
Buffer Cleanup Lock
We'd then have to figure out how to document that stuff.
Just tossing stuff at the wall...
CREATE TABLE pg_stat_wait_event_info (
    wait_event integer PRIMARY KEY,
    category text,            -- possibly a FK to a description-holding table too
    event_type text,          -- Lock, LWLock, etc. ... or something higher-level; was pondering
                              -- whether ltree could be brought into core and used here
    event_documentation text  -- asciidoc or something similar
);

Add psql backslash commands:

\dproc [proc_id]  -- runs pg_stat_activity, more or less
\dproc+ proc_id   -- shows the event information w/ description; and ideally info from pg_locks, among other possibilities

That said, the same documentation should be made available online as well - but having this allows tools to make use of the info to put the documentation closer to the user.
David J.
On Mon, Jun 22, 2015 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> When a PostgreSQL system wedges, or when it becomes dreadfully slow
> for some reason, I often find myself relying on tools like strace,
> gdb, or perf to figure out what is happening. This doesn't tend to
> instill customers with confidence; they would like (quite
> understandably) a process that doesn't require installing developer
> tools on their production systems, and doesn't require a developer to
> interpret the results, and perhaps even something that they could
> connect up to PEM or Nagios or whatever alerting system they are
> using.
>
> There are obviously many ways that we might think about improving
> things here, but what I'd like to do is try to put some better
> information in pg_stat_activity, so that when a process is not
> running, users can get some better information about *why* it's not
> running. The basic idea is that pg_stat_activity.waiting would be
> replaced by a new column pg_stat_activity.wait_event, which would
> display the reason why that backend is waiting. This wouldn't be a
> free-form text field, because that would be too expensive to populate.
> Instead it would contain a "reason code" which would be chosen from a
> list of reason codes and translated to text for display.

Instead of changing the column, can't we add a new one? Adjusting columns in PSA requires the innumerable queries written against it to be adjusted, along with all the wiki instructions to dev ops for emergency stuck-query detection, etc. I would also prefer to query 'waiting' in some cases, especially in emergency situations; it's faster to type.

merlin
On Mon, Jun 22, 2015 at 4:40 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Instead of changing the column, can't we add a new one? Adjusting
> columns in PSA requires the innumerable queries written against it to
> be adjusted along with all the wiki instructions to dev ops for
> emergency stuck query detection etc etc. I would also prefer to
> query 'waiting' in some cases, especially when in emergency
> situations; it's faster to type.

If people feel strongly about backward compatibility, yes, we can do that. However, if waiting continues to mean "on a heavyweight lock" for backward compatibility, then you could sometimes have waiting = false but wait_state non-null. That seems confusing enough to be a bad plan, at least to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Merlin Moncure <mmoncure@gmail.com> writes:
> On Mon, Jun 22, 2015 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> ... The basic idea is that pg_stat_activity.waiting would be
>> replaced by a new column pg_stat_activity.wait_event, which would
>> display the reason why that backend is waiting.

> Instead of changing the column, can't we add a new one? Adjusting
> columns in PSA requires the innumerable queries written against it to
> be adjusted along with all the wiki instructions to dev ops for
> emergency stuck query detection etc etc.

+1. Removing the boolean column seems like it will arbitrarily break a whole lot of client-side code, for not-very-adequate reasons.

regards, tom lane
On 6/22/15 12:37 PM, Robert Haas wrote:
> It's not my goal here to create some kind of a performance counter
> system, even though that would be valuable and could possibly be based
> on the same infrastructure, but rather just to create a very simple
> system that lets people know, without any developer tools, what is
> causing a backend that has accepted a query and not yet returned a
> result to be off-CPU rather than on-CPU.

Ilya Kosmodemiansky presented such a system at pgCon [1], and hopes to submit an initial patch in the coming weeks. The general idea was to do something similar to what you're describing (though, I believe, even more granular) and have a bgworker accumulating that information.

[1] http://www.pgcon.org/2015/schedule/events/809.en.html
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Jun 23, 2015 at 2:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jun 22, 2015 at 4:40 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> > Instead of changing the column, can't we add a new one? Adjusting
> > columns in PSA requires the innumerable queries written against it to
> > be adjusted along with all the wiki instructions to dev ops for
> > emergency stuck query detection etc etc. I would also prefer to
> > query 'waiting' in some cases, especially when in emergency
> > situations; it's faster to type.
>
> If people feel strongly about backward compatibility, yes, we can do
> that. However, if waiting continues to mean "on a heavyweight lock"
> for backward compatibility, then you could sometimes have waiting =
> false but wait_state non-null. That seems confusing enough to be a
> bad plan, at least to me.
>
That's right: if we leave 'waiting' as it is for the sake of backward
compatibility, then it will be confusing after we add wait_event to
pg_stat_activity; and if we change it such that waiting is true for any
kind of wait_event (or remove waiting entirely), then it will break
backward compatibility. So we have the following alternatives:

1. Remove/change 'waiting' in pg_stat_activity and break backward
compatibility. I think we should try to avoid going this route.

2. Add 2 new columns to pg_stat_activity:
   waiting_resource - true for waits other than heavyweight locks, false
   otherwise
   wait_event - description code for the wait event

3. Add a new view 'pg_stat_wait_event' with the following info:
   pid - process id of this backend
   waiting - true for any form of wait, false otherwise
   wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc.
   wait_event - Lock (Relation), Lock (Relation Extension), etc.

Do you think the 2nd or 3rd could be a viable way to proceed for this feature?
On 2015-06-25 16:07:45 +0530, Amit Kapila wrote:
> On Tue, Jun 23, 2015 at 2:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > If people feel strongly about backward compatibility, yes, we can do
> > that. However, if waiting continues to mean "on a heavyweight lock"
> > for backward compatibility, then you could sometimes have waiting =
> > false but wait_state non-null. That seems confusing enough to be a
> > bad plan, at least to me.
>
> That's right if we leave the 'waiting' as it is for the sake of backward
> compatibility, then it will be confusing after we add wait_event to
> pg_stat_activity and if we change it such that for any kind of wait_event
> waiting will be true (or entirely remove waiting), then it will break the
> backward compatibility. So we have below alternatives here:
>
> 1. Remove/Change 'waiting' in pg_stat_activity and break the backward
> compatibility. I think we should try to avoid going via this route.
>
> 2. Add 2 new columns to pg_stat_activity
>    waiting_resource - true for waits other heavy wait locks, false otherwise
>    wait_event - description code for the wait event
>
> 3. Add new view 'pg_stat_wait_event' with following info:
>    pid - process id of this backend
>    waiting - true for any form of wait, false otherwise
>    wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
>    wait_event - Lock (Relation), Lock (Relation Extension), etc
>
> Do you think 2nd or 3rd could be viable way to proceed for this feature?

3) sounds best to me. Keeping 'waiting' even makes sense in that case, because it'll tell whether wait_event_type is currently being blocked on. We can leave the former contents in until the next thing is being blocked...

Greetings,

Andres Freund
On Thu, Jun 25, 2015 at 4:16 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2015-06-25 16:07:45 +0530, Amit Kapila wrote:
> > On Tue, Jun 23, 2015 at 2:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > If people feel strongly about backward compatibility, yes, we can do
> > > that. However, if waiting continues to mean "on a heavyweight lock"
> > > for backward compatibility, then you could sometimes have waiting =
> > > false but wait_state non-null. That seems confusing enough to be a
> > > bad plan, at least to me.
> > >
> >
> > That's right if we leave the 'waiting' as it is for the sake of backward
> > compatibility, then it will be confusing after we add wait_event to
> > pg_stat_activity and if we change it such that for any kind of wait_event
> > waiting will be true (or entirely remove waiting), then it will break the
> > backward compatibility. So we have below alternatives here:
>
> > 1. Remove/Change 'waiting' in pg_stat_activity and break the backward
> > compatibility. I think we should try to avoid going via this route.
> >
> > 2. Add 2 new columns to pg_stat_activity
> > waiting_resource - true for waits other heavy wait locks, false
> > otherwise
> > wait_event - description code for the wait event
> >
> > 3. Add new view 'pg_stat_wait_event' with following info:
> > pid - process id of this backend
> > waiting - true for any form of wait, false otherwise
> > wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
> > wait_event - Lock (Relation), Lock (Relation Extension), etc
> >
> > Do you think 2nd or 3rd could be viable way to proceed for this feature?
>
> 3) sounds best to me. Keeping 'waiting' even makes sense in that case,
> because it'll tell whether wait_event_type is currently being blocked
> on. We can leave the former contents in until the next thing is being
> blocked...
>
Won't leaving the former contents as they are (until the next thing is
being blocked) give misleading information? Currently we mark 'waiting'
as false as soon as the heavyweight lock wait is over, so following that
way sounds more appropriate; is there any reason why you want it
differently than what we are doing currently?
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Ilya Kosmodemiansky
Hi all,

On Thu, Jun 25, 2015 at 12:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> 2. Add 2 new columns to pg_stat_activity
>    waiting_resource - true for waits other heavy wait locks, false otherwise
>    wait_event - description code for the wait event
>
> 3. Add new view 'pg_stat_wait_event' with following info:
>    pid - process id of this backend
>    waiting - true for any form of wait, false otherwise
>    wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
>    wait_event - Lock (Relation), Lock (Relation Extension), etc

Personally I think that tracking waits in that straightforward manner is not a good idea for pg_stat_activity. One process can wait for lots of things between two samplings of pg_stat_activity, and that sampling can be pretty useless.

My approach (about which I've given the talk mentioned by Jim, and which I hope to finalize and submit within a few days) is a bit different, and I believe more useful:

1. Some sort of histogram of top waits within the entire database, by pid. That will be an approximate one, because I hardly believe there is a possibility to make a precise one without significant overhead.

2. Some cyclic buffer of more precise wait statistics inside each worker. Sampling may be turned on if we see some issues in histogram (1) and want to have some more details.

> Do you think 2nd or 3rd could be viable way to proceed for this feature?
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Ilya Kosmodemiansky,
PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
ik@postgresql-consulting.com
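[A per-backend cyclic buffer along the lines of item (2) above might look roughly like this. A minimal sketch only: the struct layout, field names, and capacity are invented for illustration and are not from the proof of concept being described.]

#include <stdint.h>

#define WAIT_RING_SIZE 1024  /* hypothetical per-backend capacity */

typedef struct WaitSample
{
    uint8_t  wait_event;  /* reason code, as in Robert's proposal */
    uint64_t start_us;    /* when the wait began */
    uint64_t end_us;      /* when it ended */
} WaitSample;

typedef struct WaitRing
{
    uint32_t   insert_pos;  /* next slot to overwrite */
    WaitSample samples[WAIT_RING_SIZE];
} WaitRing;

/* Record one completed wait, overwriting the oldest entry when full. */
static void
wait_ring_add(WaitRing *ring, uint8_t event, uint64_t start_us, uint64_t end_us)
{
    WaitSample *slot = &ring->samples[ring->insert_pos % WAIT_RING_SIZE];

    slot->wait_event = event;
    slot->start_us = start_us;
    slot->end_us = end_us;
    ring->insert_pos++;
}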
On 2015-06-25 16:26:39 +0530, Amit Kapila wrote:
> Won't leaving former contents as it is (until the next thing is being
> blocked) could give misleading information. Currently we mark 'waiting'
> as false as soon as Heavy Weight Lock is over, so following that way
> sounds more appropriate, is there any reason why you want it differently
> than what we are doing currently?

But we don't do the same for query, so I don't think that says much. I think it'd be useful because it gives you a bit more chance to see what you blocked on last, even if the time the backend was blocked was very short.

Greetings,

Andres Freund
On Thu, Jun 25, 2015 at 4:28 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2015-06-25 16:26:39 +0530, Amit Kapila wrote:
> > Won't leaving former contents as it is (until the next thing is being
> > blocked) could give misleading information. Currently we mark 'waiting'
> > as false as soon as Heavy Weight Lock is over, so following that way
> > sounds more appropriate, is there any reason why you want it differently
> > than what we are doing currently?
>
> But we don't do the same for query, so I don't think that says much. I
> think it'd be useful because it gives you a bit more chance to see what
> you blocked on last, even if the time the backend was blocked was very
> short.
>
Sure, that's another way to look at it; if you and/or others feel that is
better, then we can follow that way.
On Thu, Jun 25, 2015 at 4:28 PM, Ilya Kosmodemiansky <ilya.kosmodemiansky@postgresql-consulting.com> wrote:
>
> On Thu, Jun 25, 2015 at 12:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > 2. Add 2 new columns to pg_stat_activity
> > waiting_resource - true for waits other heavy wait locks, false
> > otherwise
> > wait_event - description code for the wait event
> >
> > 3. Add new view 'pg_stat_wait_event' with following info:
> > pid - process id of this backend
> > waiting - true for any form of wait, false otherwise
> > wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
> > wait_event - Lock (Relation), Lock (Relation Extension), etc
>
> Personally I think, that tracking waits is a not a good idea for
> pg_stat_activity (at least in that straight-forward manner).
As mentioned in the initial mail by Robert, sometimes the system becomes
slow (due to contention on various kinds of locks, or due to I/O, or due
to some other such reason), and that kind of handy information via some
view is quite useful. Recently, while working on one of the
performance/scalability projects, I needed to use gdb to attach to
different processes to see what they were doing (of course one can use
perf or some other utilities as well), and I found most of them were
trying to wait on some LWLocks; having such information available via a
view could be really useful, because sometimes at customer sites we
can't use gdb or perf to see what's going on.
> One
> process can wait for lots of things between 2 sampling of
> pg_stat_activity and that sampling can be pretty useless.
>
Yeah, that's right, and I am not sure if we should bother about such
scenarios, as the system is generally fine in such situations; however,
there are other cases where we can find most of the backends waiting on
one thing or another.
> My approach (about which Ive had a talk mentioned by Jim and which I
> hope to finalize and submit within a few days) is a bit different and
> I believe is more useful:
>
> 1. Some sort of histogram of top waits within entire database by pid.
> That will be an approximate one, because I hardly believe there is a
> possibility to make a precise one without significant overhead.
>
> 2. Some cyclic buffer of more precise wait statistic inside each
> worker. Sampling may be turned on if we see some issues in histogram
> (1) and want to have some more details.
>
I think this is a somewhat different kind of utility, one which can give
us aggregated information; it addresses a different kind of use case and
will have a somewhat more complex design, but it doesn't look impossible
for it to use part of what will be developed as part of this proposal.
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Ilya Kosmodemiansky
On Thu, Jun 25, 2015 at 1:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Personally I think, that tracking waits is a not a good idea for
>> pg_stat_activity (at least in that straight-forward manner).
>
> As mentioned in the initial mail by Robert, that sometimes system becomes
> slow (either due to contention on various kinds of locks or due to I/O or
> due to some other such reasons) that such kind of handy information via
> some view is quite useful. Recently while working on one of the
> performance/scalability projects, I need to use gdb to attach to
> different processes to see what they are doing (of course one can use
> perf or some other utilities as well) and I found most of them were
> trying to wait on some LW locks, now having such an information available
> via view could be really useful, because sometimes at customer sites, we
> can't use gdb or perf to see what's going on.

Yes, I understand such a use-case. But I hardly see how the suggested design can help in such cases.

Basically, a DBA has two reasons to take a look at waits:

1. Long response time for a particular query (or some type of queries). In that case it is good to know how much time we spend waiting for the particular resources we need to get the query results.

2. Overall bad performance of the database. We know that something goes wrong and consumes resources; we need to identify which backend and which query cause the most waits.

In both cases we need a) some historical data rather than a simple snapshot, and b) some approach to aggregating it, because there will certainly be a lot of events.

So my point is, we need a separate interface for waits instead of integrating it into pg_stat_activity. And it should be several interfaces: one for an approximate top of waiting sessions (like Active Session History in Oracle), one for detailed tracing of a session, one for per-resource wait statistics, etc.

>> One process can wait for lots of things between 2 sampling of
>> pg_stat_activity and that sampling can be pretty useless.
>
> Yeah, that's right and I am not sure if we should bother about such
> scenario's as the system is generally fine in such situations, however
> there are other cases where we can find most of the backends are waiting
> on one or other thing.

I think the approach with a top of waiting sessions covers both scenarios (well, with only one exception: if we have billions of very short waits and high contention is the problem). However, it may be a good idea to identify the resource we are waiting for from pg_stat_activity if we are waiting for a long time.

> I think this is some what different kind of utility which can give us
> aggregated information and I think this will address different kind of
> usecase and will have somewhat more complex design and it doesn't
> look impossible to use part of what will be developed as part of this
> proposal.

I think it is more than possible to mix both approaches. My proof of concept now is only about LWLocks - yours and Robert's is more general, and certainly some wait event classification will be needed for both approaches, and it's much better to implement one rather than two different ones. And at least, I will be interested in reviewing your approach.

--
Ilya Kosmodemiansky,
PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
ik@postgresql-consulting.com
On Thu, Jun 25, 2015 at 6:10 PM, Ilya Kosmodemiansky <ilya.kosmodemiansky@postgresql-consulting.com> wrote:
>
> On Thu, Jun 25, 2015 at 1:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> Personally I think, that tracking waits is a not a good idea for
> >> pg_stat_activity (at least in that straight-forward manner).
> >
> > As mentioned in the initial mail by Robert, that sometimes system becomes
> > slow (either due to contention on various kinds of locks or due to I/O or
> > due
> > to some other such reasons) that such kind of handy information via some
> > view is quite useful. Recently while working on one of the
> > performance/scalability
> > projects, I need to use gdb to attach to different processes to see what
> > they
> > are doing (of course one can use perf or some other utilities as well) and I
> > found most of them were trying to wait on some LW locks, now having such
> > an information available via view could be really useful, because sometimes
> > at customer sites, we can't use gdb or perf to see what's going on.
>
> Yes, I understand such a use-case. But I hardly see if suggested
> design can help for such cases.
>
> Basically, a DBA has two reasons to take a look on waits:
>
> 1. Long response time for particular query (or some type of queries).
> In that case it is good to know how much time we spend on waiting for
> particular resources we need to get query results
> 2. Overall bad performance of a database. We know, that something goes
> wrong and consumes resources, we need to identify which backend, which
> query causes the most of waits.
>
> In both cases we need a) some historical data rather than simple
> snapshot b) some approach how to aggregate it because the will be
> certainly a lot of events
>
I think this thread's proposal will help in cases where the user/DBA
wants to see where the database is currently spending most of its time
(during waits). I understand that historical information can be helpful
for the kind of cases you have explained above.
>
> I think it is more than possible to mix both approaches. My proof of
> concept now is only about LWLocks - yours and Robert's is more
> general, and certainly some wait event classification will be needed
> for both approaches and its much better to implement one rather than
> two different.
>
> And at least, I will be interesting in reviewing your approach.
>
Okay, I am planning to spend time on this patch in the coming few days,
and when it's ready, maybe we can see if it could be useful for what you
are planning to do.
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Peter Eisentraut
On 6/22/15 1:37 PM, Robert Haas wrote:
> Currently, the only time we report a process as waiting is when it is
> waiting for a heavyweight lock. I'd like to make that somewhat more
> fine-grained, by reporting the type of heavyweight lock it's awaiting
> (relation, relation extension, transaction, etc.). Also, I'd like to
> report when we're waiting for a lwlock, and report either the specific
> fixed lwlock for which we are waiting, or else the type of lock (lock
> manager lock, buffer content lock, etc.) for locks of which there is
> more than one. I'm less sure about this next part, but I think we
> might also want to report ourselves as waiting when we are doing an OS
> read or an OS write, because it's pretty common for people to think
> that a PostgreSQL bug is to blame when in fact it's the operating
> system that isn't servicing our I/O requests very quickly.

Could that also cover waiting on network?
Andres Freund <andres@anarazel.de> writes:
> On 2015-06-25 16:26:39 +0530, Amit Kapila wrote:
>> Won't leaving former contents as it is (until the next thing is being
>> blocked) could give misleading information. Currently we mark 'waiting'
>> as false as soon as Heavy Weight Lock is over, so following that way
>> sounds more appropriate, is there any reason why you want it differently
>> than what we are doing currently?

> But we don't do the same for query, so I don't think that says much. I
> think it'd be useful because it gives you a bit more chance to see what
> you blocked on last, even if the time the backend was blocked was very
> short.

The problem with the query analogy is that it's possible to tell whether the query is active or not, by looking at the status column. We need to avoid a situation where you can't tell if the wait status is current or merely the last thing waited for.

At the moment I'm inclined to think we should put this on the back burner until we see what Ilya submits. None of the proposals for changing pg_stat_activity sound terribly clean to me.

regards, tom lane
On 2015-06-25 10:01:39 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2015-06-25 16:26:39 +0530, Amit Kapila wrote:
> >> Won't leaving former contents as it is (until the next thing is being
> >> blocked) could give misleading information. Currently we mark 'waiting'
> >> as false as soon as Heavy Weight Lock is over, so following that way
> >> sounds more appropriate, is there any reason why you want it differently
> >> than what we are doing currently?
>
> > But we don't do the same for query, so I don't think that says much. I
> > think it'd be useful because it gives you a bit more chance to see what
> > you blocked on last, even if the time the backend was blocked was very
> > short.
>
> The problem with the query analogy is that it's possible to tell whether
> the query is active or not, by looking at the status column. We need to
> avoid a situation where you can't tell if the wait status is current or
> merely the last thing waited for.

Well, that's what the 'waiting' column would be about in the proposal I'm commenting about.

> At the moment I'm inclined to think we should put this on the back burner
> until we see what Ilya submits. None of the proposals for changing
> pg_stat_activity sound terribly clean to me.

We'll see. To me that's two different things. Knowing what a backend is currently blocked on is a somewhat different use case from keeping longer-running stats. E.g. debugging why vacuum is not progressing (waiting for a cleanup lock on a page that needs to be frozen) is just about impossible right now.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2015-06-25 10:01:39 -0400, Tom Lane wrote:
>> The problem with the query analogy is that it's possible to tell whether
>> the query is active or not, by looking at the status column. We need to
>> avoid a situation where you can't tell if the wait status is current or
>> merely the last thing waited for.

> Well, that's what the 'waiting' column would be about in the proposal I'm
> commenting about.

To do that, we'd have to change the semantics of the 'waiting' column so that it becomes true for non-heavyweight-lock waits. I'm not sure whether that's a good idea or not; I'm afraid there may be client-side code that expects 'waiting' to indicate that there's a corresponding row in pg_locks. If we're willing to do that, then I'd be okay with allowing wait_status to be defined as "last thing waited for"; but the two points aren't separable.

regards, tom lane
On Thu, Jun 25, 2015 at 8:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andres Freund <andres@anarazel.de> writes:
> > On 2015-06-25 10:01:39 -0400, Tom Lane wrote:
> >> The problem with the query analogy is that it's possible to tell whether
> >> the query is active or not, by looking at the status column. We need to
> >> avoid a situation where you can't tell if the wait status is current or
> >> merely the last thing waited for.
>
> > Well, that's what the 'waiting' column would be about in the proposal I'm
> > commenting about.
>
> To do that, we'd have to change the semantics of the 'waiting' column so
> that it becomes true for non-heavyweight-lock waits.
If we introduce a new view like pg_stat_wait_event as mentioned above,
then we can avoid this problem: the existing 'waiting' in
pg_stat_activity would mean the same as it means today, and a new column
'waiting' in pg_stat_wait_event could indicate waits for
non-heavyweight locks.
On Thu, Jun 25, 2015 at 6:46 AM, Andres Freund <andres@anarazel.de> wrote:
>> 1. Remove/Change 'waiting' in pg_stat_activity and break the backward
>> compatibility. I think we should try to avoid going via this route.
>>
>> 2. Add 2 new columns to pg_stat_activity
>>    waiting_resource - true for waits other heavy wait locks, false otherwise
>>    wait_event - description code for the wait event
>>
>> 3. Add new view 'pg_stat_wait_event' with following info:
>>    pid - process id of this backend
>>    waiting - true for any form of wait, false otherwise
>>    wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
>>    wait_event - Lock (Relation), Lock (Relation Extension), etc
>>
>> Do you think 2nd or 3rd could be viable way to proceed for this feature?
>
> 3) sounds best to me. Keeping 'waiting' even makes sense in that case,
> because it'll tell whether wait_event_type is currently being blocked
> on. We can leave the former contents in until the next thing is being
> blocked...

So, that's still redefining the "waiting" column, because it will now indicate whether we are waiting on some wait event, not whether we are waiting on specifically a heavyweight lock. But that doesn't bother me, because I think it's going to be darn confusing if we keep "waiting" around with the specific meaning of "waiting for a heavyweight lock" while also now having a notion of "waiting for something else". I like the idea of indicating both the most recent wait event and whether or not we are still waiting for it - we refined current_query to query not too long ago, and I certainly think that was a significant improvement even if it broke some people's scripts.

I am pretty unconvinced that it's a good idea to try to split up the wait event into two columns. I'm only proposing ~20 wait states, so there's something like 5 bits of information here. Spreading that over two text columns is a lot, and note that Amit's text would basically recapitulate the contents of the first column in the second one, which I cannot get excited about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 25, 2015 at 6:58 AM, Ilya Kosmodemiansky <ilya.kosmodemiansky@postgresql-consulting.com> wrote:
> 1. Some sort of histogram of top waits within entire database by pid.
> That will be an approximate one, because I hardly believe there is a
> possibility to make a precise one without significant overhead.

You could compute that histogram from the data I am proposing to publish. Indeed, it's hard to see what other fundamentally different mechanism you would use. The backends have got to advertise their state in shared memory someplace, which my proposal would do, and then you've got to poll that data somewhere else, which I'm not proposing to do, but it could be done.

> 2. Some cyclic buffer of more precise wait statistic inside each
> worker. Sampling may be turned on if we see some issues in histogram
> (1) and want to have some more details.

That could be built on top of this, too. Both of those ideas require the information that my proposal would provide, but the information this proposal would provide is still useful if we don't do those other things.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
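[To illustrate the polling idea described here, a minimal sketch of a sampler that reads each backend's advertised wait event and accumulates counts. PgBackendStatus and st_wait_event reuse the hypothetical definitions from the sketch near the top of the thread; none of this is from a posted patch.]

#include <stdint.h>

/* Hypothetical: matches the earlier stand-in for the shared-memory entry. */
typedef struct PgBackendStatus
{
    uint8_t st_wait_event;
} PgBackendStatus;

static uint64_t wait_histogram[256];  /* one bucket per uint8 reason code */

/*
 * One sampling pass: read each backend's advertised wait event and bump
 * the matching bucket.  Run periodically, this approximates the "top
 * waits" histogram using only the published per-backend data.
 */
static void
sample_wait_events(const PgBackendStatus *entries, int nbackends)
{
    for (int i = 0; i < nbackends; i++)
        wait_histogram[entries[i].st_wait_event]++;
}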
On Thu, Jun 25, 2015 at 9:23 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 6/22/15 1:37 PM, Robert Haas wrote:
>> Currently, the only time we report a process as waiting is when it is
>> waiting for a heavyweight lock. I'd like to make that somewhat more
>> fine-grained, by reporting the type of heavyweight lock it's awaiting
>> (relation, relation extension, transaction, etc.). Also, I'd like to
>> report when we're waiting for a lwlock, and report either the specific
>> fixed lwlock for which we are waiting, or else the type of lock (lock
>> manager lock, buffer content lock, etc.) for locks of which there is
>> more than one. I'm less sure about this next part, but I think we
>> might also want to report ourselves as waiting when we are doing an OS
>> read or an OS write, because it's pretty common for people to think
>> that a PostgreSQL bug is to blame when in fact it's the operating
>> system that isn't servicing our I/O requests very quickly.
>
> Could that also cover waiting on network?

Possibly. My approach requires that the number of wait states be kept relatively small, ideally fitting in a single byte. And it also requires that we insert pgstat_report_waiting() calls around the thing that is notionally blocking. So, if there are a small number of places in the code where we do network I/O, we could stick those calls around those places, and this would work just fine. But if a foreign data wrapper, or any other piece of code, does network I/O - or any other blocking operation - without calling pgstat_report_waiting(), we just won't know about it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
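[Concretely, the bracketing described here might look like the following hypothetical call site. pgstat_report_waiting() and the WAIT_EVENT_* codes are the illustrative stand-ins from the sketch near the top of the thread, not from an actual patch.]

#include <stdint.h>
#include <unistd.h>

/* Illustrative reason codes, as in the earlier sketch. */
enum { WAIT_EVENT_NONE = 0, WAIT_EVENT_OS_READ = 6 };

extern void pgstat_report_waiting(uint8_t wait_event);

/* Report the wait around the notionally-blocking operation. */
static ssize_t
read_with_wait_report(int fd, void *buffer, size_t amount)
{
    ssize_t nread;

    pgstat_report_waiting(WAIT_EVENT_OS_READ);
    nread = read(fd, buffer, amount);
    pgstat_report_waiting(WAIT_EVENT_NONE);

    return nread;
}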
On Fri, Jun 26, 2015 at 9:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 25, 2015 at 6:46 AM, Andres Freund <andres@anarazel.de> wrote:
> >> 1. Remove/Change 'waiting' in pg_stat_activity and break the backward
> >> compatibility. I think we should try to avoid going via this route.
> >>
> >> 2. Add 2 new columns to pg_stat_activity
> >> waiting_resource - true for waits other heavy wait locks, false
> >> otherwise
> >> wait_event - description code for the wait event
> >>
> >> 3. Add new view 'pg_stat_wait_event' with following info:
> >> pid - process id of this backend
> >> waiting - true for any form of wait, false otherwise
> >> wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
> >> wait_event - Lock (Relation), Lock (Relation Extension), etc
> >>
> >> Do you think 2nd or 3rd could be viable way to proceed for this feature?
> >
> > 3) sounds best to me. Keeping 'waiting' even makes sense in that case,
> > because it'll tell whether wait_event_type is currently being blocked
> > on. We can leave the former contents in until the next thing is being
> > blocked...
>
> So, that's still redefining the "waiting" column, because it will now
> indicate whether we are waiting on some wait event, not whether we are
> waiting on specifically a heavyweight lock. But that doesn't bother
> me, because I think it's going to be darn confusing if we keep
> "waiting" around with the specific meaning of "waiting for a
> heavyweight lock" while also now having a notion of "waiting for
> something else". I like the idea of indicating both the most recent
> wait event and whether or not we are still waiting for it - we refined
> current_query to query not too long ago, and I certainly think that
> was a significant improvement even if it broke some people's scripts.
>
> I am pretty unconvinced that it's a good idea to try to split up the
> wait event into two columns. I'm only proposing ~20 wait states, so
> there's something like 5 bits of information here. Spreading that
> over two text columns is a lot, and note that Amit's text would
> basically recapitulate the contents of the first column in the second
> one, which I cannot get excited about.
>
There is an advantage in splitting the columns, which is that if the
wait_event_type column indicates Heavy Weight Lock, then the user can go
and check further details in pg_locks. I think he can do that even by
looking at the wait_event column, but that might not be as
straightforward as with a wait_event_type column.
On Thu, Jun 25, 2015 at 11:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >> 3. Add new view 'pg_stat_wait_event' with following info:
>> >> pid - process id of this backend
>> >> waiting - true for any form of wait, false otherwise
>> >> wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
>> >> wait_event - Lock (Relation), Lock (Relation Extension), etc
>>
>> I am pretty unconvinced that it's a good idea to try to split up the
>> wait event into two columns. I'm only proposing ~20 wait states, so
>> there's something like 5 bits of information here. Spreading that
>> over two text columns is a lot, and note that Amit's text would
>> basically recapitulate the contents of the first column in the second
>> one, which I cannot get excited about.
>
> There is an advantage in splitting the columns which is if wait_event_type
> column indicates Heavy Weight Lock, then user can go and check further
> details in pg_locks, I think he can do that even by seeing wait_event
> column, but that might not be as straightforward as with wait_event_type
> column.

It's just a matter of writing event_type LIKE 'Lock %' instead of event_type = 'Lock'.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Peter Eisentraut
On 6/25/15 11:39 PM, Robert Haas wrote:
>> Could that also cover waiting on network?
>
> Possibly. My approach requires that the number of wait states be kept
> relatively small, ideally fitting in a single byte. And it also
> requires that we insert pgstat_report_waiting() calls around the thing
> that is notionally blocking. So, if there are a small number of places
> in the code where we do network I/O, we could stick those calls around
> those places, and this would work just fine. But if a foreign data
> wrapper, or any other piece of code, does network I/O - or any other
> blocking operation - without calling pgstat_report_waiting(), we just
> won't know about it.

That sounds doable, assuming that extension authors play along. I see that network problems because of connection poolers, foreign-data connections, and so on are a significant cause of session "hangs", so it would be good if they could be covered.
On Fri, Jun 26, 2015 at 6:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 25, 2015 at 11:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> >> 3. Add new view 'pg_stat_wait_event' with following info:
> >> >> pid - process id of this backend
> >> >> waiting - true for any form of wait, false otherwise
> >> >> wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
> >> >> wait_event - Lock (Relation), Lock (Relation Extension), etc
> >> I am pretty unconvinced that it's a good idea to try to split up the
> >> wait event into two columns. I'm only proposing ~20 wait states, so
> >> there's something like 5 bits of information here. Spreading that
> >> over two text columns is a lot, and note that Amit's text would
> >> basically recapitulate the contents of the first column in the second
> >> one, which I cannot get excited about.
> > There is an advantage in splitting the columns which is if wait_event_type
> > column indicates Heavy Weight Lock, then user can go and check further
> > details in pg_locks, I think he can do that even by seeing wait_event
> > column, but that might not be as straightforward as with wait_event_type
> > column.
>
> It's just a matter of writing event_type LIKE 'Lock %' instead of
> event_type = 'Lock'.
>
Yes, it can be done that way, and maybe that is not inconvenient for the
user, but there is another kind of information the user might need, such as
the distinct resources on which a wait is possible, which again he could
easily find with a separate event_type column. I think some more discussion
is required before we can freeze the user interface for this feature, but in
the meantime I have prepared an initial patch adding a new column wait_event
to pg_stat_activity. For now, I have added support for Heavy-Weight locks,
Light-Weight locks [1] and the Buffer Cleanup Lock. I could add the other
types (spinlock delay sleep, IO, network IO, etc.) if there is no objection
to the approach the patch uses to implement this feature.
[1] For LWLocks, I have currently reported the wait_event as OtherLock for
locks other than NUM_FIXED_LWLOCKS (refer to function NumLWLocks to see all
the types of LWLocks). The reason is that there is no straightforward way to
get the id (lockid) of such locks, since for some of them the number of
locks depends on run-time configuration parameters (like shared_buffers or
MaxBackends). If we want to handle those, we could either do some math to
derive the lockid from the runtime values of these parameters, or we could
add a tag to the LWLock structure (indicating the lock type) and fill it in
during lock initialization, or maybe find some other better way to do it.
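As a rough illustration of the tag idea (the structure and tag values below
are hypothetical stand-ins, not the actual LWLock definition):

    #include <stdint.h>

    /* Hypothetical lock-type tags; not the actual LWLock definition. */
    typedef enum
    {
        LWTAG_INDIVIDUAL,      /* one of the NUM_FIXED_LWLOCKS */
        LWTAG_BUFFER_CONTENT,  /* per-buffer content locks */
        LWTAG_LOCK_MANAGER,    /* lock manager partition locks */
        LWTAG_OTHER
    } LWLockTagSketch;

    typedef struct
    {
        /* ... the existing lock state would live here ... */
        uint8_t tag;           /* filled in once, during lock initialization */
    } LWLockSketch;

    /* At wait time, the backend reports the stored tag instead of trying
     * to derive a lockid from run-time configuration parameters. */
    static uint8_t lwlock_wait_event(const LWLockSketch *lock)
    {
        return lock->tag;
    }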
I have still not added documentation and have not changed anything for the
waiting column in pg_stat_activity, as I think we need to finalize the user
interface first. Apart from that, as mentioned above, support for some event
types (like IO, network IO, etc.) is still not added, and I also think
separate functions (like the existing pg_stat_get_backend_waiting) will be
required, which again depends upon the user interface.
Suggestions?
On Fri, Jun 26, 2015 at 12:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 25, 2015 at 9:23 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> On 6/22/15 1:37 PM, Robert Haas wrote:
>>> Currently, the only time we report a process as waiting is when it is
>>> waiting for a heavyweight lock. I'd like to make that somewhat more
>>> fine-grained, by reporting the type of heavyweight lock it's awaiting
>>> (relation, relation extension, transaction, etc.). Also, I'd like to
>>> report when we're waiting for a lwlock, and report either the specific
>>> fixed lwlock for which we are waiting, or else the type of lock (lock
>>> manager lock, buffer content lock, etc.) for locks of which there is
>>> more than one. I'm less sure about this next part, but I think we
>>> might also want to report ourselves as waiting when we are doing an OS
>>> read or an OS write, because it's pretty common for people to think
>>> that a PostgreSQL bug is to blame when in fact it's the operating
>>> system that isn't servicing our I/O requests very quickly.
>>
>> Could that also cover waiting on network?
>
> Possibly. My approach requires that the number of wait states be kept
> relatively small, ideally fitting in a single byte. And it also
> requires that we insert pgstat_report_waiting() calls around the thing
> that is notionally blocking. So, if there are a small number of
> places in the code where we do network I/O, we could stick those calls
> around those places, and this would work just fine. But if a foreign
> data wrapper, or any other piece of code, does network I/O - or any
> other blocking operation - without calling pgstat_report_waiting(), we
> just won't know about it.

Probably Itagaki-san's very similar proposal and patch would be useful
to consider what wait events to track.

http://www.postgresql.org/message-id/20090309125146.913C.52131E4D@oss.ntt.co.jp

According to his patch, the wait events that he was thinking to add were:

+ typedef enum PgCondition
+ {
+     PGCOND_UNUSED = 0,                  /* unused */
+
+     /* 10000 - CPU */
+     PGCOND_CPU = 10000,                 /* generic cpu operations */
+     /* 11000 - CPU:PARSE */
+     PGCOND_CPU_PARSE = 11000,           /* pg_parse_query */
+     PGCOND_CPU_PARSE_ANALYZE = 11100,   /* parse_analyze */
+     /* 12000 - CPU:REWRITE */
+     PGCOND_CPU_REWRITE = 12000,         /* pg_rewrite_query */
+     /* 13000 - CPU:PLAN */
+     PGCOND_CPU_PLAN = 13000,            /* pg_plan_query */
+     /* 14000 - CPU:EXECUTE */
+     PGCOND_CPU_EXECUTE = 14000,         /* PortalRun or PortalRunMulti */
+     PGCOND_CPU_TRIGGER = 14100,         /* ExecCallTriggerFunc */
+     PGCOND_CPU_SORT = 14200,            /* (generic sort operation) */
+     PGCOND_CPU_SORT_HEAP = 14210,       /* tuplesort_begin_heap */
+     PGCOND_CPU_SORT_INDEX = 14220,      /* tuplesort_begin_index_btree */
+     PGCOND_CPU_SORT_DATUM = 14230,      /* tuplesort_begin_datum */
+     /* 15000 - CPU:UTILITY */
+     PGCOND_CPU_UTILITY = 15000,         /* ProcessUtility */
+     PGCOND_CPU_COMMIT = 15100,          /* CommitTransaction */
+     PGCOND_CPU_ROLLBACK = 15200,        /* AbortTransaction */
+     /* 16000 - CPU:TEXT */
+     PGCOND_CPU_TEXT = 16000,            /* (generic text operation) */
+     PGCOND_CPU_DECODE = 16100,          /* pg_client_to_server */
+     PGCOND_CPU_ENCODE = 16200,          /* pg_server_to_client */
+     PGCOND_CPU_LIKE = 16310,            /* GenericMatchText */
+     PGCOND_CPU_ILIKE = 16320,           /* Generic_Text_IC_like */
+     PGCOND_CPU_RE = 16400,              /* (generic regexp operation) */
+     PGCOND_CPU_RE_COMPILE = 16410,      /* RE_compile_and_cache */
+     PGCOND_CPU_RE_EXECUTE = 16420,      /* RE_execute */
+
+     /* 20000 - NETWORK */
+     PGCOND_NETWORK = 20000,             /* (generic network operation) */
+     PGCOND_NETWORK_RECV = 21000,        /* secure_read */
+     PGCOND_NETWORK_SEND = 22000,        /* secure_write */
+
+     /* 30000 - IDLE (should be larger than network to distinguish idle or recv) */
+     PGCOND_IDLE = 30000,                /* <IDLE> */
+     PGCOND_IDLE_IN_TRANSACTION = 31000, /* <IDLE> in transaction */
+     PGCOND_IDLE_SLEEP = 32000,          /* pg_usleep */
+
+     /* 40000 - XLOG */
+     PGCOND_XLOG = 40000,                /* (generic xlog operation) */
+     PGCOND_XLOG_CRC = 41000,            /* crc calculation in XLogInsert */
+     PGCOND_XLOG_INSERT = 42000,         /* insert in XLogInsert */
+     PGCOND_XLOG_OPEN = 43000,           /* XLogFileOpen */
+     PGCOND_XLOG_CLOSE = 44000,          /* XLogFileClose */
+     PGCOND_XLOG_WRITE = 45000,          /* write in XLogWrite */
+     PGCOND_XLOG_FLUSH = 46000,          /* issue_xlog_fsync */
+
+     /* 50000 - DATA */
+     PGCOND_DATA = 50000,                /* (generic data operation) */
+     PGCOND_DATA_CREATE = 51000,         /* smgrcreate */
+     PGCOND_DATA_OPEN = 52000,           /* smgropen */
+     PGCOND_DATA_CLOSE = 53000,          /* smgrclose */
+     PGCOND_DATA_STAT = 54000,           /* smgrnblocks */
+     PGCOND_DATA_READ = 55000,           /* smgrread */
+     PGCOND_DATA_PREFETCH = 56000,       /* smgrprefetch */
+     PGCOND_DATA_WRITE = 57000,          /* smgrwrite */
+     PGCOND_DATA_EXTEND = 58000,         /* smgrextend */
+
+     /* 60000 - TEMP */
+     PGCOND_TEMP = 60000,                /* (generic temp file operation) */
+     PGCOND_TEMP_READ = 61000,           /* BufFileRead */
+     PGCOND_TEMP_WRITE = 62000,          /* BufFileWrite */
+
+     /* 70000 - LOCK */
+     PGCOND_LOCK = 70000,                /* waiting on a lmgr lock */
+     /* 70001-70999 is reserved for lmgr locks */
+
+     /* 80000 - LWLOCK */
+     PGCOND_LWLOCK = 80000,              /* waiting on a generic lwlock */
+     /* 80001-80999 is reserved for named lwlocks */
+     PGCOND_LWLOCK_BUFMAPPING = 81000,   /* BufMappingLock(s) */
+     PGCOND_LWLOCK_LOCKMGR = 82000,      /* LockMgrLock(s) */
+     PGCOND_LWLOCK_PAGE = 83000,         /* BufferDesc.content_lock */
+     PGCOND_LWLOCK_IO = 84000,           /* BufferDesc.io_in_progress_lock */
+
+     /* 90000 - SPINLOCK */
+     PGCOND_SPINLOCK = 90000             /* timeout in s_lock */
+ } PgCondition;

Regards,

-- Fujii Masao
On Tue, Jun 30, 2015 at 10:30 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jun 26, 2015 at 6:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Jun 25, 2015 at 11:57 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> >> >> 3. Add new view 'pg_stat_wait_event' with following info:
>> >> >> pid - process id of this backend
>> >> >> waiting - true for any form of wait, false otherwise
>> >> >> wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
>> >> >> wait_event - Lock (Relation), Lock (Relation Extension), etc
>> >> I am pretty unconvinced that it's a good idea to try to split up the
>> >> wait event into two columns. I'm only proposing ~20 wait states, so
>> >> there's something like 5 bits of information here. Spreading that
>> >> over two text columns is a lot, and note that Amit's text would
>> >> basically recapitulate the contents of the first column in the second
>> >> one, which I cannot get excited about.
>> > There is an advantage in splitting the columns which is if
>> > wait_event_type column indicates Heavy Weight Lock, then user can go
>> > and check further details in pg_locks, I think he can do that even by
>> > seeing wait_event column, but that might not be as straightforward as
>> > with wait_event_type column.
>>
>> It's just a matter of writing event_type LIKE 'Lock %' instead of
>> event_type = 'Lock'.
>>
>
> Yes that way it can be done and may be that is not inconvenient for user,
> but then there is other type of information which user might need like
> what distinct resources on which wait is possible, which again he can
> easily find with different event_type column. I think some more
> discussion is required before we could freeze the user interface for
> this feature, but in the meantime I have prepared an initial patch by
> adding a new column wait_event in pg_stat_activity.
> For now, I have added the support for Heavy-Weight locks, Light-Weight
> locks [1] and Buffer Cleanup Lock. I could add for other types (spin
> lock delay sleep, IO, network IO, etc.) if there is no objection in the
> approach used in patch to implement this feature.
>
> [1] For LWLocks, currently I have used wait_event as OtherLock for locks
> other than NUM_FIXED_LWLOCKS (Refer function NumLWLocks to see all
> type of LWLocks). The reason is that there is no straight forward way to
> get the id (lockid) of such locks as for some of those (like
> shared_buffers, MaxBackends) the number of locks will depend on run-time
> configuration parameters. I think if we want to handle those then we
> could either do some math to find out the lockid based on runtime values
> of these parameters or we could add tag in LWLock structure (which
> indicates the lock type) and fill it during Lock initialization or may
> be some other better way to do it.
>
> I have still not added documentation and have not changed anything for
> waiting column in pg_stat_activity as I think before that we need to
> finalize the user interface. Apart from that as mentioned above still
> wait for some event types (like IO, network IO, etc.) is not added and
> also I think separate function/'s (like we have for existing ones
> pg_stat_get_backend_waiting) will be required which again depends upon
> user interface.

Yes, we need to discuss what events to track. As I suggested upthread,
Itagaki-san's patch would be helpful to think that. He proposed to track
not only "wait event" like locking and I/O operation but also CPU events
like query parsing, planning, and etc. I think that tracking even CPU
events would be useful to analyze the performance problem. For example,
if pg_stat_activity reports many times that a large majority of backends
are doing QUERY PLANNING, DBA can think that it might be possible
cause of performance bottleneck and try to check whether the application
uses prepared statements properly.

Here are some review comments on the patch:

When I played around the patch, the test of make check failed.

Each backend reports its event when trying to take a lock. But
the reported event is never reset until next event is reported.
Is this OK? This means that the wait_event column keeps showing
the *last* event while a backend is in idle state, for example.
So, shouldn't we reset the reported event or report another one
when releasing the lock?

+read_string_from_waitevent(uint8 wait_event)

The performance of this function looks poor because its worst case
is O(n): n is the number of all the events that we are trying to track.
Also in pg_stat_activity, this function is called per backend.
Can't we retrieve the event name by using wait event ID as an index
of WaitEventTypeMap array?

Regards,

-- Fujii Masao
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Kyotaro HORIGUCHI
Date:
Please forgive me to resend this message for some too-sad misspellings.

# "Waiting for heavy weight locks" is somewhat confusing to spell..

===

Hello,

At Tue, 7 Jul 2015 16:27:38 +0900, Fujii Masao <masao.fujii@gmail.com> wrote
in <CAHGQGwEJwov8YwvmbbWps3Rba6kF1yf7qL3S==Oy4D=gq9YNsQ@mail.gmail.com>
> Each backend reports its event when trying to take a lock. But
> the reported event is never reset until next event is reported.
> Is this OK? This means that the wait_event column keeps showing
> the *last* event while a backend is in idle state, for example.
> So, shouldn't we reset the reported event or report another one
> when releasing the lock?

It seems so, but pg_stat_activity.waiting would indicate whether the
event is lasting. However, .waiting reflects only the status of
heavy-weight locks, so it would be quite misleading. I think that
pg_stat_activity.wait_event should be linked to .waiting; then
.wait_event should be restricted to heavy-weight locks if the meaning
of .waiting cannot be changed. On the other hand, we need to have as
many wait events as Itagaki-san's patch did, so pg_stat_activity might
be the wrong place for a full-spec wait_event.

> +read_string_from_waitevent(uint8 wait_event)
>
> The performance of this function looks poor because its worst case
> is O(n): n is the number of all the events that we are trying to track.
> Also in pg_stat_activity, this function is called per backend.
> Can't we retrieve the event name by using wait event ID as an index
> of WaitEventTypeMap array?

+1

regards,

-- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Jul 7, 2015 at 12:57 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Tue, Jun 30, 2015 at 10:30 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have still not added documentation and have not changed anything for
> > waiting column in pg_stat_activity as I think before that we need to
> > finalize
> > the user interface. Apart from that as mentioned above still wait for
> > some event types (like IO, netwrok IO, etc.) is not added and also I think
> > separate function/'s (like we have for existing ones
> > pg_stat_get_backend_waiting)
> > will be required which again depends upon user interface.
>
> Yes, we need to discuss what events to track. As I suggested upthread,
> Itagaki-san's patch would be helpful to think that. He proposed to track
> not only "wait event" like locking and I/O operation but also CPU events
> like query parsing, planning, and etc. I think that tracking even CPU
> events would be useful to analyze the performance problem.
I think as part of this patch we are trying to capture events where we
need to wait; extending the scope to include CPU events might duplicate
the tracing events (DTrace) we already have in PostgreSQL, and it might
degrade performance in certain cases. If we capture events only during
waits, then there is almost no chance of any visible performance impact.
I suggest that for an initial implementation we track heavyweight locks,
lightweight locks and IO events, and then we can add more events if we
think those are important.
> Here are some review comments on the patch:
>
> When I played around the patch, the test of make check failed.
>
Yes, I still need to update some of the tests according to updates in
pg_stat_activity, but the main reason I have not done so is that the user
interface needs some more discussion.
> Each backend reports its event when trying to take a lock. But
> the reported event is never reset until next event is reported.
> Is this OK? This means that the wait_event column keeps showing
> the *last* event while a backend is in idle state, for example.
> So, shouldn't we reset the reported event or report another one
> when releasing the lock?
>
As pointed out by Kyotaro-san, the 'waiting' column will indicate whether
the backend is still waiting on the event specified in the wait_event
column. Currently waiting is updated only for Heavy Weight locks; if there
is no objection to using it for other wait_events (i.e. breaking backward
compatibility for that parameter), then I can update it for other events
as well. Do you have any better suggestions for the same?
> +read_string_from_waitevent(uint8 wait_event)
>
> The performance of this function looks poor because its worst case
> is O(n): n is the number of all the events that we are trying to track.
> Also in pg_stat_activity, this function is called per backend.
> Can't we retrieve the event name by using wait event ID as an index
> of WaitEventTypeMap array?
>
Yes, we can do it that way.
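For example, a standalone sketch along those lines (the table contents are
illustrative, not the actual event list):

    #include <stdio.h>

    /* Hypothetical event-name table; entries are illustrative only. */
    static const char *const WaitEventTypeMap[] = {
        "None",
        "Lock (Relation)",
        "Lock (Relation Extension)",
        "LWLock (ProcArrayLock)",
        "Buffer Cleanup Lock",
    };
    #define NUM_WAIT_EVENTS \
        (sizeof(WaitEventTypeMap) / sizeof(WaitEventTypeMap[0]))

    /* O(1): the wait event ID is itself the array index. */
    static const char *wait_event_name(unsigned event)
    {
        return event < NUM_WAIT_EVENTS ? WaitEventTypeMap[event] : "Unknown";
    }

    int main(void)
    {
        printf("%s\n", wait_event_name(2)); /* Lock (Relation Extension) */
        return 0;
    }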
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Alexander Korotkov
Date:
On Fri, Jun 26, 2015 at 6:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 25, 2015 at 9:23 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> On 6/22/15 1:37 PM, Robert Haas wrote:
>>> Currently, the only time we report a process as waiting is when it is
>>> waiting for a heavyweight lock. I'd like to make that somewhat more
>>> fine-grained, by reporting the type of heavyweight lock it's awaiting
>>> (relation, relation extension, transaction, etc.). Also, I'd like to
>>> report when we're waiting for a lwlock, and report either the specific
>>> fixed lwlock for which we are waiting, or else the type of lock (lock
>>> manager lock, buffer content lock, etc.) for locks of which there is
>>> more than one. I'm less sure about this next part, but I think we
>>> might also want to report ourselves as waiting when we are doing an OS
>>> read or an OS write, because it's pretty common for people to think
>>> that a PostgreSQL bug is to blame when in fact it's the operating
>>> system that isn't servicing our I/O requests very quickly.
>>
>> Could that also cover waiting on network?
>
> Possibly. My approach requires that the number of wait states be kept
> relatively small, ideally fitting in a single byte. And it also
> requires that we insert pgstat_report_waiting() calls around the thing
> that is notionally blocking. So, if there are a small number of
> places in the code where we do network I/O, we could stick those calls
> around those places, and this would work just fine. But if a foreign
> data wrapper, or any other piece of code, does network I/O - or any
> other blocking operation - without calling pgstat_report_waiting(), we
> just won't know about it.
The idea of fitting wait information into a single byte and avoiding both
locking and atomic operations is attractive.
But how far can we go with it?
Could a DBA draw any conclusion by querying pg_stat_activity a single
time, or even twice?
In order to draw a conclusion about system load, one has to run a daemon
or background worker that continuously samples the current wait events.
Sampling the current wait event at a high rate also imposes overhead on the
system, just as locking or atomic operations would.
Checking whether a backend is stuck isn't easy either. If you don't expose
how long the current wait event has been lasting, it's hard to distinguish
being stuck on a particular lock from high concurrency on that lock type.
I can propose the following:
1) Expose more information about the current lock to the user. For
instance, given the duration of the current wait event, the user can
determine whether a backend is getting stuck on a particular event without
sampling.
2) Accumulate per-backend statistics about each wait event type: the number
of occurrences and the total duration. With these statistics the user can
identify system bottlenecks, again without sampling. (A rough sketch of the
bookkeeping this implies follows below.)
Item #2 will be provided as a separate patch.
Item #1 requires a different concurrency model; ldus will extract it from
the "waits monitoring" patch shortly.
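A minimal standalone sketch of the bookkeeping item #2 implies around every
wait (the array sizes and names are made up, and a real patch would
presumably keep these counters in shared memory rather than backend-local
arrays):

    #include <stdint.h>
    #include <sys/time.h>

    #define N_EVENT_TYPES 32               /* hypothetical number of types */

    static uint64_t wait_count[N_EVENT_TYPES]; /* occurrences per type */
    static uint64_t wait_usec[N_EVENT_TYPES];  /* total duration per type */
    static struct timeval wait_start;

    static void wait_begin(void)
    {
        gettimeofday(&wait_start, NULL);   /* first timer call per wait */
    }

    static void wait_end(int event)
    {
        struct timeval now;

        gettimeofday(&now, NULL);          /* second timer call per wait */
        wait_usec[event] += (uint64_t) (now.tv_sec - wait_start.tv_sec) * 1000000
                          + (now.tv_usec - wait_start.tv_usec);
        wait_count[event]++;
    }

Note that this costs two timer calls per wait event, which is the main
overhead question with this approach.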
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Jul 10, 2015 at 10:03 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>
> On Fri, Jun 26, 2015 at 6:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Jun 25, 2015 at 9:23 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> > On 6/22/15 1:37 PM, Robert Haas wrote:
>> >> Currently, the only time we report a process as waiting is when it is
>> >> waiting for a heavyweight lock. I'd like to make that somewhat more
>> >> fine-grained, by reporting the type of heavyweight lock it's awaiting
>> >> (relation, relation extension, transaction, etc.). Also, I'd like to
>> >> report when we're waiting for a lwlock, and report either the specific
>> >> fixed lwlock for which we are waiting, or else the type of lock (lock
>> >> manager lock, buffer content lock, etc.) for locks of which there is
>> >> more than one. I'm less sure about this next part, but I think we
>> >> might also want to report ourselves as waiting when we are doing an OS
>> >> read or an OS write, because it's pretty common for people to think
>> >> that a PostgreSQL bug is to blame when in fact it's the operating
>> >> system that isn't servicing our I/O requests very quickly.
>> >
>> > Could that also cover waiting on network?
>>
>> Possibly. My approach requires that the number of wait states be kept
>> relatively small, ideally fitting in a single byte. And it also
>> requires that we insert pgstat_report_waiting() calls around the thing
>> that is notionally blocking. So, if there are a small number of
>> places in the code where we do network I/O, we could stick those calls
>> around those places, and this would work just fine. But if a foreign
>> data wrapper, or any other piece of code, does network I/O - or any
>> other blocking operation - without calling pgstat_report_waiting(), we
>> just won't know about it.
>
>
> Idea of fitting wait information into single byte and avoid both locking and atomic operations is attractive.
> But how long we can go with it?
> Could DBA make some conclusion by single querying of pg_stat_activity or double querying?
>
It could be helpful in situations where a session is stuck on a
particular lock, or when you see that most of the backends are showing
a wait on the same LWLock.
> In order to make a conclusion about system load one have to run daemon or background worker which is continuously sampling current wait events.
> Sampling current wait event with high rate also gives some overhead to the system as well as locking or atomic operations.
>
The idea of sampling sounds good, but if it adds a performance
penalty on the system, then we should look into ways to avoid
it in hot paths.
> Checking if backend is stuck isn't easy as well. If you don't expose how long last wait event continues it's hard to distinguish getting stuck on particular lock and high concurrency on that lock type.
>
> I can propose following:
>
> 1) Expose more information about current lock to user. For instance, having duration of current wait event, user can determine if backend is getting > stuck on particular event without sampling.
>
For having durations, I think you need to use gettimeofday or some
similar call to calculate the wait time. That will be okay for the
cases where the wait time is longer, but it could be problematic
for the cases where the waits are very small (which could well be
the case for LWLocks).
> 2) Accumulate per backend statistics about each wait event type: number of occurrences and total duration. With this statistics user can identify system bottlenecks again without sampling.
>
> Number #2 will be provided as a separate patch.
> Number #1 require different concurrency model. ldus will extract it from "waits monitoring" patch shortly.
>
Sure, I think those should be evaluated as separate patches, and I can
look into those patches and see whether something more can be exposed
as part of this patch which can then be reused in those patches.
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Ildus Kurbangaliev
Date:
On 07/12/2015 06:53 AM, Amit Kapila wrote:
gettimeofday is already used in our patch, and it gives enough accuracy
(in microseconds), especially when an lwlock becomes a problem. Also, we
tested our implementation and it gives overhead of less than 1%
(http://www.postgresql.org/message-id/559D4729.9080704@postgrespro.ru,
testing part). We need help here with testing on other platforms. I used
gettimeofday because of the builtin module "instr_time.h", which already
gives cross-platform tested functions for measuring, but I'm planning to
make a similar implementation based on the monotonic clock_gettime
functions for more accuracy.

On Fri, Jul 10, 2015 at 10:03 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>
> On Fri, Jun 26, 2015 at 6:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Jun 25, 2015 at 9:23 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> > On 6/22/15 1:37 PM, Robert Haas wrote:
>> >> Currently, the only time we report a process as waiting is when it is
>> >> waiting for a heavyweight lock. I'd like to make that somewhat more
>> >> fine-grained, by reporting the type of heavyweight lock it's awaiting
>> >> (relation, relation extension, transaction, etc.). Also, I'd like to
>> >> report when we're waiting for a lwlock, and report either the specific
>> >> fixed lwlock for which we are waiting, or else the type of lock (lock
>> >> manager lock, buffer content lock, etc.) for locks of which there is
>> >> more than one. I'm less sure about this next part, but I think we
>> >> might also want to report ourselves as waiting when we are doing an OS
>> >> read or an OS write, because it's pretty common for people to think
>> >> that a PostgreSQL bug is to blame when in fact it's the operating
>> >> system that isn't servicing our I/O requests very quickly.
>> >
>> > Could that also cover waiting on network?
>>
>> Possibly. My approach requires that the number of wait states be kept
>> relatively small, ideally fitting in a single byte. And it also
>> requires that we insert pgstat_report_waiting() calls around the thing
>> that is notionally blocking. So, if there are a small number of
>> places in the code where we do network I/O, we could stick those calls
>> around those places, and this would work just fine. But if a foreign
>> data wrapper, or any other piece of code, does network I/O - or any
>> other blocking operation - without calling pgstat_report_waiting(), we
>> just won't know about it.
>
>
> Idea of fitting wait information into single byte and avoid both locking and atomic operations is attractive.
> But how long we can go with it?
> Could DBA make some conclusion by single querying of pg_stat_activity or
> double querying?
>
> It could be helpful in situations where a session is stuck on a
> particular lock, or when you see that most of the backends are showing
> a wait on the same LWLock.
>
> In order to make a conclusion about system load one have to run daemon or
> background worker which is continuously sampling current wait events.
> Sampling current wait event with high rate also gives some overhead to
> the system as well as locking or atomic operations.
>
> The idea of sampling sounds good, but if it adds a performance
> penalty on the system, then we should look into ways to avoid
> it in hot paths.
> Checking if backend is stuck isn't easy as well. If you don't expose how long last wait event continues it's hard to distinguish getting stuck on particular lock and high concurrency on that lock type.
>
> I can propose following:
>
> 1) Expose more information about current lock to user. For instance,
> having duration of current wait event, user can determine if backend is
> getting stuck on particular event without sampling.
>
> For having durations, I think you need to use gettimeofday or some
> similar call to calculate the wait time. That will be okay for the
> cases where the wait time is longer, but it could be problematic
> for the cases where the waits are very small (which could well be
> the case for LWLocks).
If you agree, I'll do some modifications to your patch so we can later
extend it with our other modifications. The main issue is that one variable
for all types is not enough. For flexibility in the future we need at least
two - class and event; for example class=LWLock, event=ProcArrayLock, or
class=Storage, event=READ. With this modification it is not such a big
problem to merge our patches into one. There are not so many types of
waits; they can fit into one int32 and can be read atomically too.
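Roughly, the packing could look like this (a standalone sketch; the layout
and names are only illustrative):

    #include <stdint.h>

    /* Illustrative layout only: high byte = class, low 24 bits = event.
     * A single aligned 32-bit store/load means readers never see a torn
     * class/event pair. */
    #define WAIT_CLASS_SHIFT 24
    #define WAIT_EVENT_MASK  ((1U << WAIT_CLASS_SHIFT) - 1)

    static inline uint32_t wait_make(uint8_t classid, uint32_t event)
    {
        return ((uint32_t) classid << WAIT_CLASS_SHIFT)
             | (event & WAIT_EVENT_MASK);
    }

    static inline uint8_t wait_class(uint32_t w)
    {
        return (uint8_t) (w >> WAIT_CLASS_SHIFT);
    }

    static inline uint32_t wait_event_id(uint32_t w)
    {
        return w & WAIT_EVENT_MASK;
    }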
> 2) Accumulate per backend statistics about each wait event type: number of occurrences and total duration. With this statistics user can identify system bottlenecks again without sampling.
>
> Number #2 will be provided as a separate patch.
> Number #1 require different concurrency model. ldus will extract it from "waits monitoring" patch shortly.
>
> Sure, I think those should be evaluated as separate patches, and I can
> look into those patches and see whether something more can be exposed
> as part of this patch which can then be reused in those patches.
-- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Mon, Jul 13, 2015 at 3:26 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
> gettimeofday already used in our patch and it gives enough accuracy (in
> microseconds), especially when lwlock become a problem. Also we tested
> our realization and it gives overhead less than 1%.
> (http://www.postgresql.org/message-id/559D4729.9080704@postgrespro.ru,
> testing part).
>
> On 07/12/2015 06:53 AM, Amit Kapila wrote:
>> For having durations, I think you need to use gettimeofday or some
>> similar call to calculate the wait time. That will be okay for the
>> cases where the wait time is longer, but it could be problematic
>> for the cases where the waits are very small (which could well be
>> the case for LWLocks).
I think that test is quite generic; we should test more combinations
(like using the -M prepared option, as that can stress the LWLock
machinery somewhat more) and other types of tests which can stress the
parts of the code where gettimeofday() is used in the patch.
> We need help here with testing on other platforms. I used gettimeofday
> because of the builtin module "instr_time.h" that already gives
> cross-platform tested functions for measuring, but I'm planning to make
> a similar implementation based on clock_gettime for more accuracy.
>
> If you agree, I'll do some modifications to your patch so we can later
> extend it with our other modifications. The main issue is that one
> variable for all types is not enough. For flexibility in the future we
> need at least two - class and event; for example class=LWLock,
> event=ProcArrayLock, or class=Storage, event=READ.
> 2) Accumulate per backend statistics about each wait event type: number of occurrences and total duration. With this statistics user can identify system bottlenecks again without sampling.
>
> Number #2 will be provided as a separate patch.
> Number #1 require different concurrency model. ldus will extract it from "waits monitoring" patch shortly.
>
> Sure, I think those should be evaluated as separate patches, and I can
> look into those patches and see whether something more can be exposed
> as part of this patch which can then be reused in those patches.
I have already proposed something very similar in this thread [1]
(where instead of class, I have used wait_event_type), with which
Robert doesn't agree, so before writing code it seems prudent to get
an agreement about what kind of user interface would satisfy the
requirement and be extendible in the future as well. I think it would
be better if you could highlight some points about what kind of user
interface is better (more extendible) and the reasons for the same.
[1] (Refer option-3) - http://www.postgresql.org/message-id/CAA4eK1J6Cg_jYM00nrwt4n8r78Zn4LJoqY_zU1xRzXFq+mEY3g@mail.gmail.com
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Ildus Kurbangaliev
Date:
On 07/13/2015 01:36 PM, Amit Kapila wrote:
> I have already proposed something very similar in this thread [1]
> (where instead of class, I have used wait_event_type), with which
> Robert doesn't agree, so before writing code it seems prudent to get
> an agreement about what kind of user interface would satisfy the
> requirement and be extendible in the future as well. I think it would
> be better if you could highlight some points about what kind of user
> interface is better (more extendible) and the reasons for the same.
>
> [1] (Refer option-3) - http://www.postgresql.org/message-id/CAA4eK1J6Cg_jYM00nrwt4n8r78Zn4LJoqY_zU1xRzXFq+mEY3g@mail.gmail.com

The idea of splitting into classes and events does not conflict with your
current implementation. It is not a problem to show only one value in
pg_stat_activity and the more detailed two parameters elsewhere. The basic
reason is that a DBA will want to see information grouped by class (for
example, the wait time of the whole `Storage` class).

As for the user interface, it depends on what we want monitored. In our
patch we have profiling and history: in profiling we show class, event,
wait_time and count; in history we save all the parameters of a wait.

Another problem with pg_stat_activity is that we cannot see all processes
there (the checkpointer, for example). So we need a separate view for
monitoring purposes anyway.

-- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Mon, Jul 13, 2015 at 9:19 PM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:
> On 07/13/2015 01:36 PM, Amit Kapila wrote:
>
> Another problem with pg_stat_activity is that we cannot see all processes
> there (the checkpointer, for example). So we need a separate view for
> monitoring purposes anyway.

+1

When there are many walsender processes running, maybe I'd like to see
their wait events.

Regards,

-- Fujii Masao
On Mon, Jul 6, 2015 at 10:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> According to his patch, the wait events that he was thinking to add were:
>
> + typedef enum PgCondition
> + {
> +     PGCOND_UNUSED = 0,                /* unused */
> +
> +     /* 10000 - CPU */
> +     PGCOND_CPU = 10000,               /* generic cpu operations */
> +     /* 11000 - CPU:PARSE */
> +     PGCOND_CPU_PARSE = 11000,         /* pg_parse_query */
> +     PGCOND_CPU_PARSE_ANALYZE = 11100, /* parse_analyze */
> +     /* 12000 - CPU:REWRITE */
> +     PGCOND_CPU_REWRITE = 12000,       /* pg_rewrite_query */
> +     /* 13000 - CPU:PLAN */
> +     PGCOND_CPU_PLAN = 13000,          /* pg_plan_query */
> +     /* 14000 - CPU:EXECUTE */
> +     PGCOND_CPU_EXECUTE = 14000,       /* PortalRun or PortalRunMulti */
[ etc. ]

Sorry to say it, but I think this design is a mess. Suppose we are
executing a query, and during the query we execute the ILIKE operator,
and within that we try to acquire a buffer content lock (say, to detoast
some data). So at the outermost level our state is PGCOND_CPU_EXECUTE,
and then within that we are in state PGCOND_CPU_ILIKE, and then within
that we are in state PGCOND_LWLOCK_PAGE. When we exit each of the inner
states, we've got to restore the proper outer state, or time will be
mis-attributed. Error handling has got to pop all of the items off the
stack that were added since the PG_TRY() block started, and then push on
a new state for error handling, which gets popped when the PG_TRY block
finishes.

Another problem is that some of these things are incredibly specific
(like "running the ILIKE operator") and others are extremely general
(like "executing the query"). Why does ILIKE get a code but +(int4,int4)
does not? We need some less-arbitrary way of assigning codes than what's
shown here.

Now, that's not to say there are no good ideas here. For example,
pg_stat_activity could expose a byte of state indicating which phase of
query processing is current: parse / parse analysis / rewrite / plan /
execute / none. I think that'd be a fine thing to do, and I support
doing it, although maybe not in the same patch as my original proposal.
On the flip side, I don't support trying to expose information on the
level of "which C function are we currently executing?" because I think
there's going to be absolutely no reasonable way to make that
sufficiently low-overhead, and also because I don't see any way to make
it less than nightmarishly onerous from the point of view of code
maintenance. We could expose some functions but not others, but that
seems like a mess; I think unless and until we have a better solution,
the right answer to "I need to know which C function is running in each
backend" is "that's what perf is for".

In any case, I think the main point is that Itagaki-san's proposal is
not really a proposal for wait events. It is a proposal to expose some
state about "what is the backend doing right now?", which might be
waiting or something else. I believe those things should be separated
into several separate pieces of state. It's entirely reasonable to want
to know whether we are in the parse phase or plan phase separately from
knowing whether we are waiting on an lwlock or not, because then you
could (for example) judge what percentage of your lock waits are coming
from parsing vs. what fraction are coming from planning, which somebody
might care about. Or you might care about ONLY the phase of query
processing and not about wait events at all, and then you can ignore one
column and look at the other.
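Concretely, the save/restore bookkeeping described above would force
something like this onto every such call site (a sketch; the state values
are made up, in the spirit of the PGCOND_* values):

    #include <stdint.h>

    static volatile uint8_t backend_state;  /* the single advertised byte */

    /* Made-up state numbers for illustration. */
    enum { ST_EXECUTE = 1, ST_ILIKE = 2, ST_LWLOCK_PAGE = 3 };

    static void run_ilike(void)
    {
        uint8_t saved = backend_state;  /* remember the enclosing state */

        backend_state = ST_ILIKE;
        /* ... detoasting may block, nesting ST_LWLOCK_PAGE inside us ... */
        backend_state = saved;          /* must run on *every* exit path,
                                         * including error paths */
    }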
With that proposal, those things all get munged together in a way that I
think is bound to be awkward.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jul 7, 2015 at 6:28 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Please forgive me to resend this message for some too-sad
> misspellings.
>
> # "Waiting for heavy weight locks" is somewhat confusing to spell..
>
> ===
> Hello,
>
> At Tue, 7 Jul 2015 16:27:38 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwEJwov8YwvmbbWps3Rba6kF1yf7qL3S==Oy4D=gq9YNsQ@mail.gmail.com>
>> Each backend reports its event when trying to take a lock. But
>> the reported event is never reset until next event is reported.
>> Is this OK? This means that the wait_event column keeps showing
>> the *last* event while a backend is in idle state, for example.
>> So, shouldn't we reset the reported event or report another one
>> when releasing the lock?
>
> It seems so but pg_stat_activity.waiting would indicate whether
> the event is lasting. However, .waiting reflects only the status
> of heavy-weight locks. It would be quite misleading.
>
> I think that pg_stat_activity.wait_event should be linked to
> .waiting; then .wait_event should be restricted to heavy-weight
> locks if the meaning of .waiting cannot be changed.

Yeah, that's clearly no good. It only makes sense to have wait_event
show the most recent event if waiting tells you whether the wait is
still ongoing.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 10, 2015 at 12:33 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > I can propose following: > > 1) Expose more information about current lock to user. For instance, having > duration of current wait event, user can determine if backend is getting > stuck on particular event without sampling. Although this is appealing from a functionality perspective, I predict that the overhead of it will be quite significant. We'd need to do gettimeofday() every time somebody calls pgstat_report_waiting(), and if we do that every time we (say) initiate a disk write, I think that's going to be pretty costly even on platforms where gettimeofday() is fast, let alone those where it's slow. If somebody does a sequential scan of a non-cached table, I don't care to add a gettimeofday() call for every read(). > 2) Accumulate per backend statistics about each wait event type: number of > occurrences and total duration. With this statistics user can identify > system bottlenecks again without sampling. This is even more expensive: now you've got to do TWO gettimeofday() calls per wait event, one when it starts and one when it ends. Plus, you've got to do updates to a backend-local hash table. It might be that this is tolerable for wait events that only happen in contended paths - e.g. when a lock or lwlock acquisition actually blocks, or when we decide to do a spin-delay - but I suspect it's going to stink for things that happen frequently even when things are going well, like reading and writing blocks. So the effect will either add a lot of performance overhead, or else we just can't add some of the wait events that people would like to see. I really think we should do the simple thing first. If we make this complicated and add lots of bells and whistles, it is going to be much harder to get anything committed, because there will be more things for somebody to object to. If we start with something simple, we can always improve it later, if we are confident that the design for improving it is good. The hardest thing about a proposal like this is going to be getting down the overhead to a level that is acceptable, and every expansion of the basic design that has been proposed - gathering more than one byte of information, or gathering times, or having the backend update a tracking hash - adds *significant* overhead to the design I proposed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I really think we should do the simple thing first. If we make this > complicated and add lots of bells and whistles, it is going to be much > harder to get anything committed, because there will be more things > for somebody to object to. If we start with something simple, we can > always improve it later, if we are confident that the design for > improving it is good. The hardest thing about a proposal like this is > going to be getting down the overhead to a level that is acceptable, > and every expansion of the basic design that has been proposed - > gathering more than one byte of information, or gathering times, or > having the backend update a tracking hash - adds *significant* > overhead to the design I proposed. FWIW, I entirely share Robert's opinion that adding gettimeofday() overhead in routinely-taken paths is likely not to be acceptable. But there's no need to base this solely on opinions. I suggest somebody try instrumenting just one hotspot in the various ways that are being proposed, and then we can actually measure what it costs, instead of guessing about that. regards, tom lane
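As a starting point for such a measurement, a toy along these lines (entirely illustrative; the real test would patch a single PostgreSQL call site) wraps one hotspot, a read() loop, with the two gettimeofday() calls being debated:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

static double
elapsed_sec(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
}

int
main(void)
{
    char buf[8192];
    struct timeval t0, t1, w0, w1;
    double wait_total = 0.0;
    long i, n = 1000000;
    int fd = open("/dev/zero", O_RDONLY);

    if (fd < 0)
        return 1;
    gettimeofday(&t0, NULL);
    for (i = 0; i < n; i++)
    {
        gettimeofday(&w0, NULL);            /* wait-start report */
        if (read(fd, buf, sizeof(buf)) < 0)
            return 1;
        gettimeofday(&w1, NULL);            /* wait-end report */
        wait_total += elapsed_sec(w0, w1);
    }
    gettimeofday(&t1, NULL);
    printf("wall %.3fs, attributed to reads %.3fs\n",
           elapsed_sec(t0, t1), wait_total);
    close(fd);
    return 0;
}

Comparing the wall-clock time of this against the same loop with the w0/w1 calls removed gives the per-call-site overhead directly.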
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Ildus Kurbangaliev
Date:
> On Jul 14, 2015, at 5:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: >> I really think we should do the simple thing first. If we make this >> complicated and add lots of bells and whistles, it is going to be much >> harder to get anything committed, because there will be more things >> for somebody to object to. If we start with something simple, we can >> always improve it later, if we are confident that the design for >> improving it is good. The hardest thing about a proposal like this is >> going to be getting down the overhead to a level that is acceptable, >> and every expansion of the basic design that has been proposed - >> gathering more than one byte of information, or gathering times, or >> having the backend update a tracking hash - adds *significant* >> overhead to the design I proposed. > > FWIW, I entirely share Robert's opinion that adding gettimeofday() > overhead in routinely-taken paths is likely not to be acceptable. > But there's no need to base this solely on opinions. I suggest somebody > try instrumenting just one hotspot in the various ways that are being > proposed, and then we can actually measure what it costs, instead > of guessing about that. > > regards, tom lane > I made a benchmark of gettimeofday(). I believe it is certainly usable for monitoring. Testing configuration: 24 cores, Intel Xeon CPU X5675@3.07Ghz RAM 24 GB 54179703 - microseconds total 2147483647 - (INT_MAX), the number of gettimeofday() calls >>> 54179703 / 2147483647.0 0.025229390256679331 Here we have the average duration of one gettimeofday in microseconds. Now we get the count of all waits in one minute (pgbench -i -s 500, 2 GB shared buffers, started with -c 96 -j 4, it is almost 100% cpu load). b1=# select sum(wait_count) from pg_stat_wait_profile; sum --------- 2113608 So, we can estimate the time spent on gettimeofday() in one minute (multiplied by two, because we are going to call it twice for each wait) >>> 2113608 * 0.025229390256679331 * 2 106650.08216327897 This is the time in microseconds that we would spend on gettimeofday() in one minute. Calculation of overhead (in percent): >>> 106650 / 60000000.0 * 100 0.17775
Attachment
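The attachment itself is not reproduced in the archive; the measurement loop described above amounts to something like the following sketch (not the actual attached benchmark, and with the iteration count scaled down from INT_MAX for convenience):

#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
    struct timeval start, stop, scratch;
    long i, n = 10000000;       /* the posted numbers used INT_MAX */
    long usec;

    gettimeofday(&start, NULL);
    for (i = 0; i < n; i++)
        gettimeofday(&scratch, NULL);   /* the call being measured */
    gettimeofday(&stop, NULL);

    usec = (stop.tv_sec - start.tv_sec) * 1000000L
         + (stop.tv_usec - start.tv_usec);
    printf("%.4f us per gettimeofday()\n", (double) usec / n);
    return 0;
}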
Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> writes: > I made a benchmark of gettimeofday(). I believe it is certainly usable for monitoring. > Testing configuration: > 24 cores, Intel Xeon CPU X5675@3.07Ghz > RAM 24 GB > 54179703 - microseconds total > 2147483647 - (INT_MAX), the number of gettimeofday() calls > >>> 54179703 / 2147483647.0 > 0.025229390256679331 > Here we have the average duration of one gettimeofday in microseconds. 25 nsec per gettimeofday() is in the same ballpark as what I measured on a new-ish machine last year: http://www.postgresql.org/message-id/flat/31856.1400021891@sss.pgh.pa.us The problem is that (a) on modern hardware that is not a small number, it's the equivalent of 100 or more instructions; and (b) the results look very much worse on less-modern hardware, particularly machines where gettimeofday requires a kernel call. regards, tom lane
On Thu, Jul 16, 2015 at 10:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> writes: >> I made a benchmark of gettimeofday(). I believe it is certainly usable for monitoring. >> Testing configuration: >> 24 cores, Intel Xeon CPU X5675@3.07Ghz >> RAM 24 GB > >> 54179703 - microseconds total >> 2147483647 - (INT_MAX), the number of gettimeofday() calls > >> >>> 54179703 / 2147483647.0 >> 0.025229390256679331 > >> Here we have the average duration of one gettimeofday in microseconds. > > 25 nsec per gettimeofday() is in the same ballpark as what I measured > on a new-ish machine last year: > http://www.postgresql.org/message-id/flat/31856.1400021891@sss.pgh.pa.us > > The problem is that (a) on modern hardware that is not a small number, > it's the equivalent of 100 or more instructions; and (b) the results > look very much worse on less-modern hardware, particularly machines > where gettimeofday requires a kernel call. Yes, we've been through this many times before. All you have to do is look at how much slower a query gets when you run EXPLAIN ANALYZE vs. when you run it without EXPLAIN ANALYZE. The slowdown there is platform-dependent, but I think it's significant even on platforms where gettimeofday is fast, like modern Linux machines. That overhead is precisely the reason why we added EXPLAIN (ANALYZE, TIMING OFF) - so that if you want to, you can see the row-count estimates without incurring the timing overhead. There is *plenty* of evidence that using gettimeofday in contexts where it may be called many times per query measurably hurts performance. It is possible that we can have an *optional feature* where timing can be turned on, but it is dead certain that turning it on unconditionally will be unacceptable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
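For reference, the escape hatch mentioned above is used as follows (the table name is just an example):

EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM pgbench_accounts;

This still reports the actual row counts per plan node but skips the per-node timing calls.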
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Peter Geoghegan
Date:
On Tue, Jul 14, 2015 at 7:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > FWIW, I entirely share Robert's opinion that adding gettimeofday() > overhead in routinely-taken paths is likely not to be acceptable. I think that it can depend on many factors. For example, the availability of vDSO support on Linux/glibc. I've heard that clock_gettime() with CLOCK_REALTIME_COARSE, or with CLOCK_MONOTONIC_COARSE can have significantly lower overhead than gettimeofday(). -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes: > I've heard that clock_gettime() with CLOCK_REALTIME_COARSE, or with > CLOCK_MONOTONIC_COARSE can have significantly lower overhead than > gettimeofday(). It can, but it also has *much* lower precision, typically 1ms or so. regards, tom lane
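A quick way to see that tradeoff on a given Linux box is to compare the reported resolutions of the two clocks (CLOCK_MONOTONIC_COARSE is Linux-specific; older glibc may need linking with -lrt):

#include <stdio.h>
#include <time.h>

int
main(void)
{
    struct timespec res;

    clock_getres(CLOCK_MONOTONIC, &res);
    printf("CLOCK_MONOTONIC:        %ld ns\n", (long) res.tv_nsec);

    clock_getres(CLOCK_MONOTONIC_COARSE, &res);
    printf("CLOCK_MONOTONIC_COARSE: %ld ns\n", (long) res.tv_nsec);
    return 0;
}

Typical output shows 1 ns for the former and a few milliseconds (the kernel tick) for the latter, matching the precision caveat above.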
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Alexander Korotkov
Date:
On Fri, Jul 17, 2015 at 6:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <pg@heroku.com> writes:
> I've heard that clock_gettime() with CLOCK_REALTIME_COARSE, or with
> CLOCK_MONOTONIC_COARSE can have significantly lower overhead than
> gettimeofday().
It can, but it also has *much* lower precision, typically 1ms or so.
I've written a simple benchmark of QueryPerformanceCounter() for Windows. The source code follows.
#include <stdio.h>
#include <windows.h>
#include <Winbase.h>

int main(int argc, char* argv[])
{
    LARGE_INTEGER start, freq, current;
    long count = 0;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    current = start;
    while (current.QuadPart < start.QuadPart + freq.QuadPart)
    {
        QueryPerformanceCounter(&current);
        count++;
    }
    printf("QueryPerformanceCounter() per second: %ld\n", count);
    return 0;
}
By contrast, my MacBook can natively run 26260236 gettimeofday() calls per second.
So the performance of PostgreSQL's instr_time.h can vary by more than an order of magnitude. It's possible that we could find systems where time measurements are even slower.
In general, there could be systems where accurate measurement of time intervals is impossible or slow. That means we should provide them with a different solution, like sampling. But does that mean we should force the majority of systems to use sampling, which is both slower and less accurate for them? Could we end up with different options for the user?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Ildus Kurbangaliev
Date:
Hello.
I did some refactoring of the previous patch. Improvements:
1) A wait is determined by class and event without affecting its atomic usage.
They are still stored in one variable. This improvement makes it possible to build
more detailed views later (waits can be grouped by class).
2) Only the active wait of each backend is visible. The pg_report_wait_end() function is called
at the end of a wait and clears it.
3) Wait name determination was optimized (the last version used loops for each of them,
and was very heavy). I added a lazy `group` field to LWLock, which is used as an index into the
lwlock names array.
4) A new wait can be added in a simpler way. For example, an individual lwlock
requires only specifying its name in the LWLock names array.
5) Added new types of waits: Storage, Network, Latch
This patch is more informative and it'll be easier to extend.
Sample:
b1=# select pid, wait_event from pg_stat_activity;
pid | wait_event
-------+------------------------------
17099 | LWLocks: BufferCleanupLock
17100 | Locks: Transaction
17101 | LWLocks: BufferPartitionLock
17102 |
17103 | Network: READ
17086 |
(6 rows)
-- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 2015-07-21 13:11:36 +0300, Ildus Kurbangaliev wrote: > > /* > * Top-level transactions are identified by VirtualTransactionIDs comprising > diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h > index cff3b99..55b0687 100644 > --- a/src/include/storage/lwlock.h > +++ b/src/include/storage/lwlock.h > @@ -58,6 +58,9 @@ typedef struct LWLock > #ifdef LOCK_DEBUG > struct PGPROC *owner; /* last exlusive owner of the lock */ > #endif > + > + /* LWLock group, initialized as -1, calculated in first acquire */ > + int group; > } LWLock; I'd very much like to avoid increasing the size of struct LWLock. We have a lot of those and I'd still like to inline them with the buffer descriptors. Why do we need a separate group and can't reuse the tranche? That might require creating a few more tranches, but ...? Andres
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Ildus Kurbangaliev
Date:
On 07/21/2015 01:18 PM, Andres Freund wrote: > On 2015-07-21 13:11:36 +0300, Ildus Kurbangaliev wrote: >> >> /* >> * Top-level transactions are identified by VirtualTransactionIDs comprising >> diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h >> index cff3b99..55b0687 100644 >> --- a/src/include/storage/lwlock.h >> +++ b/src/include/storage/lwlock.h >> @@ -58,6 +58,9 @@ typedef struct LWLock >> #ifdef LOCK_DEBUG >> struct PGPROC *owner; /* last exlusive owner of the lock */ >> #endif >> + >> + /* LWLock group, initialized as -1, calculated in first acquire */ >> + int group; >> } LWLock; > I'd very much like to avoid increasing the size of struct LWLock. We > have a lot of those and I'd still like to inline them with the buffer > descriptors. Why do we need a separate group and can't reuse the > tranche? That might require creating a few more tranches, but ...? > > Andres Do you mean moving LWLocks defined by offsets and with dynamic sizes to separate tranches? It sounds like a good option, but it will require refactoring of the current tranches. In the current implementation I tried not to change existing things. A simple solution would be using uint8 for the tranche (because we have only 3 of them now, and it will take a long time to get to 255) and uint8 for the group. In that case the size will not change. -- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Kyotaro HORIGUCHI
Date:
Hello, At Tue, 21 Jul 2015 14:28:25 +0300, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote in <55AE2CD9.4050005@postgrespro.ru> > On 07/21/2015 01:18 PM, Andres Freund wrote: > > On 2015-07-21 13:11:36 +0300, Ildus Kurbangaliev wrote: > >> /* > >> * Top-level transactions are identified by VirtualTransactionIDs > >> * comprising > >> diff --git a/src/include/storage/lwlock.h > >> b/src/include/storage/lwlock.h > >> index cff3b99..55b0687 100644 > >> --- a/src/include/storage/lwlock.h > >> +++ b/src/include/storage/lwlock.h > >> @@ -58,6 +58,9 @@ typedef struct LWLock > >> #ifdef LOCK_DEBUG > >> struct PGPROC *owner; /* last exlusive owner of the lock */ > >> #endif > >> + > >> + /* LWLock group, initialized as -1, calculated in first acquire */ > >> + int group; > >> } LWLock; > > I'd very much like to avoid increasing the size of struct LWLock. We > > have a lot of those and I'd still like to inline them with the buffer > > descriptors. Why do we need a separate group and can't reuse the > > tranche? That might require creating a few more tranches, but ...? > > > > Andres > Do you mean moving LWLocks defined by offsets and with dynamic sizes > to separate tranches? I think it is too much for the purpose. Only two new tranches and maybe one or some new members (maybe representing the group) of tranches will do, I suppose. > It sounds like a good option, but it will require refactoring of > the current tranches. In the current implementation > I tried not to change existing things. Now one of the most controversial points of this patch is the details of the implementation, which largely affect the performance drag, maybe. From the viewpoint of performance, I have some comments on the feature: - LWLockReportStat runs a linear search loop which I suppose should be avoided even if the loop count is rather small for LWLocks, as Fujii-san said upthread or elsewhere. - Currently the pg_stat_activity view is the only API, which would be a bit heavy for sampling use. It'd be appreciated to have a far lighter means to know the same information. > A simple solution would be using uint8 for the tranche (because we have only > 3 of them now, > and it will take a long time to get to 255) and uint8 for the group. In that > case the size will not change. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Ildus Kurbangaliev
Date:
On 07/22/2015 09:10 AM, Kyotaro HORIGUCHI wrote: > Hello, > > At Tue, 21 Jul 2015 14:28:25 +0300, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote in <55AE2CD9.4050005@postgrespro.ru> >> On 07/21/2015 01:18 PM, Andres Freund wrote: >>> On 2015-07-21 13:11:36 +0300, Ildus Kurbangaliev wrote: >>>> /* >>>> * Top-level transactions are identified by VirtualTransactionIDs >>>> * comprising >>>> diff --git a/src/include/storage/lwlock.h >>>> b/src/include/storage/lwlock.h >>>> index cff3b99..55b0687 100644 >>>> --- a/src/include/storage/lwlock.h >>>> +++ b/src/include/storage/lwlock.h >>>> @@ -58,6 +58,9 @@ typedef struct LWLock >>>> #ifdef LOCK_DEBUG >>>> struct PGPROC *owner; /* last exlusive owner of the lock */ >>>> #endif >>>> + >>>> + /* LWLock group, initialized as -1, calculated in first acquire */ >>>> + int group; >>>> } LWLock; >>> I'd very much like to avoid increasing the size of struct LWLock. We >>> have a lot of those and I'd still like to inline them with the buffer >>> descriptors. Why do we need a separate group and can't reuse the >>> tranche? That might require creating a few more tranches, but ...? >>> >>> Andres >> Do you mean moving LWLocks defined by offsets and with dynamic sizes >> to separate tranches? > I think it is too much for the purpose. Only two new tranches and > maybe one or some new members (maybe representing the group) of > tranches will do, I suppose. Can you explain why only two new tranches? There are 13 types of lwlocks (besides the individual ones), and we need to separate them somehow. >> It sounds like a good option, but it will require refactoring of >> the current tranches. In the current implementation >> I tried not to change existing things. > Now one of the most controversial points of this patch is the > details of the implementation, which largely affect the > performance drag, maybe. > > > From the viewpoint of performance, I have some comments on the feature: > > - LWLockReportStat runs a linear search loop which I suppose > should be avoided even if the loop count is rather small for > LWLocks, as Fujii-san said upthread or elsewhere. It runs only once, at first acquisition. In the previous patch it was much heavier. Anyway, this code will be removed if we split the main tranche into smaller tranches. > - Currently the pg_stat_activity view is the only API, which would > be a bit heavy for sampling use. It'd be appreciated to have a > far lighter means to know the same information. Yes, pg_stat_activity is just information about the current wait, and it's too heavy for sampling. The main goal of this patch was creating base structures that can be used later. -- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Kyotaro HORIGUCHI
Date:
Hello, At Wed, 22 Jul 2015 17:50:35 +0300, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote in <55AFADBB.9090203@postgrespro.ru> > On 07/22/2015 09:10 AM, Kyotaro HORIGUCHI wrote: > > Hello, > > > > At Tue, 21 Jul 2015 14:28:25 +0300, Ildus Kurbangaliev > > <i.kurbangaliev@postgrespro.ru> wrote in > > <55AE2CD9.4050005@postgrespro.ru> > >> On 07/21/2015 01:18 PM, Andres Freund wrote: > >>> I'd very much like to avoid increasing the size of struct LWLock. We > >>> have a lot of those and I'd still like to inline them with the buffer > >>> descriptors. Why do we need a separate group and can't reuse the > >>> tranche? That might require creating a few more tranches, but ...? > >>> > >>> Andres > >> Do you mean moving LWLocks defined by offsets and with dynamic sizes > >> to separate tranches? > > I think it is too much for the purpose. Only two new tranches and > > maybe one or some new members (maybe representing the group) of > > tranches will do, I suppose. > > Can you explain why only two new tranches? > There are 13 types of lwlocks (besides the individual ones), and we need > to separate them somehow. Sorry, I misunderstood tranches. Currently tranches other than the main one are used by WALInsertLocks and ReplicationOrigins. Other "dynamic locks" are defined as parts of the main LWLocks since they have the same shape as individual lwlocks. Leaving the individual locks aside, every lock group may have its own tranche if we allow lwlocks to have their own tranche even if they are in MainLWLockArray. New 13-16 tranches will be added, but there is no need to register their names in LWLOCK_GROUPS[]. After all, this array would be renamed to something like "IndividualLWLockNames" and the name-lookup can be done by the following simple steps. - If the lock is in the main tranche, look up the individual name array for its name. - Otherwise, use the name of its tranche. Does this make sense? > >> It sounds like a good option, but it will require refactoring of > >> the current tranches. In the current implementation > >> I tried not to change existing things. > > Now one of the most controversial points of this patch is the > > details of the implementation, which largely affect the > > performance drag, maybe. > > > > > > From the viewpoint of performance, I have some comments on the > > feature: > > > > - LWLockReportStat runs a linear search loop which I suppose > > should be avoided even if the loop count is rather small for > > LWLocks, as Fujii-san said upthread or elsewhere. > It runs only once, at first acquisition. In the previous patch > it was much heavier. Anyway, this code will be removed if we > split the main tranche into smaller tranches. Ah, this should be the same as what I wrote above, isn't it? > > - Currently the pg_stat_activity view is the only API, which would > > be a bit heavy for sampling use. It'd be appreciated to have a > > far lighter means to know the same information. > Yes, pg_stat_activity is just information about the current wait, > and it's too heavy for sampling. The main goal of this patch was > creating base structures that can be used later. Ok, I see it. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
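A self-contained toy of the two-step lookup described above might look like this; every identifier here is a hypothetical stand-in for the real lwlock.c structures:

#include <stdio.h>

typedef struct ToyLWLock { int tranche; } ToyLWLock;

enum { TRANCHE_MAIN = 0, TRANCHE_BUFFER_MAPPING = 1 };

static ToyLWLock   main_array[3] = {{TRANCHE_MAIN}, {TRANCHE_MAIN}, {TRANCHE_MAIN}};
static const char *individual_names[] = {"ShmemIndexLock", "OidGenLock", "XidGenLock"};
static const char *tranche_names[]    = {"<main>", "BufferMappingLocks"};

static const char *
toy_lwlock_name(const ToyLWLock *lock)
{
    /* Individual lock: its position in the main array picks the name. */
    if (lock->tranche == TRANCHE_MAIN)
        return individual_names[lock - main_array];
    /* Grouped lock: the tranche itself carries the name. */
    return tranche_names[lock->tranche];
}

int
main(void)
{
    ToyLWLock grouped = {TRANCHE_BUFFER_MAPPING};

    printf("%s\n", toy_lwlock_name(&main_array[1])); /* OidGenLock */
    printf("%s\n", toy_lwlock_name(&grouped));       /* BufferMappingLocks */
    return 0;
}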
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Kyotaro HORIGUCHI
Date:
Hi, I forgot to mention a significant point. > At Wed, 22 Jul 2015 17:50:35 +0300, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote in <55AFADBB.9090203@postgrespro.ru> > > On 07/22/2015 09:10 AM, Kyotaro HORIGUCHI wrote: > > > Hello, > > > > > > At Tue, 21 Jul 2015 14:28:25 +0300, Ildus Kurbangaliev > > > <i.kurbangaliev@postgrespro.ru> wrote in > > > <55AE2CD9.4050005@postgrespro.ru> > > >> On 07/21/2015 01:18 PM, Andres Freund wrote: > > >>> I'd very much like to avoid increasing the size of struct LWLock. We > > >>> have a lot of those and I'd still like to inline them with the buffer > > >>> descriptors. Why do we need a separate group and can't reuse the > > >>> tranche? That might require creating a few more tranches, but ...? > > >>> > > >>> Andres > > >> Do you mean moving LWLocks defined by offsets and with dynamic sizes > > >> to separate tranches? > > > I think it is too much for the purpose. Only two new tranches and > > > maybe one or some new members (maybe representing the group) of > > > tranches will do, I suppose. > > > > Can you explain why only two new tranches? > > There are 13 types of lwlocks (besides the individual ones), and we need > > to separate them somehow. > > Sorry, I misunderstood tranches. > > Currently tranches other than the main one are used by WALInsertLocks and > ReplicationOrigins. Other "dynamic locks" are defined as parts of the > main LWLocks since they have the same shape as individual > lwlocks. Leaving the individual locks aside, every lock group may have > its own tranche if we allow lwlocks to have their own tranche even if > they are in MainLWLockArray. New 13-16 tranches will be added, but there is no > need to register their names in LWLOCK_GROUPS[]. After all, this > array would be renamed to something like "IndividualLWLockNames" and the > name-lookup can be done by the following simple steps. > > - If the lock is in the main tranche, look up the individual name > array for its name. This lookup is doable by calculation, with no need to scan. > - Otherwise, use the name of its tranche. > > Does this make sense? > > > >> It sounds like a good option, but it will require refactoring of > > >> the current tranches. In the current implementation > > >> I tried not to change existing things. > > > Now one of the most controversial points of this patch is the > > > details of the implementation, which largely affect the > > > performance drag, maybe. > > > > > > > > > From the viewpoint of performance, I have some comments on the > > > feature: > > > > > > - LWLockReportStat runs a linear search loop which I suppose > > > should be avoided even if the loop count is rather small for > > > LWLocks, as Fujii-san said upthread or elsewhere. > > It runs only once, at first acquisition. In the previous patch > > it was much heavier. Anyway, this code will be removed if we > > split the main tranche into smaller tranches. > > Ah, this should be the same as what I wrote above, isn't it? > > > > - Currently the pg_stat_activity view is the only API, which would > > > be a bit heavy for sampling use. It'd be appreciated to have a > > > far lighter means to know the same information. > > Yes, pg_stat_activity is just information about the current wait, > > and it's too heavy for sampling. The main goal of this patch was > > creating base structures that can be used later. > > Ok, I see it. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On 23 June 2015 at 05:37, Robert Haas <robertmhaas@gmail.com> wrote:
When a PostgreSQL system wedges, or when it becomes dreadfully slow
for some reason, I often find myself relying on tools like strace,
gdb, or perf to figure out what is happening. This doesn't tend to
instill customers with confidence; they would like (quite
understandably) a process that doesn't require installing developer
tools on their production systems, and doesn't require a developer to
interpret the results, and perhaps even something that they could
connect up to PEM or Nagios or whatever alerting system they are
using.
There are obviously many ways that we might think about improving
things here, but what I'd like to do is try to put some better
information in pg_stat_activity, so that when a process is not
running, users can get some better information about *why* it's not
running. The basic idea is that pg_stat_activity.waiting would be
replaced by a new column pg_stat_activity.wait_event, which would
display the reason why that backend is waiting. This wouldn't be a
free-form text field, because that would be too expensive to populate.
I've not looked into the feasibility of it, but if it were also possible to have a "waiting_for" column which would store the process ID of the process that's holding a lock that this process is waiting on, then it would be possible for some smart guy to write some code which draws beautiful graphs, perhaps in Pg Admin 4 of which processes are blocking other processes. I imagine this as a chart with an icon for each process. Processes waiting on locks being released would have an arrow pointing to their blocking process, if we clicked on that blocking process we could see the query that it's running and various other properties that are existing columns in pg_stat_activity.
Obviously this is blue-skies stuff, but if we had a way to provide that information it would be a great step forward towards that.
Regards
David Rowley
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Ildus Kurbangaliev
Date:
On 07/23/2015 05:57 AM, Kyotaro HORIGUCHI wrote: > At Wed, 22 Jul 2015 17:50:35 +0300, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote in <55AFADBB.9090203@postgrespro.ru> >> On 07/22/2015 09:10 AM, Kyotaro HORIGUCHI wrote: >>> Hello, >>> >>> At Tue, 21 Jul 2015 14:28:25 +0300, Ildus Kurbangaliev >>> <i.kurbangaliev@postgrespro.ru> wrote in >>> <55AE2CD9.4050005@postgrespro.ru> >>>> On 07/21/2015 01:18 PM, Andres Freund wrote: >>>>> I'd very much like to avoid increasing the size of struct LWLock. We >>>>> have a lot of those and I'd still like to inline them with the buffer >>>>> descriptors. Why do we need a separate group and can't reuse the >>>>> tranche? That might require creating a few more tranches, but ...? >>>>> >>>>> Andres >>>> Do you mean moving LWLocks defined by offsets and with dynamic sizes >>>> to separate tranches? >>> I think it is too much for the purpose. Only two new tranches and >>> maybe one or some new members (maybe representing the group) of >>> tranches will do, I suppose. >> Can you explain why only two new tranches? >> There are 13 types of lwlocks (besides the individual ones), and we need >> to separate them somehow. > Sorry, I misunderstood tranches. > > Currently tranches other than the main one are used by WALInsertLocks and > ReplicationOrigins. Other "dynamic locks" are defined as parts of the > main LWLocks since they have the same shape as individual > lwlocks. Leaving the individual locks aside, every lock group may have > its own tranche if we allow lwlocks to have their own tranche even if > they are in MainLWLockArray. New 13-16 tranches will be added, but there is no > need to register their names in LWLOCK_GROUPS[]. After all, this > array would be renamed to something like "IndividualLWLockNames" and the > name-lookup can be done by the following simple steps. > > - If the lock is in the main tranche, look up the individual name > array for its name. > > - Otherwise, use the name of its tranche. > > Does this make sense? Yes, this is exactly how I see it too. We keep MainLWLockArray and create 16 tranches. The only problem here is that dynamic lwlocks are allocated with LWLockAssign, and in some cases we need to pass a `tranche_id` to the place where it is called (for example async.c -> slru.c) -- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Ildus Kurbangaliev
Date:
Hello. I've changed the previous patch. The `group` field in LWLock has been removed, so the size of LWLock will not increase. Instead of the `group` field I've created new tranches for LWLocks from MainLWLocksArray. This allowed removing a loop from the previous version of the patch. Now the names for LWLocks that are not individual are obtained from the corresponding tranches. ---- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Fri, Jul 24, 2015 at 12:31 AM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
Hello.
I've changed the previous patch. The `group` field in LWLock has been removed, so the size of LWLock will not increase.
Instead of the `group` field I've created new tranches for LWLocks from MainLWLocksArray.
@@ -448,6 +549,30 @@ CreateLWLocks(void)
+ /* Initialized tranche id for buffer mapping lwlocks */
+ for (id = 0; id < NUM_BUFFER_PARTITIONS; id++)
+ {
+ LWLock *lock;
+ lock = &MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + id].lock;
+ lock->tranche = LW_BUFFER_MAPPING;
+ }
+
+ /* Initialized tranche id for lock manager lwlocks */
+ for (id = 0; id < NUM_LOCK_PARTITIONS; id++)
+ {
+ LWLock *lock;
+ lock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + id].lock;
+ lock->tranche = LW_LOCK_MANAGER;
+ }
+
+ /* Initialized tranche id for predicate lock manager lwlocks */
+ for (id = 0; id < NUM_PREDICATELOCK_PARTITIONS; id++)
+ {
+ LWLock *lock;
+ lock = &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + id].lock;
+ lock->tranche = LW_PREDICATELOCK_MANAGER;
+ }
I don't think we need to separately set the tranche for the above locks; we
can easily identify them by lockid, as is done in my patch.
2.
+const char *
+pgstat_get_wait_event_name(uint8 classId, uint8 eventId)
{
..
}
I don't understand why a single WaitEventType enum is not sufficient
as in the patch provided by me and why we need to divide it into
separate structures for each kind of wait; I find that way easily
understandable.
3.
+void
+pgstat_report_wait_end()
{
}
Till now, what has been discussed is that we will not clear the
wait event until the next wait; rather, we will change waiting
to true or false. So this also doesn't seem to be in line with
what's being discussed.
In general, I think if we want to extend the patch for more wait types,
then let's do it in the way I defined them initially, as I haven't
heard anything bad in this thread about the way it was initially
designed.
"I don't mind doing it some other way, as you or someone
else proposes, but the point is that we should get sufficient
buy-in for going one way or the other; just going ahead and modifying
it to do things in a different way doesn't appear to be a good way to
proceed".
Also I think before extending we should be careful in adding such
stats collection; for example, if we have to add new stats collection in
the path below, as is done in the patch, then we need to measure performance
for read-only cases where the data doesn't fit in shared buffers, as I think
this will be used heavily in such a path:
@@ -674,6 +675,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
int nbytes;
MdfdVec *v;
+ pgstat_report_wait_start(WAIT_IO, WAIT_IO_READ);
+
TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
reln->smgr_rnode.node.spcNode,
reln->smgr_rnode.node.dbNode,
@@ -702,6 +705,8 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
nbytes,
BLCKSZ);
+ pgstat_report_wait_end();
+
I have added stats collection at a minimal set of places so that we minimize such
impacts; adding it at more places will be easier once they are proven
to be safe performance impact-wise.
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Ildus Kurbangaliev
Date:
> On Jul 24, 2015, at 7:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jul 24, 2015 at 12:31 AM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote: > Hello. > I've changed the previous patch. The `group` field in LWLock has been removed, so the size of LWLock will not increase. > Instead of the `group` field I've created new tranches for LWLocks from MainLWLocksArray. > > @@ -448,6 +549,30 @@ CreateLWLocks(void) > + /* Initialized tranche id for buffer mapping lwlocks */ > + for (id = 0; id < NUM_BUFFER_PARTITIONS; id++) > + { > + LWLock *lock; > + lock = &MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + id].lock; > + lock->tranche = LW_BUFFER_MAPPING; > + } > + > + /* Initialized tranche id for lock manager lwlocks */ > + for (id = 0; id < NUM_LOCK_PARTITIONS; id++) > + { > + LWLock *lock; > + lock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + id].lock; > + lock->tranche = LW_LOCK_MANAGER; > + } > + > + /* Initialized tranche id for predicate lock manager lwlocks */ > + for (id = 0; id < NUM_PREDICATELOCK_PARTITIONS; id++) > + { > + LWLock *lock; > + lock = &MainLWLockArray[PREDICATELOCK_MANAGER_LWLOCK_OFFSET + id].lock; > + lock->tranche = LW_PREDICATELOCK_MANAGER; > + } > > > I don't think we need to separately set the tranche for the above locks; we > can easily identify them by lockid, as is done in my patch. Tranches have been added because we need to identify the other 10 types of LWLocks from the main tranche that are allocated dynamically. These 3 can be identified much more easily, but there is no point in identifying them a different way. > > 2. > +const char * > +pgstat_get_wait_event_name(uint8 classId, uint8 eventId) > { > .. > } > > I don't understand why a single WaitEventType enum is not sufficient > as in the patch provided by me and why we need to divide it into > separate structures for each kind of wait; I find that way easily > understandable. In the current patch, if somebody adds a new lock or individual lwlock, he just adds its name to the corresponding array. With WaitEventType, he first has to know about WaitEventType, then add a new item to the enum, add a new struct to the types map, and keep in mind that something could go wrong with the offsets in the types map. ---- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Thu, Jul 23, 2015 at 3:01 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote: > Hello. > I've changed the previous patch. The `group` field in LWLock has been removed, so the size of LWLock will not increase. > Instead of the `group` field I've created new tranches for LWLocks from MainLWLocksArray. This allowed removing a loop > from the previous version of the patch. > Now the names for LWLocks that are not individual are obtained from the corresponding tranches. I think using tranches for this is good, but I'm not really keen on using two bytes for this. I think using one byte is just fine. It's OK to assume that anything up to a 4-byte word can be read atomically if it's read using a read of the same width that was used to write it. But it's not OK to assume that if somebody might do, say, a memcpy of a structure containing that 4-byte word, as pgstat_read_current_status does. That, I think, might get torn, because it's reading using 1-byte reads. You'll notice that pgstat_beshutdown_hook() uses the changecount protocol even though it's only changing a 4-byte word. There's no real need for two bytes here, so let's not do that. Just use offsets intelligently to pack it into a single byte. Also, the patch should not invent a new array similar but not quite identical to LockTagTypeNames[]. This is goofy: + if (tranche_id > 0) + result->tranche = tranche_id; + else + result->tranche = LW_USERDEFINED; If we're going to make everybody pass a tranche ID when they call LWLockAssign(), then they should have to do that always, not sometimes pass a tranche ID and otherwise pass 0, which is actually a valid tranche ID but to which you've given a weird special meaning here. I'd also name the constants differently, like LWLOCK_TRANCHE_XXX or something. LW_ is a somewhat ambiguous prefix. The idea of tranches is really that each tranche is an array of items each of which contains 1 or more lwlocks. Here you are intermingling different tranches. I guess that won't really break anything but it seems ugly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
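To spell out the tearing hazard and the protocol Robert refers to, here is a stripped-down toy of a changecount scheme; field and function names are illustrative, and the real thing (in pgstat) also needs memory barriers, which are omitted here:

#include <string.h>

typedef struct ToyBackendStatus
{
    volatile int changecount;
    int          wait_event;    /* the state being published */
    /* ... more fields ... */
} ToyBackendStatus;

/* Writer: bump the counter before and after, so it is odd mid-update. */
static void
toy_write(ToyBackendStatus *st, int wait_event)
{
    st->changecount++;
    st->wait_event = wait_event;
    st->changecount++;
}

/* Reader: retry until the same even value is seen on both sides of the
 * copy, guaranteeing the memcpy did not observe a half-written entry. */
static ToyBackendStatus
toy_read(ToyBackendStatus *st)
{
    ToyBackendStatus copy;
    int before, after;

    do
    {
        before = st->changecount;
        memcpy(&copy, st, sizeof(copy));
        after = st->changecount;
    } while (before != after || (before & 1) != 0);
    return copy;
}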
On Fri, Jul 24, 2015 at 1:12 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
>
>
> > On Jul 24, 2015, at 7:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > 2.
> > +const char *
> > +pgstat_get_wait_event_name(uint8 classId, uint8 eventId)
> > {
> > ..
> > }
> >
> > I don't understand why a single WaitEventType enum is not sufficient
> > as in the patch provided by me and why we need to divide it into
> > separate structures for each kind of wait; I find that way easily
> > understandable.
>
> In the current patch, if somebody adds a new lock or individual lwlock, he just adds
> its name to the corresponding array. With WaitEventType, he first has to
> know about WaitEventType,
He has to do that anyway: whether you go by defining individual arrays
or by having a unified WaitEventType enum for individual arrays, he has to
find out that array. Another thing is that with it you can just encapsulate
this information in one byte in the PgBackendStatus structure, rather than
using a larger number of bytes (4 bytes), and I think the function for reporting
the wait event will be much simpler.
I think it is better if we just implement the idea of tranches on top of my
patch and do the other remaining work, like setting waiting correctly,
as mentioned upthread.
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Ildus Kurbangaliev
Date:
> On Jul 24, 2015, at 10:02 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > Also, the patch should not invent a new array similar but not quite > identical to LockTagTypeNames[]. > > This is goofy: > > + if (tranche_id > 0) > + result->tranche = tranche_id; > + else > + result->tranche = LW_USERDEFINED; > > If we're going to make everybody pass a tranche ID when they call > LWLockAssign(), then they should have to do that always, not sometimes > pass a tranche ID and otherwise pass 0, which is actually a valid > tranche ID but to which you've given a weird special meaning here. > I'd also name the constants differently, like LWLOCK_TRANCHE_XXX or > something. LW_ is a somewhat ambiguous prefix. > > The idea of tranches is really that each tranche is an array of items > each of which contains 1 or more lwlocks. Here you are intermingling > different tranches. I guess that won't really break anything but it > seems ugly. Maybe it would be better to split LWLockAssign into two functions then: keep the name LWLockAssign for user-defined locks, and add another function that takes a tranche_id. > On Jul 25, 2015, at 1:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > He has to do that anyway: whether you go by defining individual arrays > or by having a unified WaitEventType enum for individual arrays, he has to > find out that array. Another thing is that with it you can just encapsulate > this information in one byte in the PgBackendStatus structure, rather than > using a larger number of bytes (4 bytes), and I think the function for reporting > the wait event will be much simpler. In my approach there is no special meaning for names. The arrays with names are located in lwlock.c and lock.c, and can be used for other things (for example tracing). One byte sounds good only for this case. We are going to extend waits monitoring, add more views, and some profiling. That's why waits have to be groupable by class. ---- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
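In signature form, the split suggested above would look roughly like this (the second function name is invented for illustration):

/* Keeps its current meaning: locks for extensions / user code, all
 * accounted to a single user-defined tranche. */
extern LWLock *LWLockAssign(void);

/* New: internal callers (slru.c, async.c, ...) tag the locks they
 * allocate with their own tranche id. */
extern LWLock *LWLockAssignTranche(int tranche_id);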
On Sat, Jul 25, 2015 at 10:30 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
> On Jul 24, 2015, at 10:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Also, the patch should not invent a new array similar but not quite
> identical to LockTagTypeNames[].
>
> This is goofy:
>
> + if (tranche_id > 0)
> + result->tranche = tranche_id;
> + else
> + result->tranche = LW_USERDEFINED;
>
> If we're going to make everybody pass a tranche ID when they call
> LWLockAssign(), then they should have to do that always, not sometimes
> pass a tranche ID and otherwise pass 0, which is actually a valid
> tranche ID but to which you've given a weird special meaning here.
> I'd also name the constants differently, like LWLOCK_TRANCHE_XXX or
> something. LW_ is a somewhat ambiguous prefix.
>
> The idea of tranches is really that each tranche is an array of items
> each of which contains 1 or more lwlocks. Here you are intermingling
> different tranches. I guess that won't really break anything but it
> seems ugly.
Maybe it would be better to split LWLockAssign into two functions then: keep the name
LWLockAssign for user-defined locks, and add another function that takes a tranche_id.
> On Jul 25, 2015, at 1:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> He has to do that anyway: whether you go by defining individual arrays
> or by having a unified WaitEventType enum for individual arrays, he has to
> find out that array. Another thing is that with it you can just encapsulate
> this information in one byte in the PgBackendStatus structure, rather than
> using a larger number of bytes (4 bytes), and I think the function for reporting
> the wait event will be much simpler.
In my approach there is no special meaning for names. The arrays with names are
located in lwlock.c and lock.c, and can be used for other things (for example
tracing). One byte sounds good only for this case.
Do you mean to say that you need more than 256 events? I am not sure
if we can add that many events without adding a performance penalty
in some path.
The original idea was proposed for one byte and the patch was written
with that in mind; now you are planning to extend it (which is okay), but
modifying it without any prior consent is slightly a matter of concern.
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Ildus Kurbangaliev
Date:
Hello. In the attached patch I've done a refactoring of tranches. The prefix for them was extended, and I've split LWLockAssign into two functions (one with a tranche and a second for user-defined LWLocks). -- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
David Rowley wrote: > I've not looked into the feasibility of it, but if it were also possible to > have a "waiting_for" column which would store the process ID of the process > that's holding a lock that this process is waiting on, then it would be > possible for some smart guy to write some code which draws beautiful > graphs, perhaps in Pg Admin 4 of which processes are blocking other > processes. I imagine this as a chart with an icon for each process. > Processes waiting on locks being released would have an arrow pointing to > their blocking process, if we clicked on that blocking process we could see > the query that it's running and various other properties that are existing > columns in pg_stat_activity. I think this is already possible, is it not? You just have to look for an identically-identified pg_locks entry with granted=true. That gives you a PID and vxid/xid. You can self-join pg_locks with that, and join to pg_stat_activity. I remember we discussed having a layer of system views on top of pg_stat_activity and pg_locks, probably defined recursively, that would show the full graph of waiters/lockers. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 27, 2015 at 2:32 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > David Rowley wrote: >> I've not looked into the feasibility of it, but if it were also possible to >> have a "waiting_for" column which would store the process ID of the process >> that's holding a lock that this process is waiting on, then it would be >> possible for some smart guy to write some code which draws beautiful >> graphs, perhaps in Pg Admin 4 of which processes are blocking other >> processes. I imagine this as a chart with an icon for each process. >> Processes waiting on locks being released would have an arrow pointing to >> their blocking process, if we clicked on that blocking process we could see >> the query that it's running and various other properties that are existing >> columns in pg_stat_activity. > > I think this is already possible, is it not? You just have to look for > an identically-identified pg_locks entry with granted=true. That gives > you a PID and vxid/xid. You can self-join pg_locks with that, and join > to pg_stat_activity. > > I remember we discussed having a layer of system views on top of > pg_stat_activity and pg_locks, probably defined recursively, that would > show the full graph of waiters/lockers. It isn't necessarily the case that A is waiting for a unique process B. It could well be the case that A wants AccessExclusiveLock and many processes hold a variety of other lock types. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Mon, Jul 27, 2015 at 2:32 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > I think this is already possible, is it not? You just have to look for > > an identically-identified pg_locks entry with granted=true. That gives > > you a PID and vxid/xid. You can self-join pg_locks with that, and join > > to pg_stat_activity. > > > > I remember we discussed having a layer of system views on top of > > pg_stat_activity and pg_locks, probably defined recursively, that would > > show the full graph of waiters/lockers. > > It isn't necessarily the case that A is waiting for a unique process > B. It could well be the case that A wants AccessExclusiveLock and > many processes hold a variety of other lock types. Sure, but I don't think this makes it impossible to figure out who's locking who. I think the only thing you need other than the data in pg_locks is the conflicts table, which is well documented. Oh, hmm, one thing missing is the ordering of the wait queue for each locked object. If process A holds RowExclusive on some object, process B wants ShareLock (stalled waiting) and process C wants AccessExclusive (also stalled waiting), who of B and C is woken up first after A releases the lock depends on order of arrival. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 27, 2015 at 2:43 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> On Mon, Jul 27, 2015 at 2:32 PM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: > >> > I think this is already possible, is it not? You just have to look for >> > an identically-identified pg_locks entry with granted=true. That gives >> > you a PID and vxid/xid. You can self-join pg_locks with that, and join >> > to pg_stat_activity. >> > >> > I remember we discussed having a layer of system views on top of >> > pg_stat_activity and pg_locks, probably defined recursively, that would >> > show the full graph of waiters/lockers. >> >> It isn't necessarily the case that A is waiting for a unique process >> B. It could well be the case that A wants AccessExclusiveLock and >> many processes hold a variety of other lock types. > > Sure, but I don't think this makes it impossible to figure out who's > locking who. I think the only thing you need other than the data in > pg_locks is the conflicts table, which is well documented. > > Oh, hmm, one thing missing is the ordering of the wait queue for each > locked object. If process A holds RowExclusive on some object, process > B wants ShareLock (stalled waiting) and process C wants AccessExclusive > (also stalled waiting), who of B and C is woken up first after A > releases the lock depends on order of arrival. Agreed - it would be nice to expose that somehow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7/27/15 1:46 PM, Robert Haas wrote: > On Mon, Jul 27, 2015 at 2:43 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Robert Haas wrote: >>> On Mon, Jul 27, 2015 at 2:32 PM, Alvaro Herrera >>> <alvherre@2ndquadrant.com> wrote: >> >>>> I think this is already possible, is it not? You just have to look for >>>> an identically-identified pg_locks entry with granted=true. That gives >>>> you a PID and vxid/xid. You can self-join pg_locks with that, and join >>>> to pg_stat_activity. >>>> >>>> I remember we discussed having a layer of system views on top of >>>> pg_stat_activity and pg_locks, probably defined recursively, that would >>>> show the full graph of waiters/lockers. >>> >>> It isn't necessarily the case that A is waiting for a unique process >>> B. It could well be the case that A wants AccessExclusiveLock and >>> many processes hold a variety of other lock types. >> >> Sure, but I don't think this makes it impossible to figure out who's >> locking who. I think the only thing you need other than the data in >> pg_locks is the conflicts table, which is well documented. >> >> Oh, hmm, one thing missing is the ordering of the wait queue for each >> locked object. If process A holds RowExclusive on some object, process >> B wants ShareLock (stalled waiting) and process C wants AccessExclusive >> (also stalled waiting), who of B and C is woken up first after A >> releases the lock depends on order of arrival. > > Agreed - it would be nice to expose that somehow. +1. It's very common to want to know who's blocking who, and not at all easy to do that today. We should at minimum have a canonical example of how to do it, but something built in would be even better. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
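For the archives, a sketch of such a canonical example: self-join pg_locks on the lock identity columns and pull the blocker's query from pg_stat_activity. As discussed above, this lists every granted holder of the same lockable object (not only conflicting modes) and says nothing about queue order:

SELECT blocked.pid        AS blocked_pid,
       blocker.pid        AS blocking_pid,
       blocker_act.query  AS blocking_query
FROM pg_locks blocked
JOIN pg_locks blocker
  ON  blocker.locktype              = blocked.locktype
  AND blocker.database      IS NOT DISTINCT FROM blocked.database
  AND blocker.relation      IS NOT DISTINCT FROM blocked.relation
  AND blocker.page          IS NOT DISTINCT FROM blocked.page
  AND blocker.tuple         IS NOT DISTINCT FROM blocked.tuple
  AND blocker.virtualxid    IS NOT DISTINCT FROM blocked.virtualxid
  AND blocker.transactionid IS NOT DISTINCT FROM blocked.transactionid
  AND blocker.classid       IS NOT DISTINCT FROM blocked.classid
  AND blocker.objid         IS NOT DISTINCT FROM blocked.objid
  AND blocker.objsubid      IS NOT DISTINCT FROM blocked.objsubid
  AND blocker.pid          <> blocked.pid
JOIN pg_stat_activity blocker_act ON blocker_act.pid = blocker.pid
WHERE NOT blocked.granted
  AND blocker.granted;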
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Heikki Linnakangas
Date:
On 07/27/2015 01:20 PM, Ildus Kurbangaliev wrote: > Hello. > In the attached patch I've done a refactoring of tranches. > The prefix for them was extended, and I've split LWLockAssign > into two > functions (one with a tranche and a second for user-defined LWLocks). This needs some work in order to be maintainable: * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync with the list of individual locks in lwlock.h. Sooner or later someone will add an LWLock and forget to update the names-array. That needs to be made less error-prone, so that the names are maintained in the same place as the #defines. Perhaps something like rmgrlist.h. * The "base" tranches are a bit funny. They all have the same array_base, pointing to MainLWLockArray. If there are e.g. 5 clog buffer locks, I would expect the T_NAME() to return "ClogBufferLocks" for all of them, and T_ID() to return numbers between 0-4. But in reality, T_ID() will return something like 55-59. Instead of passing a tranche-id to LWLockAssign(), I think it would be more clear to have a new function to allocate a contiguous block of lwlocks as a new tranche. It could then set the base correctly. * Instead of having LWLOCK_INDIVIDUAL_NAMES to name "individual" locks, how about just giving each one of them a separate tranche? * User manual needs to be updated to explain the new column in pg_stat_activity. - Heikki
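Heikki's array_base point can be illustrated with a self-contained toy; the struct merely mirrors the shape of a tranche registration (treat the details as approximate, not the actual lwlock.c definitions):

#include <stdio.h>
#include <stddef.h>

typedef struct ToyTranche
{
    const char *name;
    const char *array_base;   /* first lock the tranche describes */
    size_t      array_stride; /* spacing between locks in the array */
} ToyTranche;

typedef struct ToyLock { int dummy; } ToyLock;

static int
toy_t_id(const ToyTranche *t, const ToyLock *lock)
{
    return (int) (((const char *) lock - t->array_base) / t->array_stride);
}

int
main(void)
{
    ToyLock main_array[60];

    /* Base pointing at the whole main array: ids come out as 55..59. */
    ToyTranche whole = {"ClogBufferLocks", (char *) main_array, sizeof(ToyLock)};
    /* Base pointing at the sub-block itself: ids come out as 0..4. */
    ToyTranche block = {"ClogBufferLocks", (char *) &main_array[55], sizeof(ToyLock)};

    printf("%d vs %d\n", toy_t_id(&whole, &main_array[55]),
                         toy_t_id(&block, &main_array[55]));  /* 55 vs 0 */
    return 0;
}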
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Ildus Kurbangaliev
Date:
On 07/28/2015 10:28 PM, Heikki Linnakangas wrote: > On 07/27/2015 01:20 PM, Ildus Kurbangaliev wrote: >> Hello. >> In the attached patch I've done a refactoring of tranches. >> The prefix for them was extended, and I've split LWLockAssign >> into two >> functions (one with a tranche and a second for user-defined LWLocks). > > This needs some work in order to be maintainable: > > * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in > sync with the list of individual locks in lwlock.h. Sooner or later > someone will add an LWLock and forget to update the names-array. That > needs to be made less error-prone, so that the names are maintained in > the same place as the #defines. Perhaps something like rmgrlist.h. > > * The "base" tranches are a bit funny. They all have the same > array_base, pointing to MainLWLockArray. If there are e.g. 5 clog > buffer locks, I would expect the T_NAME() to return "ClogBufferLocks" > for all of them, and T_ID() to return numbers between 0-4. But in > reality, T_ID() will return something like 55-59. > > Instead of passing a tranche-id to LWLockAssign(), I think it would be > more clear to have a new function to allocate a contiguous block of > lwlocks as a new tranche. It could then set the base correctly. > > * Instead of having LWLOCK_INDIVIDUAL_NAMES to name "individual" > locks, how about just giving each one of them a separate tranche? > > * User manual needs to be updated to explain the new column in > pg_stat_activity. > > - Heikki > Hello. Thanks for the review. I attached a new version of the patch. It adds a new field in pg_stat_activity that shows the current wait in the backend. I've done some refactoring of the LWLocks tranche mechanism. In lwlock.c only individual and user-defined LWLocks are created; other LWLocks are created by the modules that need them. I think that is more logical (the user knows about the module, not the module about all of its users). It also simplifies LWLock acquisition. Now each individual LWLock and other groups of LWLocks have their tranche, and can be easily identified. If somebody adds a new individual LWLock and forgets to add its name, postgres will show a message. Individual LWLocks are still allocated in one array, but tranches for them point to a particular index in the main array. Sample: b1=# select pid, wait_event from pg_stat_activity; \watch 1 pid | wait_event ------+------------------------- 7722 | Storage: READ 7653 | 7723 | Network: WRITE 7725 | Network: READ 7727 | Locks: Transaction 7731 | Storage: READ 7734 | Network: READ 7735 | Storage: READ 7739 | LWLocks: WALInsertLocks 7738 | Locks: Transaction 7740 | LWLocks: BufferMgrLocks 7741 | Network: READ 7742 | Network: READ 7743 | Locks: Transaction -- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Ildus Kurbangaliev
On 08/03/2015 04:25 PM, Ildus Kurbangaliev wrote:
> On 07/28/2015 10:28 PM, Heikki Linnakangas wrote:
>> On 07/27/2015 01:20 PM, Ildus Kurbangaliev wrote:
>>> Hello.
>>> In the attached patch I've made a refactoring for tranches. The
>>> prefix for them was extended, and I've split LWLockAssign into two
>>> functions (one with a tranche and a second for user-defined LWLocks).
>>
>> This needs some work in order to be maintainable:
>>
>> * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept
>> in sync with the list of individual locks in lwlock.h. Sooner or
>> later someone will add an LWLock and forget to update the
>> names-array. That needs to be made less error-prone, so that the
>> names are maintained in the same place as the #defines. Perhaps
>> something like rmgrlist.h.
>>
>> * The "base" tranches are a bit funny. They all have the same
>> array_base, pointing to MainLWLockArray. If there are e.g. 5 clog
>> buffer locks, I would expect T_NAME() to return "ClogBufferLocks"
>> for all of them, and T_ID() to return numbers between 0-4. But in
>> reality, T_ID() will return something like 55-59.
>>
>> Instead of passing a tranche-id to LWLockAssign(), I think it would
>> be more clear to have a new function to allocate a contiguous block
>> of lwlocks as a new tranche. It could then set the base correctly.
>>
>> * Instead of having LWLOCK_INDIVIDUAL_NAMES to name "individual"
>> locks, how about just giving each one of them a separate tranche?
>>
>> * The user manual needs to be updated to explain the new column in
>> pg_stat_activity.
>>
>> - Heikki
>
> Hello. Thanks for the review. I've attached a new version of the patch.
>
> It adds a new field to pg_stat_activity that shows the current wait in
> a backend.
>
> I've done some refactoring of the LWLocks tranche mechanism. In
> lwlock.c only individual and user-defined LWLocks are created; other
> LWLocks are created by the modules that need them. I think that is
> more logical (the user knows about the module, not the module about
> all of its users). It also simplifies LWLock acquirement.
>
> Now each individual LWLock and each other group of LWLocks has its own
> tranche, and can be easily identified. If somebody adds a new
> individual LWLock and forgets to add its name, postgres will show a
> message. Individual LWLocks are still allocated in one array, but
> their tranches point to the particular index in the main array.

Attached a new version of the patch with some small fixes in the code.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
> with the list of individual locks in lwlock.h. Sooner or later someone will
> add an LWLock and forget to update the names-array. That needs to be made
> less error-prone, so that the names are maintained in the same place as the
> #defines. Perhaps something like rmgrlist.h.

This is a good idea, but it's not easy to do in the style of
rmgrlist.h, because I don't believe there's any way to define a macro
that expands to a preprocessor directive. Attached is a patch that
instead generates the list of macros from a text file, and also
generates an array inside lwlock.c with the lock names that gets used
by the Trace_lwlocks stuff where applicable.

Any objections to this solution to the problem? If not, I'd like to
go ahead and push this much. I can't test the Windows changes
locally, though, so it would be helpful if someone could check that
out.

> * Instead of having LWLOCK_INDIVIDUAL_NAMES to name "individual" locks, how
> about just giving each one of them a separate tranche?

I don't think it's good to split things up to that degree;
standardizing on one name per fixed lwlock and one per tranche
otherwise seems like a good compromise to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
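To make the generated-file approach concrete, here is a rough sketch of
what the input and output could look like. The lwlocknames.txt name
matches the file mentioned in the Windows review below, but the entry
format and the exact shape of the generated macros shown here are
assumptions for illustration, not necessarily what the posted patch
produces.

/* Hypothetical lwlocknames.txt input, one lock per line:
 *
 *     ShmemIndexLock    1
 *     OidGenLock        2
 *
 * A generated lwlocknames.h could then map each name to its slot in
 * the main array: */
#define ShmemIndexLock (&MainLWLockArray[1].lock)
#define OidGenLock     (&MainLWLockArray[2].lock)

/* ...and a generated array in lwlock.c could carry the names used by
 * the Trace_lwlocks output: */
static const char *const MainLWLockNames[] = {
	"<unassigned>",
	"ShmemIndexLock",
	"OidGenLock"
};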
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Ildus Kurbangaliev
On 08/04/2015 03:15 PM, Robert Haas wrote:
> On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
>> with the list of individual locks in lwlock.h. Sooner or later someone will
>> add an LWLock and forget to update the names-array. That needs to be made
>> less error-prone, so that the names are maintained in the same place as the
>> #defines. Perhaps something like rmgrlist.h.
>
> This is a good idea, but it's not easy to do in the style of
> rmgrlist.h, because I don't believe there's any way to define a macro
> that expands to a preprocessor directive. Attached is a patch that
> instead generates the list of macros from a text file, and also
> generates an array inside lwlock.c with the lock names that gets used
> by the Trace_lwlocks stuff where applicable.
>
> Any objections to this solution to the problem? If not, I'd like to
> go ahead and push this much. I can't test the Windows changes
> locally, though, so it would be helpful if someone could check that
> out.

In my latest patch I still have an array with names, but postgres will
show a message if somebody adds an individual LWLock and forgets to add
its name. Code generation is also a solution, and if committers support
it I'll merge it into the main patch.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Heikki Linnakangas
On 08/04/2015 03:15 PM, Robert Haas wrote:
> On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
>> with the list of individual locks in lwlock.h. Sooner or later someone will
>> add an LWLock and forget to update the names-array. That needs to be made
>> less error-prone, so that the names are maintained in the same place as the
>> #defines. Perhaps something like rmgrlist.h.
>
> This is a good idea, but it's not easy to do in the style of
> rmgrlist.h, because I don't believe there's any way to define a macro
> that expands to a preprocessor directive. Attached is a patch that
> instead generates the list of macros from a text file, and also
> generates an array inside lwlock.c with the lock names that gets used
> by the Trace_lwlocks stuff where applicable.
>
> Any objections to this solution to the problem? If not, I'd like to
> go ahead and push this much. I can't test the Windows changes
> locally, though, so it would be helpful if someone could check that
> out.

A more low-tech solution would be to do something like this in
lwlocknames.c:

static char *MainLWLockNames[NUM_INDIVIDUAL_LWLOCKS];

/* Turn a pointer to one of the LWLocks in the main array into an index */
#define NAME_LWLOCK(l, name) (MainLWLockNames[(l) - MainLWLockArray] = (name))

InitLWLockNames()
{
	NAME_LWLOCK(ShmemIndexLock, "ShmemIndexLock");
	NAME_LWLOCK(OidGenLock, "OidGenLock");
	...
}

That would not be auto-generated, so you'd need to keep that list in
sync with lwlock.h, but it would be much better than the original patch
because if you forgot to add an entry in the names-array, the numbering
of all the other locks would not go wrong. And you could have a runtime
check that complains if there's an entry missing, like Ildus did in his
latest patch.

I have no particular objection to your perl script either, though. I'll
leave it up to you.

- Heikki
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Ildus Kurbangaliev
> On Aug 4, 2015, at 4:54 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 08/04/2015 03:15 PM, Robert Haas wrote:
>> On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
>>> with the list of individual locks in lwlock.h. Sooner or later someone will
>>> add an LWLock and forget to update the names-array. That needs to be made
>>> less error-prone, so that the names are maintained in the same place as the
>>> #defines. Perhaps something like rmgrlist.h.
>>
>> This is a good idea, but it's not easy to do in the style of
>> rmgrlist.h, because I don't believe there's any way to define a macro
>> that expands to a preprocessor directive. Attached is a patch that
>> instead generates the list of macros from a text file, and also
>> generates an array inside lwlock.c with the lock names that gets used
>> by the Trace_lwlocks stuff where applicable.
>>
>> Any objections to this solution to the problem? If not, I'd like to
>> go ahead and push this much. I can't test the Windows changes
>> locally, though, so it would be helpful if someone could check that
>> out.
>
> A more low-tech solution would be to do something like this in
> lwlocknames.c:
>
> static char *MainLWLockNames[NUM_INDIVIDUAL_LWLOCKS];
>
> /* Turn a pointer to one of the LWLocks in the main array into an index */
> #define NAME_LWLOCK(l, name) (MainLWLockNames[(l) - MainLWLockArray] = (name))
>
> InitLWLockNames()
> {
> 	NAME_LWLOCK(ShmemIndexLock, "ShmemIndexLock");
> 	NAME_LWLOCK(OidGenLock, "OidGenLock");
> 	...
> }
>
> That would not be auto-generated, so you'd need to keep that list in
> sync with lwlock.h, but it would be much better than the original patch
> because if you forgot to add an entry in the names-array, the numbering
> of all the other locks would not go wrong. And you could have a runtime
> check that complains if there's an entry missing, like Ildus did in his
> latest patch.
>
> I have no particular objection to your perl script either, though. I'll
> leave it up to you.
>
> - Heikki

A new version of the patch. I used your idea with macros, and with
tranches, which allowed us to remove the array with names (they can be
written directly to the corresponding tranche).

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Aug 4, 2015 at 4:37 PM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:
> A new version of the patch. I used your idea with macros, and with
> tranches, which allowed us to remove the array with names (they can be
> written directly to the corresponding tranche).

You seem not to have addressed a few of the points I brought up here:

http://www.postgresql.org/message-id/CA+TgmoaGqhah0VTamsfaOMaE9uOrCPYSXN8hCS9=wirUPJSAhg@mail.gmail.com

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 4, 2015 at 4:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 4, 2015 at 4:37 PM, Ildus Kurbangaliev
> <i.kurbangaliev@postgrespro.ru> wrote:
>> A new version of the patch. I used your idea with macros, and with
>> tranches, which allowed us to remove the array with names (they can be
>> written directly to the corresponding tranche).
>
> You seem not to have addressed a few of the points I brought up here:
>
> http://www.postgresql.org/message-id/CA+TgmoaGqhah0VTamsfaOMaE9uOrCPYSXN8hCS9=wirUPJSAhg@mail.gmail.com

More generally, I'd like to stop smashing all the things that need to
be done here into one patch. We need to make some changes, such as the
one I proposed earlier today, to make it easier to properly identify
locks. Let's talk about how to do that and agree on the details. Then,
once that's done, let's do the main part of the work afterwards, in a
separate commit.

We're running through patch versions at light speed here, but I'm not
sure we're really building consensus around how to do things. The
actual technical work here isn't really the problem; that part is easy.
The hard part is agreeing on the details of how it should work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Ildus Kurbangaliev wrote:
> A new version of the patch. I used your idea with macros, and with
> tranches, which allowed us to remove the array with names (they can be
> written directly to the corresponding tranche).

Just a bystander here, I haven't reviewed this patch at all, but I have
two questions:

1. Have you tested this under -DEXEC_BACKEND? I wonder if those
initializations are going to work on Windows.

2. Why keep the SLRU control locks as individual locks? Surely we could
put them in the SlruCtl struct and get rid of a few individual lwlocks?

Also, I wonder if we shouldn't be doing this in two parts, one that
changes the underlying lwlock structure and another one to change
pg_stat_activity.

We have CppAsString() in c.h IIRC, which we use instead of # (we do use
## in a few places though). I wonder if that stuff has any value
anymore.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 06/25/2015 07:50 AM, Tom Lane wrote:
> To do that, we'd have to change the semantics of the 'waiting' column so
> that it becomes true for non-heavyweight-lock waits. I'm not sure whether
> that's a good idea or not; I'm afraid there may be client-side code that
> expects 'waiting' to indicate that there's a corresponding row in
> pg_locks. If we're willing to do that, then I'd be okay with
> allowing wait_status to be defined as "last thing waited for"; but the
> two points aren't separable.

Speaking as someone who writes a lot of monitoring and alerting code,
changing the meaning of the waiting column is OK as long as there's
still a boolean column named "waiting" and it means "query blocked" in
some way.

Users are used to pg_stat_activity.waiting failing to join against
pg_locks ... for one thing, there are timing issues there. So pretty
much everything I've seen uses outer joins anyway.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Aug 4, 2015 at 5:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
> > with the list of individual locks in lwlock.h. Sooner or later someone will
> > add an LWLock and forget to update the names-array. That needs to be made
> > less error-prone, so that the names are maintained in the same place as the
> > #defines. Perhaps something like rmgrlist.h.
>
> This is a good idea, but it's not easy to do in the style of
> rmgrlist.h, because I don't believe there's any way to define a macro
> that expands to a preprocessor directive. Attached is a patch that
> instead generates the list of macros from a text file, and also
> generates an array inside lwlock.c with the lock names that gets used
> by the Trace_lwlocks stuff where applicable.
>
> Any objections to this solution to the problem? If not, I'd like to
> go ahead and push this much. I can't test the Windows changes
> locally, though, so it would be helpful if someone could check that
> out.
>
Okay, I have checked it on Windows and found the below issues, which I
have fixed in the patch attached to this mail.
1. Got below error while running mkvcbuild.pl
Use of uninitialized value in numeric lt (<) at Solution.pm line 103.
Could not open src/backend/storage/lmgr/lwlocknames.h at Mkvcbuild.pm line 674.
+ if (IsNewer(
+ 'src/backend/storage/lmgr/lwlocknames.txt',
+ 'src/include/storage/lwlocknames.h'))
I think the usage of old and new filenames in IsNewer is reversed here.
2.
+ print "Generating lwlocknames.c and fmgroids.h...\n";
Here it should be lwlocknames.h instead of fmgroids.h
3.
+ if (IsNewer(
+ 'src/include/storage/lwlocknames.h',
+ 'src/backend/storage/lmgr/fmgroids.h'))
Here, it seems lwlocknames.h needs to be compared instead of fmgroids.h.
4. Got below error while running mkvcbuild.pl
Generating lwlocknames.c and fmgroids.h...
/* autogenerated from src/backend/storage/lmgr/lwlocknames.txt, do not edit */
/* there is deliberately not an #ifndef LWLOCKNAMES_H here */
rename: lwlocknames.h.tmp3180: Permission denied at generate-lwlocknames.pl line
59, <$lwlocknames> line 47.
In generate-lwlocknames.pl, the below line is causing the problem:
rename($htmp, 'lwlocknames.h') || die "rename: $htmp: $!";
The reason is that the tmp files are not closed before the rename.
Some other general observations
5.
On running perl mkvcbuild.pl on Windows, the below message is generated:
Generating lwlocknames.c and lwlocknames.h...
/* autogenerated from src/backend/storage/lmgr/lwlocknames.txt, do not edit */
/* there is deliberately not an #ifndef LWLOCKNAMES_H here */
The above message looks slightly odd: displaying it as C comments makes
it look out of place, and other similar .pl files don't generate such
messages. Having said that, it displays useful information, so it might
be okay to retain it.
6.
+
+maintainer-clean: clean
+ rm -f lwlocknames.h
Here, don't we need to clean lwlocknames.c?
On Tue, Aug 4, 2015 at 8:46 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 06/25/2015 07:50 AM, Tom Lane wrote:
>> To do that, we'd have to change the semantics of the 'waiting' column so
>> that it becomes true for non-heavyweight-lock waits. I'm not sure whether
>> that's a good idea or not; I'm afraid there may be client-side code that
>> expects 'waiting' to indicate that there's a corresponding row in
>> pg_locks. If we're willing to do that, then I'd be okay with
>> allowing wait_status to be defined as "last thing waited for"; but the
>> two points aren't separable.
>
> Speaking as someone who writes a lot of monitoring and alerting code,
> changing the meaning of the waiting column is OK as long as there's
> still a boolean column named "waiting" and it means "query blocked" in
> some way.
>
> Users are used to pg_stat_activity.waiting failing to join against
> pg_locks ... for one thing, there are timing issues there. So pretty
> much everything I've seen uses outer joins anyway.

All of that is exactly how I feel about it, too. It's not totally
painless to redefine waiting, but we're not proposing a *big* change in
semantics. The way I see it, if we change this now, some people will
need to adjust, but it won't really be a big deal. If we insist that
"waiting" is graven in stone, then in 5 years people will still be
wondering why the "waiting" column is inconsistent with the
"wait_state" column. That's not going to be a win.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Ildus Kurbangaliev
On 08/04/2015 11:47 PM, Robert Haas wrote:
> On Tue, Aug 4, 2015 at 4:37 PM, Ildus Kurbangaliev
> <i.kurbangaliev@postgrespro.ru> wrote:
>> A new version of the patch. I used your idea with macros, and with
>> tranches, which allowed us to remove the array with names (they can be
>> written directly to the corresponding tranche).
>
> You seem not to have addressed a few of the points I brought up here:
>
> http://www.postgresql.org/message-id/CA+TgmoaGqhah0VTamsfaOMaE9uOrCPYSXN8hCS9=wirUPJSAhg@mail.gmail.com

About `memcpy`: the PgBackendStatus struct already has a bunch of
multi-byte variables, so it will not be consistent anyway if somebody
wants to copy it that way. On the other hand, two bytes in this case
give less overhead, because we can avoid the offset calculations. And
as I've mentioned before, the class of the wait will be useful when
wait monitoring is extended.

The other things from that review have already been changed in the
latest patch.

On 08/04/2015 11:53 PM, Alvaro Herrera wrote:
> Just a bystander here, I haven't reviewed this patch at all, but I have
> two questions:
>
> 1. Have you tested this under -DEXEC_BACKEND? I wonder if those
> initializations are going to work on Windows.

No, it wasn't tested on Windows.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Aug 5, 2015 at 1:10 PM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:
> About `memcpy`: the PgBackendStatus struct already has a bunch of
> multi-byte variables, so it will not be consistent anyway if somebody
> wants to copy it that way. On the other hand, two bytes in this case
> give less overhead, because we can avoid the offset calculations. And
> as I've mentioned before, the class of the wait will be useful when
> wait monitoring is extended.

You're missing the point. Those multi-byte fields have additional
synchronization requirements, as I explained in some detail in my
previous email. You can't just wave that away. When people raise
points in review, you need to respond to those with discussion, not
just blast out a new patch version that may or may not have made some
changes in that area. Otherwise, you're wasting the time of the people
who are reviewing, which is not nice.

>> 1. Have you tested this under -DEXEC_BACKEND? I wonder if those
>> initializations are going to work on Windows.
>
> No, it wasn't tested on Windows.

I don't think it's going to work on Windows.
CreateSharedMemoryAndSemaphores() is called once only from the
postmaster on non-EXEC_BACKEND builds, but on EXEC_BACKEND builds
(i.e. Windows) it's called for every process startup. Thus, it's got
to be idempotent: if the shared memory structures it's looking for
already exist, it must not try to recreate them. You have, for
example, InitBufferPool() calling LWLockCreateTranche(), which
unconditionally assigns a new tranche ID. It can't do that; all of
the processes in the system have to agree on what the tranche IDs are.

The general problem that I see with splitting apart the main lwlock
array into many tranches is that all of those tranches need to get set
up properly - with matching tranche IDs - in every backend. In
non-EXEC_BACKEND builds, that's basically free, but on EXEC_BACKEND
builds it isn't. I think that's OK; this won't be the first piece of
state where EXEC_BACKEND builds incur some extra overhead. But we
should make an effort to keep that overhead small.

The way to do that, I think, is to store some state in shared memory
that, in EXEC_BACKEND builds, will allow new postmaster children to
correctly re-register all of the tranches. It seems to me that we can
do this as follows:

1. Have a compiled-in array of built-in tranche names.

2. In LWLockShmemSize(), work out the number of lwlocks each tranche
should contain.

3. In CreateLWLocks(), if IsUnderPostmaster, grab enough shared memory
for all the lwlocks themselves plus enough extra shared memory for one
pointer per tranche. Store pointers to the start of each tranche in
shared memory, and initialize all the lwlocks.

4. In CreateLWLocks(), if tranche registration has not yet been done
(either because we are the postmaster, or because this is the
EXEC_BACKEND case) loop over the array of built-in tranche names and
register each one with the corresponding address grabbed from shared
memory.

A more radical redesign would be to have each tranche of LWLocks as a
separate chunk of shared memory, registered with ShmemInitStruct(), and
let EXEC_BACKEND children find those chunks by name using ShmemIndex.
But that seems like more work to me for not much benefit, especially
since there's this weird thing where lwlocks are initialized before
ShmemIndex gets set up.
Yet another possible approach is to have each module register its own
tranche and track its own tranche ID using the same kind of strategy
that replication origins and WAL insert locks already employ. That may
well be the right long-term strategy, but it seems like sort of a pain
to expend all that effort right now for this project, so I'm inclined
to hack on the approach described above and see if I can get that
working for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
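To illustrate steps 1 and 4 of the plan above, here is a rough sketch
under the 9.5-era tranche API (LWLockTranche with name, array_base and
array_stride, plus LWLockRegisterTranche). The tranche names, the
TrancheBases array, and the function itself are invented for
illustration; the real patch may differ.

/* Step 1: compiled-in array of built-in tranche names (hypothetical). */
static const char *const BuiltinTrancheNames[] =
{
	"clog",
	"subtrans",
	"multixact_offset"
};

#define NUM_BUILTIN_TRANCHES lengthof(BuiltinTrancheNames)

static LWLockTranche BuiltinTranches[NUM_BUILTIN_TRANCHES];

/*
 * Step 4: (re)register the built-in tranches.  Runs in the postmaster
 * and, in EXEC_BACKEND builds, again in every exec'd child.  The
 * TrancheBases array lives in shared memory and was filled in when the
 * lwlocks themselves were created, so tranche IDs and base addresses
 * agree across all processes.
 */
static void
RegisterBuiltinTranches(LWLockPadded **TrancheBases)
{
	int			i;

	for (i = 0; i < NUM_BUILTIN_TRANCHES; i++)
	{
		BuiltinTranches[i].name = BuiltinTrancheNames[i];
		BuiltinTranches[i].array_base = TrancheBases[i];
		BuiltinTranches[i].array_stride = sizeof(LWLockPadded);
		LWLockRegisterTranche(i, &BuiltinTranches[i]);
	}
}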
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Ildus Kurbangaliev
On 08/05/2015 09:33 PM, Robert Haas wrote:
> On Wed, Aug 5, 2015 at 1:10 PM, Ildus Kurbangaliev
> <i.kurbangaliev@postgrespro.ru> wrote:
>> About `memcpy`: the PgBackendStatus struct already has a bunch of
>> multi-byte variables, so it will not be consistent anyway if somebody
>> wants to copy it that way. On the other hand, two bytes in this case
>> give less overhead, because we can avoid the offset calculations. And
>> as I've mentioned before, the class of the wait will be useful when
>> wait monitoring is extended.
>
> You're missing the point. Those multi-byte fields have additional
> synchronization requirements, as I explained in some detail in my
> previous email. You can't just wave that away.

I see that now. Thank you for the point.

I've looked deeper, and I found PgBackendStatus to be not a suitable
place for keeping information about low level waits. Really,
PgBackendStatus is used to track high level information about a
backend. This is why auxiliary processes don't have PgBackendStatus:
they don't have such information to expose. But when we come to low
level wait events, auxiliary processes are as useful for monitoring as
backends are. WAL writer, checkpointer, bgwriter etc. are using LWLocks
as well. So it is unclear why they shouldn't be monitorable.

This is why I think we shouldn't place the wait event into
PgBackendStatus. It could be placed into PGPROC, or even a separate
data structure with a different concurrency model which would be most
suitable for monitoring.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
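To make the PGPROC alternative concrete, a minimal sketch of what such a
field and its reporting could look like. The field name, the one-byte
encoding, and the helper function are all hypothetical, not taken from
any posted patch.

/* Hypothetical addition to struct PGPROC: a single wait-event slot.
 * One byte is read and written atomically on all supported platforms,
 * so plain stores suffice, and auxiliary processes (which have a
 * PGPROC but no PgBackendStatus) can report too. */
uint8		wait_event;		/* 0 = not waiting */

/* Reporting then stays as cheap as the existing pgstat_report_waiting() */
static inline void
report_wait_event(volatile PGPROC *proc, uint8 event)
{
	proc->wait_event = event;
}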
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Alexander Korotkov
On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
> On 08/05/2015 09:33 PM, Robert Haas wrote:
>> On Wed, Aug 5, 2015 at 1:10 PM, Ildus Kurbangaliev
>> <i.kurbangaliev@postgrespro.ru> wrote:
>>> About `memcpy`: the PgBackendStatus struct already has a bunch of
>>> multi-byte variables, so it will not be consistent anyway if somebody
>>> wants to copy it that way. On the other hand, two bytes in this case
>>> give less overhead, because we can avoid the offset calculations. And
>>> as I've mentioned before, the class of the wait will be useful when
>>> wait monitoring is extended.
>>
>> You're missing the point. Those multi-byte fields have additional
>> synchronization requirements, as I explained in some detail in my
>> previous email. You can't just wave that away.
>
> I see that now. Thank you for the point.
>
> I've looked deeper, and I found PgBackendStatus to be not a suitable
> place for keeping information about low level waits. Really,
> PgBackendStatus is used to track high level information about a
> backend. This is why auxiliary processes don't have PgBackendStatus:
> they don't have such information to expose. But when we come to low
> level wait events, auxiliary processes are as useful for monitoring as
> backends are. WAL writer, checkpointer, bgwriter etc. are using
> LWLocks as well. So it is unclear why they shouldn't be monitorable.
>
> This is why I think we shouldn't place the wait event into
> PgBackendStatus. It could be placed into PGPROC, or even a separate
> data structure with a different concurrency model which would be most
> suitable for monitoring.
+1 for tracking wait events not only for backends
Ildus, could you do the following?
1) Extract LWLocks refactoring into separate patch.
2) Make a patch with storing current wait event information in PGPROC.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev <
> i.kurbangaliev@postgrespro.ru> wrote:
>> This is why I think we shouldn't place the wait event into
>> PgBackendStatus. It could be placed into PGPROC, or even a separate
>> data structure with a different concurrency model which would be most
>> suitable for monitoring.

> +1 for tracking wait events not only for backends

> Ildus, could you do the following?
> 1) Extract LWLocks refactoring into separate patch.
> 2) Make a patch with storing current wait event information in PGPROC.

What will this accomplish exactly, other than making it more
complicated to make a copy of the information when we capture an
activity snapshot? You'll have to get data out of two places, which do
not have any synchronization protocol defined between them.

			regards, tom lane
On Mon, Jul 27, 2015 at 04:10:14PM -0500, Jim Nasby wrote:
> >> Sure, but I don't think this makes it impossible to figure out who's
> >> locking who. I think the only thing you need other than the data in
> >> pg_locks is the conflicts table, which is well documented.
> >>
> >> Oh, hmm, one thing missing is the ordering of the wait queue for each
> >> locked object. If process A holds RowExclusive on some object, process
> >> B wants ShareLock (stalled waiting) and process C wants AccessExclusive
> >> (also stalled waiting), which of B and C is woken up first after A
> >> releases the lock depends on order of arrival.
> >
> > Agreed - it would be nice to expose that somehow.
>
> +1. It's very common to want to know who's blocking who, and not at
> all easy to do that today. We should at minimum have a canonical
> example of how to do it, but something built in would be even better.

Coming in late here, but have you looked at my locking presentation? I
think there are examples in there:

	http://momjian.us/main/writings/pgsql/locking.pdf

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ Everyone has their own god. +
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: "andres@anarazel.de"
On 2015-08-04 23:37:08 +0300, Ildus Kurbangaliev wrote:
> diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
> index 3a58f1e..10c25cf 100644
> --- a/src/backend/access/transam/clog.c
> +++ b/src/backend/access/transam/clog.c
> @@ -457,7 +457,8 @@ CLOGShmemInit(void)
>  {
>  	ClogCtl->PagePrecedes = CLOGPagePrecedes;
>  	SimpleLruInit(ClogCtl, "CLOG Ctl", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
> -				  CLogControlLock, "pg_clog");
> +				  CLogControlLock, "pg_clog",
> +				  "CLogBufferLocks");
>  }

I'd rather just add the name "clog" (etc.) as a string once to
SimpleLruInit instead of 3 times as now.

> +/* Lock names. For monitoring purposes */
> +const char *LOCK_NAMES[] =
> +{
> +	"Relation",
> +	"RelationExtend",
> +	"Page",
> +	"Tuple",
> +	"Transaction",
> +	"VirtualTransaction",
> +	"SpeculativeToken",
> +	"Object",
> +	"Userlock",
> +	"Advisory"
> +};

Why do we need this rather than DescribeLockTag?

> +	/* Create tranches for individual LWLocks */
> +	for (i = 0; i < NUM_INDIVIDUAL_LWLOCKS; i++, tranche++)
> +	{
> +		int			id = LWLockNewTrancheId();
> +
> +		/*
> +		 * We need to be sure that generated id is equal to index
> +		 * for individual LWLocks
> +		 */
> +		Assert(id == i);
> +
> +		tranche->array_base = MainLWLockArray;
> +		tranche->array_stride = sizeof(LWLockPadded);
> +		MemSet(tranche->name, 0, LWLOCK_MAX_TRANCHE_NAME);
> +
> +		/* Initialize individual LWLock */
> +		LWLockInitialize(&MainLWLockArray[i].lock, id);
> +
> +		/* Register new tranche in tranches array */
> +		LWLockRegisterTranche(id, tranche);
> +	}
> +
> +	/* Fill individual LWLock names */
> +	InitLWLockNames();

Why a new tranche for each of these? And it can't be correct that each
has the same base?

I don't really like the tranche model as in the patch right now. I'd
rather have it in a way where we have one tranche for all the
individual lwlocks, where the tranche points to an array of names
alongside the tranche's name. And then for the others we just supply
the tranche name, but leave the name array empty, where a name can be
generated.

Greetings,

Andres Freund
On Tue, Sep 1, 2015 at 6:43 PM, andres@anarazel.de <andres@anarazel.de> wrote:
> Why a new tranche for each of these? And it can't be correct that each
> has the same base?

I complained about the same-base problem before. Apparently, that got
ignored.

> I don't really like the tranche model as in the patch right now. I'd
> rather have it in a way where we have one tranche for all the
> individual lwlocks, where the tranche points to an array of names
> alongside the tranche's name. And then for the others we just supply
> the tranche name, but leave the name array empty, where a name can be
> generated.

That's an interesting idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
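A sketch of Andres's idea, assuming the 9.5-era LWLockTranche struct
extended with a hypothetical per-lock name array; the extra field and
the display helper are assumptions, not code from any posted patch.

typedef struct LWLockTranche
{
	const char *name;				/* tranche name, e.g. "individual" */
	const char *const *lock_names;	/* hypothetical: per-lock names, or NULL */
	void	   *array_base;
	Size		array_stride;
} LWLockTranche;

/* Display code could then prefer a per-lock name, and otherwise
 * generate one from the tranche name plus the lock's index within the
 * tranche. */
static const char *
LWLockDisplayName(LWLockTranche *tranche, int id, char *buf, size_t buflen)
{
	if (tranche->lock_names != NULL)
		return tranche->lock_names[id];
	snprintf(buf, buflen, "%s %d", tranche->name, id);
	return buf;
}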
On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
>
> On 08/05/2015 09:33 PM, Robert Haas wrote:
>>
>>
>> You're missing the point. Those multi-byte fields have additional
>> synchronization requirements, as I explained in some detail in my
>> previous email. You can't just wave that away.
>
> I see that now. Thank you for the point.
>
> I've looked deeper and I found PgBackendStatus to be not a suitable
> place for keeping information about low level waits. Really, PgBackendStatus
> is used to track high level information about backend. This is why auxiliary
> processes don't have PgBackendStatus, because they don't have such information
> to expose. But when we come to the low level wait events then auxiliary
> processes are as useful for monitoring as backends are. WAL writer,
> checkpointer, bgwriter etc are using LWLocks as well. This is certainly unclear
> why they can't be monitored.
>
I think the chances of background processes getting stuck in an LWLock
are quite low compared to backends, as they do their activities
periodically. As an example, WALWriter will take WALWriteLock to write
the WAL, but actually there will never be much contention for
WALWriter. With synchronous_commit = on, the backends themselves write
the WAL, so WALWriter won't do much in that case; and with
synchronous_commit = off, backends won't write the WAL, so WALWriter
won't face any contention unless some buffers have to be written by
bgwriter or checkpointer for which WAL is not flushed, which I don't
think would lead to any contention.

I am not denying that there could be some contention in rare scenarios
for background processes, but I think tracking them is not as important
as tracking the LWLocks for backends.

Also, as we are planning to track the wait_event information in
pg_stat_activity along with other backend information, it will not make
sense to include information about background processes in this
variable, as pg_stat_activity just displays information about backend
processes.
On Fri, Aug 14, 2015 at 7:23 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
>>
>>
>> I've looked deeper and I found PgBackendStatus to be not a suitable
>> place for keeping information about low level waits. Really, PgBackendStatus
>> is used to track high level information about backend. This is why auxiliary
>> processes don't have PgBackendStatus, because they don't have such information
>> to expose. But when we come to the low level wait events then auxiliary
>> processes are as useful for monitoring as backends are. WAL writer,
>> checkpointer, bgwriter etc are using LWLocks as well. This is certainly unclear
>> why they can't be monitored.
>>
>> This is why I think we shoudn't place wait event into PgBackendStatus. It
>> could be placed into PGPROC or even separate data structure with different
>> concurrency model which would be most suitable for monitoring.
>
>
> +1 for tracking wait events not only for backends
>
> Ildus, could you do following?
> 1) Extract LWLocks refactoring into separate patch.
> 2) Make a patch with storing current wait event information in PGPROC.
>
Now that Robert has committed the standardization of lwlock names in
commit aa65de04, let us try to summarize and work on the remaining
parts of the patch. I think the next set of work is as follows:

1. Modify the tranche mechanism so that information about LWLocks can
be tracked easily. There is already some discussion and an initial
patch for this in this thread, and there doesn't seem to be much
conflict, so we can write the patch for it. I am planning to write or
modify the existing patch; unless you, Ildus, or anyone else objects
or wants to write it, let me know, to avoid duplication of work.

2. Track wait_event in pg_stat_activity. Here the main point where we
don't seem to be in complete agreement is whether to keep it as one
byte in PgBackendStatus or to use two or more bytes to store the
wait_event information; and there is another point, made by you, to
store it in PGPROC rather than in PgBackendStatus so that we can
display information about background processes as well.

As a matter of consensus, I think Tom has raised a fair point [1]
against storing this information in PGPROC, and I feel it is relatively
less important to have such information about background processes; the
reason for that is explained upthread [2]. About having more than one
byte to store information about the various wait_events: I think we
will not have more than 100 or so events to track for now. Do you
really think that at any time in the foreseeable future we will have
more than 256 important events that we would like to track?

So let us first try to build consensus on this, and then attempt to
write or modify the patch.
On Sat, Sep 12, 2015 at 8:24 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> 1. Modify the tranche mechanism so that information about LWLocks can
> be tracked easily. There is already some discussion and an initial
> patch for this in this thread, and there doesn't seem to be much
> conflict, so we can write the patch for it. I am planning to write or
> modify the existing patch; unless you, Ildus, or anyone else objects
> or wants to write it, let me know, to avoid duplication of work.

What I'd like to see happen here is two new API calls. During the
early initialization (before shared memory sizing, and during
process_shared_preload_libraries), backends in either the core code or
a loadable module can call RequestLWLockTranche(char *, int) to
request a tranche with the given name and number of locks. Then, when
shared memory is created, the core code creates a tranche which is
part of MainLWLockArray. The base of the tranche points to the first
lock in that tranche, and the tranche is automatically registered for
all subsequent backends. In EXEC_BACKEND builds, this requires
stashing the LWLockTranche and the name to which it points in shared
memory somewhere, so that exec'd backends can look at shared memory
and redo the registration. In non-EXEC_BACKEND builds the values can
just be inherited via fork. Then, we add a second API call
LookupLWTrancheByName(char *) which does just that. This gets used to
initialize backend-private pointers to the various tranches.

Besides splitting apart the main tranche into a bunch of tranches with
individual names so that we can identify lwlocks easily, this approach
makes sure that the number of locks requested by an extension matches
the number it actually gets. Today, frustratingly, an extension that
requests one number of locks and uses another may work or fail
unpredictably depending on what other extensions are loaded and what
they do. This would eliminate that nuisance: the new APIs would
obsolete RequestAddinLWLocks, which I would propose to remove.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
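As a sketch only, under the proposal above: RequestLWLockTranche and
LookupLWTrancheByName are only being proposed here and do not exist,
and the extension name, lock count, and hook function are invented. An
extension might use the two calls like this:

/* At shared_preload_libraries time, before shared memory is sized:
 * ask the core code for a named tranche of four lwlocks. */
void
_PG_init(void)
{
	RequestLWLockTranche("my_extension", 4);
}

/* Later, once shared memory exists, look the tranche up by name; the
 * lookup is where a request-vs-creation interlock can be enforced. */
static LWLockPadded *my_locks;

static void
my_shmem_startup(void)
{
	LWLockTranche *tranche = LookupLWTrancheByName("my_extension");

	my_locks = (LWLockPadded *) tranche->array_base;
}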
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Ildus Kurbangaliev
> On Sep 13, 2015, at 5:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Sep 12, 2015 at 8:24 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> 1. Modify the tranche mechanism so that information about LWLocks can
>> be tracked easily. There is already some discussion and an initial
>> patch for this in this thread, and there doesn't seem to be much
>> conflict, so we can write the patch for it. I am planning to write or
>> modify the existing patch; unless you, Ildus, or anyone else objects
>> or wants to write it, let me know, to avoid duplication of work.
>
> What I'd like to see happen here is two new API calls. During the
> early initialization (before shared memory sizing, and during
> process_shared_preload_libraries), backends in either the core code or
> a loadable module can call RequestLWLockTranche(char *, int) to
> request a tranche with the given name and number of locks. Then, when
> shared memory is created, the core code creates a tranche which is
> part of MainLWLockArray. The base of the tranche points to the first
> lock in that tranche, and the tranche is automatically registered for
> all subsequent backends. In EXEC_BACKEND builds, this requires
> stashing the LWLockTranche and the name to which it points in shared
> memory somewhere, so that exec'd backends can look at shared memory
> and redo the registration. In non-EXEC_BACKEND builds the values can
> just be inherited via fork. Then, we add a second API call
> LookupLWTrancheByName(char *) which does just that. This gets used to
> initialize backend-private pointers to the various tranches.
>
> Besides splitting apart the main tranche into a bunch of tranches with
> individual names so that we can identify lwlocks easily, this approach
> makes sure that the number of locks requested by an extension matches
> the number it actually gets. Today, frustratingly, an extension that
> requests one number of locks and uses another may work or fail
> unpredictably depending on what other extensions are loaded and what
> they do. This would eliminate that nuisance: the new APIs would
> obsolete RequestAddinLWLocks, which I would propose to remove.
>
> Thoughts?

This is pretty much the same as what my patch does. There are two API
calls (one for size determination and one for tranche creation), except
MainLWLockArray is used only for individual LWLocks.

Also, I suggest keeping RequestAddinLWLocks for backward compatibility.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Sun, Sep 13, 2015 at 11:09 AM, Ildus Kurbangaliev
<i.kurbangaliev@postgrespro.ru> wrote:
> This is pretty much the same as what my patch does. There are two API
> calls (one for size determination and one for tranche creation), except
> MainLWLockArray is used only for individual LWLocks.

It's not really the same. Your patch doesn't provide any interlock to
ensure that the number of locks requested for a particular subsystem
during shmem sizing is the same as the number actually created during
shmem setup. That's an interlock I'd really like to have.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Vladimir Borodin
> On 12 Sep 2015, at 14:05, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
>
> On 08/05/2015 09:33 PM, Robert Haas wrote:
>>
>>
>> You're missing the point. Those multi-byte fields have additional
>> synchronization requirements, as I explained in some detail in my
>> previous email. You can't just wave that away.
>
> I see that now. Thank you for the point.
>
> I've looked deeper and I found PgBackendStatus to be not a suitable
> place for keeping information about low level waits. Really, PgBackendStatus
> is used to track high level information about backend. This is why auxiliary
> processes don't have PgBackendStatus, because they don't have such information
> to expose. But when we come to the low level wait events then auxiliary
> processes are as useful for monitoring as backends are. WAL writer,
> checkpointer, bgwriter etc are using LWLocks as well. This is certainly unclear
> why they can't be monitored.
> I think the chances of background processes getting stuck in an LWLock
> are quite low compared to backends, as they do their activities
> periodically. As an example, WALWriter will take WALWriteLock to write
> the WAL, but actually there will never be much contention for
> WALWriter. With synchronous_commit = on, the backends themselves write
> the WAL, so WALWriter won't do much in that case; and with
> synchronous_commit = off, backends won't write the WAL, so WALWriter
> won't face any contention unless some buffers have to be written by
> bgwriter or checkpointer for which WAL is not flushed, which I don't
> think would lead to any contention.
WALWriter is not a good example, IMHO. And monitoring LWLocks is not
the only thing that waits monitoring brings us. Here [0] is an example
where understanding what was happening inside the startup process took
a long time and led to GDB usage. With waits monitoring I could have
done a couple of SELECTs and used oid2name to understand the reason for
the problem.

Also, we should consider that PostgreSQL has a good infrastructure for
parallelizing many auxiliary processes. Can we be sure that we will
always have exactly one WAL writer process? Perhaps some time in the
future we will need several of them, and there would be contention for
WALWriteLock between them. Perhaps WAL writer is not a good example
here either, but having multiple checkpointer or bgwriter processes in
the near future seems very likely, no?
> I am not denying that there could be some contention in rare scenarios
> for background processes, but I think tracking them is not as important
> as tracking the LWLocks for backends. Also, as we are planning to track
> the wait_event information in pg_stat_activity along with other backend
> information, it will not make sense to include information about
> background processes in this variable, as pg_stat_activity just
> displays information about backend processes.
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Alexander Korotkov
On Sat, Sep 12, 2015 at 2:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
>
> On 08/05/2015 09:33 PM, Robert Haas wrote:
>>
>>
>> You're missing the point. Those multi-byte fields have additional
>> synchronization requirements, as I explained in some detail in my
>> previous email. You can't just wave that away.
>
> I see that now. Thank you for the point.
>
> I've looked deeper and I found PgBackendStatus to be not a suitable
> place for keeping information about low level waits. Really, PgBackendStatus
> is used to track high level information about backend. This is why auxiliary
> processes don't have PgBackendStatus, because they don't have such information
> to expose. But when we come to the low level wait events then auxiliary
> processes are as useful for monitoring as backends are. WAL writer,
> checkpointer, bgwriter etc are using LWLocks as well. This is certainly unclear
> why they can't be monitored.
> I think the chances of background processes getting stuck in an LWLock
> are quite low compared to backends, as they do their activities
> periodically. As an example, WALWriter will take WALWriteLock to write
> the WAL, but actually there will never be much contention for
> WALWriter. With synchronous_commit = on, the backends themselves write
> the WAL, so WALWriter won't do much in that case; and with
> synchronous_commit = off, backends won't write the WAL, so WALWriter
> won't face any contention unless some buffers have to be written by
> bgwriter or checkpointer for which WAL is not flushed, which I don't
> think would lead to any contention.
Hmm, synchronous_commit is a per-session variable: some transactions
could run with synchronous_commit = on, but some with
synchronous_commit = off. This is a very popular feature of PostgreSQL:
achieving better performance by making non-critical transactions
asynchronous while leaving critical transactions synchronous. Thus,
contention for WALWriteLock between backends and WALWriter could be
real.
> I am not denying that there could be some contention in rare scenarios
> for background processes, but I think tracking them is not as important
> as tracking the LWLocks for backends.
I would be more careful about calling these scenarios rare. As DBMS
developers we should do our best to avoid contention for LWLocks: any
contention, not only between backends and background processes. One may
assume that high LWLock contention is a rare scenario in general; yet
here we are because we don't think so.

You claim that there couldn't be contention for WALWriteLock between
backends and WALWriter. This is unclear to me: I think there could be.
It's not a question of tracking wait events for backends versus
tracking them for background processes. I think we need to track both
in order to provide the full picture to the DBA.
> Also, as we are planning to track the wait_event information in
> pg_stat_activity along with other backend information, it will not make
> sense to include information about background processes in this
> variable, as pg_stat_activity just displays information about backend
> processes.
I'm not objecting to pg_stat_activity tracking only backend
information. But I think we should also have some other way of tracking
wait events for background processes. We should think that through
before extending pg_stat_activity, to avoid design issues later.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From: Alexander Korotkov
On Sat, Sep 12, 2015 at 3:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Aug 14, 2015 at 7:23 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
>>
>>
>> I've looked deeper and I found PgBackendStatus to be not a suitable
>> place for keeping information about low level waits. Really, PgBackendStatus
>> is used to track high level information about backend. This is why auxiliary
>> processes don't have PgBackendStatus, because they don't have such information
>> to expose. But when we come to the low level wait events then auxiliary
>> processes are as useful for monitoring as backends are. WAL writer,
>> checkpointer, bgwriter etc are using LWLocks as well. This is certainly unclear
>> why they can't be monitored.
>>
>> This is why I think we shoudn't place wait event into PgBackendStatus. It
>> could be placed into PGPROC or even separate data structure with different
>> concurrency model which would be most suitable for monitoring.
>
>
> +1 for tracking wait events not only for backends
>
> Ildus, could you do following?
> 1) Extract LWLocks refactoring into separate patch.
> 2) Make a patch with storing current wait event information in PGPROC.
> Now that Robert has committed the standardization of lwlock names in
> commit aa65de04, let us try to summarize and work on the remaining
> parts of the patch. I think the next set of work is as follows:
>
> 1. Modify the tranche mechanism so that information about LWLocks can
> be tracked easily. There is already some discussion and an initial
> patch for this in this thread, and there doesn't seem to be much
> conflict, so we can write the patch for it. I am planning to write or
> modify the existing patch; unless you, Ildus, or anyone else objects
> or wants to write it, let me know, to avoid duplication of work.
>
> 2. Track wait_event in pg_stat_activity. Here the main point where we
> don't seem to be in complete agreement is whether to keep it as one
> byte in PgBackendStatus or to use two or more bytes to store the
> wait_event information; and there is another point, made by you, to
> store it in PGPROC rather than in PgBackendStatus so that we can
> display information about background processes as well.
>
> As a matter of consensus, I think Tom has raised a fair point [1]
> against storing this information in PGPROC, and I feel it is
> relatively less important to have such information about background
> processes; the reason for that is explained upthread [2]. About
> having more than one byte to store information about the various
> wait_events: I think we will not have more than 100 or so events to
> track for now. Do you really think that at any time in the
> foreseeable future we will have more than 256 important events that
> we would like to track?
>
> So let us first try to build consensus on this, and then attempt to
> write or modify the patch.
In order to build consensus, we need a roadmap for waits monitoring.

Would a single byte in PgBackendStatus be the only way of tracking wait
events? Could we have a pluggable infrastructure for waits monitoring:
for instance, hooks for wait event begin and end?

A limit of 256 wait events is probably OK. But then we have no room for
any additional parameters of a wait event. For instance, if you notice
high contention for a buffer content lock, it would be nice to figure
out: how many blocks are involved, and which blocks? We need to track
additional parameters in order to answer such questions.
Comments about tracking only backends are in [1] and [2].
PGPROC is not the only way to track events with parameters for every
process. We could try to extend PgBackendStatus, or build some other
secondary data structure.

However, this depends on our general roadmap. If pg_stat_activity is
just the default, while full waits monitoring could be a plugin, then
that's probably OK.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
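To make the point above about wait-event parameters concrete, one
hypothetical layout; the struct, field sizes, and parameter meanings
are all invented for illustration, not a proposal from the thread's
patches.

/* One byte for the event class, one for the event itself, and room
 * for event-specific parameters, e.g. which buffer a content-lock wait
 * is on.  Note that multi-byte fields would need a synchronization
 * protocol for readers, per the earlier discussion in this thread. */
typedef struct WaitEventInfo
{
	uint8		classId;		/* e.g. LWLock, heavyweight lock, I/O */
	uint8		eventId;		/* event within the class */
	uint32		params[2];		/* e.g. relfilenode and block number */
} WaitEventInfo;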
On Sun, Sep 13, 2015 at 8:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Sep 12, 2015 at 8:24 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > 1. Modify the tranche mechanism so that information about LWLocks
> > can be tracked easily. For this already there is some discussion, ideas
> > and initial patch floated in this thread and there don't seem to be
> > many
> > conflicts, so we can write the patch for it. I am planning to write or
> > modify
> > the existing patch unless you, Ildus or anyone else has objections or wants to
> > write it, please let me know to avoid duplication of work.
>
> What I'd like to see happen here is two new API calls. During the
> early initialization (before shared memory sizing, and during
> process_shared_preload_libraries), backends in either the core code or
> a loadable module can call RequestLWLockTranche(char *, int) to
> request a tranche with the given name and number of locks.
>
Won't this new API introduce another failure mode, namely collision
of tranche names? Will it report the collision during the Request
call or during shared memory creation?
I understand that such an API could simplify the maintenance of tranches,
but won't it be slightly less user friendly, in particular due to the additional
requirement of providing a tranche name whose meaning the extension's
users might or might not understand?
> Then, when
> shared memory is created, the core code creates a tranche which is
> part of MainLWLockArray. The base of the tranche points to the first
> lock in that tranche, and the tranche is automatically registered for
> all subsequent backends.
>
I think the same has to be done for tranches outside MainLWLockArray, like WAL
or ReplicationOrigin?
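For reference, tranches outside MainLWLockArray already register themselves roughly like this - a condensed sketch based on the existing LWLockTranche API in lwlock.h; the WAL-insert-lock values are illustrative:

    /* Condensed sketch of existing tranche registration. */
    static LWLockTranche WALInsertLockTranche;
    int         tranche_id = LWLockNewTrancheId();  /* once, at shmem init */

    WALInsertLockTranche.name = "WALInsertLocks";
    WALInsertLockTranche.array_base = WALInsertLocks;   /* first lock */
    WALInsertLockTranche.array_stride = sizeof(WALInsertLockPadded);
    LWLockRegisterTranche(tranche_id, &WALInsertLockTranche);  /* per process */
    LWLockInitialize(&WALInsertLocks[i].l.lock, tranche_id);   /* for each i */

Presumably the proposal above would automate the per-process re-registration that each such tranche currently has to arrange for itself.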
> In EXEC_BACKEND builds, this requires
> stashing the LWLockTranche and the name to which it points in shared
> memory somewhere, so that exec'd backends can look at shared memory
> and redo the registration. In non-EXEC_BACKEND builds the values can
> just be inherited via fork. Then, we add a second API call
> LookupLWTrancheByName(char *) which does just that. This gets used to
> initialize backend-private pointers to the various tranches.
>
So will this also require changes in the way extensions assign
locks?
Currently they call LWLockAssign() during shared memory initialization, so
will that need to change?
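For comparison, a sketch of the two flows as discussed here - the first two calls are the existing 9.4-era API; RequestLWLockTranche and LookupLWTrancheByName are only the signatures proposed above, with the usage and return type assumed:

    /* Today: extensions draw anonymous locks from the main array. */
    /* In _PG_init(), during shared_preload_libraries processing: */
    RequestAddinLWLocks(4);
    /* Later, during shared memory initialization: */
    mylock = LWLockAssign();

    /* Proposed: a named tranche, looked up again in each backend. */
    /* In _PG_init(): */
    RequestLWLockTranche("my_extension", 4);        /* proposed API */
    /* Later, in any backend: */
    LWLockTranche *t = LookupLWTrancheByName("my_extension");   /* proposed */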
On Mon, Sep 14, 2015 at 2:25 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Sat, Sep 12, 2015 at 2:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
>
> On 08/05/2015 09:33 PM, Robert Haas wrote:
>>
>>
>> You're missing the point. Those multi-byte fields have additional
>> synchronization requirements, as I explained in some detail in my
>> previous email. You can't just wave that away.
>
> I see that now. Thank you for the point.
>
> I've looked deeper and I found PgBackendStatus to be not a suitable
> place for keeping information about low level waits. Really, PgBackendStatus
> is used to track high level information about backend. This is why auxiliary
> processes don't have PgBackendStatus, because they don't have such information
> to expose. But when we come to the low level wait events then auxiliary
> processes are as useful for monitoring as backends are. WAL writer,
> checkpointer, bgwriter etc are using LWLocks as well. It is certainly unclear
> why they can't be monitored.
I think the chances of background processes getting stuck in an LWLock are quite small compared to backends, as they do their activities periodically. As an example, WALWriter will take WALWriteLock to write the WAL, but actually there will never be much contention on WALWriter. With synchronous_commit = on, the backends themselves write the WAL, so WALWriter won't do much in that case; and with synchronous_commit = off, backends won't write the WAL, so WALWriter won't face any contention unless some buffers have to be written by bgwriter or checkpointer for which WAL is not flushed, which I don't think would lead to any contention.

Hmm, synchronous_commit is a per-session variable: some transactions could run with synchronous_commit = on, but some with synchronous_commit = off. This is a very popular feature of PostgreSQL: achieve better performance by making non-critical transactions asynchronous while leaving critical transactions synchronous. Thus, contention on WALWriteLock between backends and WALWriter could be real.
I think it is difficult to say whether that can lead to contention, given the
periodic nature of WALWriter, but I don't deny that there is a chance for
background processes to encounter contention.
I am not denying the fact that there could be some contention in rare scenarios for background processes, but I think tracking them is not as important as tracking the LWLocks for backends.

I would be more careful in calling some of these scenarios rare. As DBMS developers we should do our best to avoid contention for LWLocks: any contention, not only between backends and background processes. One may assume that high LWLock contention is a rare scenario in general. Once we're here, we don't think so, though.

You claim that there couldn't be contention for WALWriteLock between backends and WALWriter. This is unclear to me: I think there could be.
I think there would be more things than LWLocks that background processes could wait
on, and I think they are important to track, but that could be done separately
from tracking them for pg_stat_activity. For example, we have a pg_stat_bgwriter
view; can't we think of tracking bgwriter/checkpointer wait information in that
view, and similarly for other background processes we can track in other views
if any related view exists, or create a new one to track all background processes.
Tracking wait events for backends and tracking them for background processes are not opposed to each other: I think we need to track both in order to provide the full picture to the DBA.
Sure, that is good to do, but can't we do it separately in another patch?
I think in this patch let's just work on wait events for backends.
Also, as we are planning to track the wait_event information in pg_stat_activity along with other backend information, it will not make sense to include information about background processes in this column, as pg_stat_activity just displays information about backend processes.

I'm not objecting that we should track only backend information in pg_stat_activity. I think we should also have some other way of tracking wait events for background processes. We should think it out before extending pg_stat_activity, to avoid design issues later.
I think we can discuss it if you see any specific problems or want specific
things to be clarified, but sorting out the complete design of waits monitoring
before this patch could extend the scope of this patch beyond need.
With Regards,
Amit Kapila.

On Mon, Sep 14, 2015 at 3:02 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Sat, Sep 12, 2015 at 3:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Aug 14, 2015 at 7:23 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
>>
>>
>> I've looked deeper and I found PgBackendStatus to be not a suitable
>> place for keeping information about low level waits. Really, PgBackendStatus
>> is used to track high level information about backend. This is why auxiliary
>> processes don't have PgBackendStatus, because they don't have such information
>> to expose. But when we come to the low level wait events then auxiliary
>> processes are as useful for monitoring as backends are. WAL writer,
>> checkpointer, bgwriter etc are using LWLocks as well. It is certainly unclear
>> why they can't be monitored.
>>
>> This is why I think we shouldn't place wait event into PgBackendStatus. It
>> could be placed into PGPROC or even separate data structure with different
>> concurrency model which would be most suitable for monitoring.
>
>
> +1 for tracking wait events not only for backends
>
> Ildus, could you do the following?
> 1) Extract LWLocks refactoring into separate patch.
> 2) Make a patch with storing current wait event information in PGPROC.
Now as Robert has committed standardization of lwlock names in commit aa65de04, let us try to summarize and work on the remaining parts of the patch. So I think the next set of work is as follows:

1. Modify the tranche mechanism so that information about LWLocks can be tracked easily. For this there is already some discussion, ideas and an initial patch floated in this thread, and there don't seem to be many conflicts, so we can write the patch for it. I am planning to write or modify the existing patch; unless you, Ildus or anyone else has objections or wants to write it, please let me know to avoid duplication of work.

2. Track wait_event in pg_stat_activity. Now here the main point where we don't seem to be in complete agreement is whether we shall keep it as one byte in PgBackendStatus or use two or more bytes to store wait_event information, and there is another point made by you to store it in PGPROC rather than in PgBackendStatus so that we can display information about background processes as well.

Now as a matter of consensus, I think Tom has raised a fair point [1] against storing this information in PGPROC, and I feel that it is relatively less important to have such information about background processes; the reason for same is explained upthread [2]. About having more than one byte to store information about various wait_events: I think we will not have more than 100 or so events to track. Do you really think that any time in the foreseeable future we will have more than 256 important events which we would like to track? So about this, let's first try to build the consensus and then attempt to write or modify the patch.

In order to build the consensus we need a roadmap for waits monitoring. Would a single byte in PgBackendStatus be the only way of tracking wait events? Could we have pluggable infrastructure in waits monitoring: for instance, hooks for wait event begin and end?

A limit of 256 wait events is probably OK. But then we have no room for any additional parameters of a wait event. For instance, if you notice high contention on buffer content locks, it would be nice to figure out how many blocks are involved, and which blocks. We need to track additional parameters in order to answer such questions.
We can track additional parameters by default or based on some
parameter, but do you think that tracking backend wait_event
information as proposed hinders in any way the future extensions
in this area? The point is that detailed discussion of other parameters
could be better done separately, unless you think that this can block
future enhancements for waits monitoring. I see wait monitoring of the overall
database as a really valuable feature, but I am not able to see a compelling
need to sort out everything before this patch. Having said that, I am open
for discussion if you want specific things to be sorted out before moving further.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Alexander Korotkov
Date:
On Mon, Sep 14, 2015 at 2:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 14, 2015 at 3:02 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:On Sat, Sep 12, 2015 at 3:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:On Fri, Aug 14, 2015 at 7:23 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Thu, Aug 6, 2015 at 1:01 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
>>
>>
>> I've looked deeper and I found PgBackendStatus to be not a suitable
>> place for keeping information about low level waits. Really, PgBackendStatus
>> is used to track high level information about backend. This is why auxiliary
>> processes don't have PgBackendStatus, because they don't have such information
>> to expose. But when we come to the low level wait events then auxiliary
>> processes are as useful for monitoring as backends are. WAL writer,
>> checkpointer, bgwriter etc are using LWLocks as well. It is certainly unclear
>> why they can't be monitored.
>>
>> This is why I think we shouldn't place wait event into PgBackendStatus. It
>> could be placed into PGPROC or even separate data structure with different
>> concurrency model which would be most suitable for monitoring.
>
>
> +1 for tracking wait events not only for backends
>
> Ildus, could you do the following?
> 1) Extract LWLocks refactoring into separate patch.
> 2) Make a patch with storing current wait event information in PGPROC.
Now as Robert has committed standardization of lwlock names in commit aa65de04, let us try to summarize and work on the remaining parts of the patch. So I think the next set of work is as follows:

1. Modify the tranche mechanism so that information about LWLocks can be tracked easily. For this there is already some discussion, ideas and an initial patch floated in this thread, and there don't seem to be many conflicts, so we can write the patch for it. I am planning to write or modify the existing patch; unless you, Ildus or anyone else has objections or wants to write it, please let me know to avoid duplication of work.

2. Track wait_event in pg_stat_activity. Now here the main point where we don't seem to be in complete agreement is whether we shall keep it as one byte in PgBackendStatus or use two or more bytes to store wait_event information, and there is another point made by you to store it in PGPROC rather than in PgBackendStatus so that we can display information about background processes as well.

Now as a matter of consensus, I think Tom has raised a fair point [1] against storing this information in PGPROC, and I feel that it is relatively less important to have such information about background processes; the reason for same is explained upthread [2]. About having more than one byte to store information about various wait_events: I think we will not have more than 100 or so events to track. Do you really think that any time in the foreseeable future we will have more than 256 important events which we would like to track? So about this, let's first try to build the consensus and then attempt to write or modify the patch.

In order to build the consensus we need a roadmap for waits monitoring. Would a single byte in PgBackendStatus be the only way of tracking wait events? Could we have pluggable infrastructure in waits monitoring: for instance, hooks for wait event begin and end?

A limit of 256 wait events is probably OK. But then we have no room for any additional parameters of a wait event. For instance, if you notice high contention on buffer content locks, it would be nice to figure out how many blocks are involved, and which blocks. We need to track additional parameters in order to answer such questions.

We can track additional parameters by default or based on some parameter, but do you think that tracking backend wait_event information as proposed hinders in any way the future extensions in this area? The point is that detailed discussion of other parameters could be better done separately, unless you think that this can block future enhancements for waits monitoring.
Yes, I think this can block future enhancements for waits monitoring.
You're currently proposing to store the current wait event in just a single byte. And it is a single byte not because of space economy; it is so because of the concurrency design.
Isn't it natural to ask how we are going to store anything more about the current wait event? Should we store additional parameters separately from the current wait event? Does that make sense?
Or should we move the current wait event away from PgBackendStatus? If so, why place it there in the first place?
I see wait monitoring of the overall database as a really valuable feature, but I am not able to see a compelling need to sort out everything before this patch. Having said that, I am open for discussion if you want specific things to be sorted out before moving further.
I think we need to be sure that we can move further in waits monitoring without reverting this.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Alexander Korotkov
Date:
On Mon, Sep 14, 2015 at 2:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 14, 2015 at 2:25 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Sat, Sep 12, 2015 at 2:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:
>
> On 08/05/2015 09:33 PM, Robert Haas wrote:
>>
>>
>> You're missing the point. Those multi-byte fields have additional
>> synchronization requirements, as I explained in some detail in my
>> previous email. You can't just wave that away.
>
> I see that now. Thank you for the point.
>
> I've looked deeper and I found PgBackendStatus to be not a suitable
> place for keeping information about low level waits. Really, PgBackendStatus
> is used to track high level information about backend. This is why auxiliary
> processes don't have PgBackendStatus, because they don't have such information
> to expose. But when we come to the low level wait events then auxiliary
> processes are as useful for monitoring as backends are. WAL writer,
> checkpointer, bgwriter etc are using LWLocks as well. It is certainly unclear
> why they can't be monitored.
I think the chances of background processes getting stuck in an LWLock are quite small compared to backends, as they do their activities periodically. As an example, WALWriter will take WALWriteLock to write the WAL, but actually there will never be much contention on WALWriter. With synchronous_commit = on, the backends themselves write the WAL, so WALWriter won't do much in that case; and with synchronous_commit = off, backends won't write the WAL, so WALWriter won't face any contention unless some buffers have to be written by bgwriter or checkpointer for which WAL is not flushed, which I don't think would lead to any contention.

Hmm, synchronous_commit is a per-session variable: some transactions could run with synchronous_commit = on, but some with synchronous_commit = off. This is a very popular feature of PostgreSQL: achieve better performance by making non-critical transactions asynchronous while leaving critical transactions synchronous. Thus, contention on WALWriteLock between backends and WALWriter could be real.

I think it is difficult to say whether that can lead to contention, given the periodic nature of WALWriter, but I don't deny that there is a chance for background processes to encounter contention.
We don't know in advance whether there will be contention. This is why we need monitoring.
I am not denying the fact that there could be some contention in rare scenarios for background processes, but I think tracking them is not as important as tracking the LWLocks for backends.

I would be more careful in calling some of these scenarios rare. As DBMS developers we should do our best to avoid contention for LWLocks: any contention, not only between backends and background processes. One may assume that high LWLock contention is a rare scenario in general. Once we're here, we don't think so, though.

You claim that there couldn't be contention for WALWriteLock between backends and WALWriter. This is unclear to me: I think there could be.

I think there would be more things than LWLocks that background processes could wait on, and I think they are important to track, but that could be done separately from tracking them for pg_stat_activity. For example, we have a pg_stat_bgwriter view; can't we think of tracking bgwriter/checkpointer wait information in that view, and similarly for other background processes we can track in other views if any related view exists, or create a new one to track all background processes.

Tracking wait events for backends and tracking them for background processes are not opposed to each other: I think we need to track both in order to provide the full picture to the DBA.

Sure, that is good to do, but can't we do it separately in another patch? I think in this patch let's just work on wait events for backends.
Yes, but I think we should have a design for tracking wait events for every process before implementing this only for backends.
Also, as we are planning to track the wait_event information in pg_stat_activity along with other backend information, it will not make sense to include information about background processes in this column, as pg_stat_activity just displays information about backend processes.

I'm not objecting that we should track only backend information in pg_stat_activity. I think we should also have some other way of tracking wait events for background processes. We should think it out before extending pg_stat_activity, to avoid design issues later.

I think we can discuss it if you see any specific problems or want specific things to be clarified, but sorting out the complete design of waits monitoring before this patch could extend the scope of this patch beyond need.
I think we need to sort out at least this part of the design: where to store current wait event information for every process, not only backends. Otherwise, we can't be sure we're moving towards waits monitoring and not backwards.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Sep 14, 2015 at 5:32 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> In order to build the consensus we need a roadmap for waits monitoring.
> Would a single byte in PgBackendStatus be the only way of tracking wait
> events? Could we have pluggable infrastructure in waits monitoring: for
> instance, hooks for wait event begin and end?

No, it's not the only way of doing it. I proposed doing it that way because it's simple and cheap, but I'm not hell-bent on it. My basic concern here is about the cost of this. I think that the most data we can report without some kind of synchronization protocol is one 4-byte integer. If we want to report anything more than that, we're going to need something like the st_changecount protocol, or a lock, and that's going to add very significantly - and in my view unacceptably - to the cost. I care very much about having this facility be something that we can use in lots of places, even extremely frequent operations like buffer reads and contended lwlock acquisition.

I think that there may be some *kinds of waits* for which it's practical to report additional detail. For example, suppose that when a heavyweight lock wait first happens, we just report the lock type (relation, tuple, etc.), but then when the deadlock-detector timeout expires, if we're still waiting, we report the entire lock tag. Well, that's going to happen infrequently enough, and is expensive enough anyway, that the cost doesn't matter. But if, every time we read a disk block, we take a lock (or bump a changecount and do a write barrier), dump the whole block tag in there, and release the lock (or do another write barrier and bump the changecount again), that sounds kind of expensive to me. Maybe we can prove that it doesn't matter on any workload, but I doubt it. We're fighting for every cycle in some of these code paths, and there's good evidence that we're burning too many of them compared to competing products already.

I am not a big fan of hooks as a way of resolving disagreements about the design. We may find that there are places where it's useful to have hooks so that different extensions can do different things, and that is fine. But we shouldn't use that as a way of punting the difficult questions. There isn't enough common understanding here of what we're all trying to get done and why we're trying to do it in particular ways rather than in other ways to jump to the conclusion that a hook is the right answer. I'd prefer to have a nice, built-in system that everyone agrees represents a good set of trade-offs than an extensible system.

I think it's reasonable to consider reporting this data in the PGPROC using a 4-byte integer rather than reporting it through a single byte in the backend status structure. I believe that addresses the concerns about reporting from auxiliary processes, and it also allows a little more data to be reported. For anything in excess of that, I think we should think rather harder. Most likely, such additional detail should be reported only for certain types of wait events, or on a delay, or something like that, so that the core mechanism remains really, really fast.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
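To illustrate the cost difference described above, here is a sketch of the two reporting schemes - the 4-byte field name is an assumption, and the multi-byte case is condensed from the existing st_changecount protocol in pgstat.c:

    /* Cheap: a single aligned 4-byte store; readers may see a slightly
     * stale value, which is fine for this purpose. */
    MyProc->wait_event_info = wait_event_info;      /* assumed field name */

    /* Expensive: anything wider needs the st_changecount dance, roughly: */
    volatile PgBackendStatus *vbeentry = beentry;

    vbeentry->st_changecount++;         /* count becomes odd: write begins */
    /* ... copy the multi-byte payload, e.g. a whole LOCKTAG ... */
    vbeentry->st_changecount++;         /* count even again: write done */
    Assert((vbeentry->st_changecount & 1) == 0);

Readers snapshot the count before and after copying and retry if it was odd or changed, which is exactly the extra machinery, and cost, being weighed here.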
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Alexander Korotkov
Date:
On Mon, Sep 14, 2015 at 3:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Sep 14, 2015 at 5:32 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> In order to build the consensus we need a roadmap for waits monitoring.
> Would a single byte in PgBackendStatus be the only way of tracking wait
> events? Could we have pluggable infrastructure in waits monitoring: for
> instance, hooks for wait event begin and end?
No, it's not the only way of doing it. I proposed doing it that way
because it's simple and cheap, but I'm not hell-bent on it. My basic
concern here is about the cost of this. I think that the most data we
can report without some kind of synchronization protocol is one 4-byte
integer. If we want to report anything more than that, we're going to
need something like the st_changecount protocol, or a lock, and that's
going to add very significantly - and in my view unacceptably - to the
cost. I care very much about having this facility be something that
we can use in lots of places, even extremely frequent operations like
buffer reads and contended lwlock acquisition.
Yes, the major question is cost. But I think we should validate our thoughts by experiments, assuming there are more possible synchronization protocols. Ildus posted an implementation of a double-buffering approach that showed quite low cost.
I think that there may be some *kinds of waits* for which it's
practical to report additional detail. For example, suppose that when
a heavyweight lock wait first happens, we just report the lock type
(relation, tuple, etc.) but then when the deadlock detector expires,
if we're still waiting, we report the entire lock tag. Well, that's
going to happen infrequently enough, and is expensive enough anyway,
that the cost doesn't matter. But if, every time we read a disk
block, we take a lock (or bump a changecount and do a write barrier),
dump the whole block tag in there, release the lock (or do another
write barrier and bump the changecount again) that sounds kind of
expensive to me. Maybe we can prove that it doesn't matter on any
workload, but I doubt it. We're fighting for every cycle in some of
these code paths, and there's good evidence that we're burning too
many of them compared to competing products already.
Yes, but some competing products also provide comprehensive waits monitoring. That makes me think it should be possible for us too.
I am not a big fan of hooks as a way of resolving disagreements about
the design. We may find that there are places where it's useful to
have hooks so that different extensions can do different things, and
that is fine. But we shouldn't use that as a way of punting the
difficult questions. There isn't enough common understanding here of
what we're all trying to get done and why we're trying to do it in
particular ways rather than in other ways to jump to the conclusion
that a hook is the right answer. I'd prefer to have a nice, built-in
system that everyone agrees represents a good set of trade-offs than
an extensible system.
I think the reason for hooks could be not only disagreements about design, but platform-dependent issues too.
The next step after we have a view with current wait events will be gathering statistics on them. We can contrast at least two approaches here:
1) Periodic sampling of current wait events.
2) Measuring each wait event's duration. We could collect statistics locally for a short period and update a shared memory structure periodically (using some synchronization protocol).
In the previous attempt to gather lwlock statistics, you predicted that sampling could have a significant overhead [1]. In contrast, on many systems time measurements are cheap. We have implemented both approaches, and they show that sampling every 1 millisecond produces higher overhead than individual duration measurements for each wait event. We can share another version of waits monitoring based on sampling to make these results reproducible for everybody. However, cheap time measurements are not available on every platform. For instance, ISTM that on Windows time measurements are too expensive [2].
That makes me think that we need a pluggable solution, at least for statistics: direct measurement of event durations for the majority of systems, and sampling for the others as the lesser harm.
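As a sketch of the two approaches under discussion (the structure and array names are assumptions; the instr_time macros are the existing portability layer over gettimeofday() and friends):

    /* 1) Sampling: a collector wakes every N ms and counts what it sees. */
    for (i = 0; i < NumProcs; i++)
        sample_counts[procs[i].wait_event_info]++;  /* assumed layout */
    /* ... sleep N ms and repeat ... */

    /* 2) Direct measurement: wrap each wait in two clock reads and
     * accumulate locally, flushing to shared memory periodically. */
    instr_time  start,
                duration;

    INSTR_TIME_SET_CURRENT(start);
    /* ... perform the wait ... */
    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);
    local_wait_usecs[event] += INSTR_TIME_GET_MICROSEC(duration);

The trade-off is then visible directly: (1) costs nothing on the hot path but only samples, while (2) is exact but pays two clock reads per wait, which is cheap on Linux and expensive on some other platforms.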
I think it's reasonable to consider reporting this data in the PGPROC
using a 4-byte integer rather than reporting it through a single byte
in the backend status structure. I believe that addresses the
concerns about reporting from auxiliary processes, and it also allows
a little more data to be reported. For anything in excess of that, I
think we should think rather harder. Most likely, such additional
detail should be reported only for certain types of wait events, or on
a delay, or something like that, so that the core mechanism remains
really, really fast.
That sounds reasonable. There are many pending questions, but it seems like a step forward to me.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Yes, the major question is cost. But I think we should validate our thoughts
> by experiments, assuming there are more possible synchronization protocols.
> Ildus posted an implementation of a double-buffering approach that showed
> quite low cost.

I'm not sure exactly which email you are referring to, but I don't believe that anyone has done experiments that are anywhere near comprehensive enough to convince ourselves that this won't be a problem. If a particular benchmark doesn't show an issue, that can just mean that the benchmark isn't hitting the case where there is a problem. For example, EDB has had customers who have severe contention apparently on the buffer content lwlocks, resulting in big slowdowns. You don't see that in, say, a pgbench run. But for people who have certain kinds of queries, it's really bad. Those sorts of loads, where the lwlock system really gets stressed, are cases where adding overhead seems likely to pinch.

> Yes, but some competing products also provide comprehensive waits
> monitoring. That makes me think it should be possible for us too.

I agree, but keep in mind that some of those products may use techniques to reduce the overhead that we don't have available. I have a strong suspicion that one of those products in particular has done something clever to make measuring the time cheap on all platforms. Whatever that clever thing is, we haven't done it. So that matters.

> I think the reason for hooks could be not only disagreements about design,
> but platform-dependent issues too.
> The next step after we have a view with current wait events will be gathering
> statistics on them. We can contrast at least two approaches here:
> 1) Periodic sampling of current wait events.
> 2) Measuring each wait event's duration. We could collect statistics locally
> for a short period and update a shared memory structure periodically (using
> some synchronization protocol).
>
> In the previous attempt to gather lwlock statistics, you predicted that
> sampling could have a significant overhead [1]. In contrast, on many systems
> time measurements are cheap. We have implemented both approaches, and they
> show that sampling every 1 millisecond produces higher overhead than
> individual duration measurements for each wait event. We can share another
> version of waits monitoring based on sampling to make these results
> reproducible for everybody. However, cheap time measurements are not
> available on every platform. For instance, ISTM that on Windows time
> measurements are too expensive [2].
>
> That makes me think that we need a pluggable solution, at least for
> statistics: direct measurement of event durations for the majority of
> systems, and sampling for the others as the lesser harm.

To me, those seem like arguments for making it configurable, but not necessarily for having hooks.

>> I think it's reasonable to consider reporting this data in the PGPROC
>> using a 4-byte integer rather than reporting it through a single byte
>> in the backend status structure. I believe that addresses the
>> concerns about reporting from auxiliary processes, and it also allows
>> a little more data to be reported. For anything in excess of that, I
>> think we should think rather harder. Most likely, such additional
>> detail should be reported only for certain types of wait events, or on
>> a delay, or something like that, so that the core mechanism remains
>> really, really fast.
>
> That sounds reasonable. There are many pending questions, but it seems like
> a step forward to me.

Great, let's do it. I think we should probably do the work to separate the non-individual lwlocks into tranches first, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Vladimir Borodin
Date:
On 16 Sep 2015, at 20:52, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Yes, the major question is cost. But I think we should validate our thoughts by experiments, assuming there are more possible synchronization protocols. Ildus posted an implementation of a double-buffering approach that showed quite low cost.
I'm not sure exactly which email you are referring to, but I don't
believe that anyone has done experiments that are anywhere near
comprehensive enough to convince ourselves that this won't be a
problem. If a particular benchmark doesn't show an issue, that can
just mean that the benchmark isn't hitting the case where there is a
problem. For example, EDB has had customers who have severe
contention apparently on the buffer content lwlocks, resulting in big
slowdowns. You don't see that in, say, a pgbench run. But for people
who have certain kinds of queries, it's really bad. Those sort of
loads, where the lwlock system really gets stressed, are cases where
adding overhead seems likely to pinch.
Alexander and Ildus gave us two patches for REL9_4_STABLE - one with sampling every N milliseconds and one with measuring timings. We have tested both of them on two kinds of our workload, besides pgbench runs. One workload is OLTP where all the data fits in shared buffers, synchronous_commit is off, and the bottleneck is ProcArrayLock (all backtraces look something like the following):
[Thread debugging using libthread_db enabled]
0x00007f10e01d4f27 in semop () from /lib64/libc.so.6
#0  0x00007f10e01d4f27 in semop () from /lib64/libc.so.6
#1  0x000000000061fe27 in PGSemaphoreLock (sema=0x7f11f2d4f430, interruptOK=0 '\000') at pg_sema.c:421
#2  0x00000000006769ba in LWLockAcquireCommon (l=0x7f10e1e00120, mode=LW_EXCLUSIVE) at lwlock.c:626
#3  LWLockAcquire (l=0x7f10e1e00120, mode=LW_EXCLUSIVE) at lwlock.c:467
#4  0x0000000000667862 in ProcArrayEndTransaction (proc=0x7f11f2d4f420, latestXid=182562881) at procarray.c:404
#5  0x00000000004b579b in CommitTransaction () at xact.c:1957
#6  0x00000000004b6ae5 in CommitTransactionCommand () at xact.c:2727
#7  0x00000000006819d9 in finish_xact_command () at postgres.c:2437
#8  0x0000000000684f05 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x21e1a70 "xivadb", username=<value optimized out>) at postgres.c:4270
#9  0x0000000000632d7d in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4155
#10 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:3829
#11 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1597
#12 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1244
#13 0x00000000005cadb8 in main (argc=3, argv=0x21e0aa0) at main.c:228
Another is where all the data fits in RAM but does not fit in shared buffers, and the bottleneck is mostly BufFreelistLock, with appreciable contention around buffer-partition locking and buffer locking. Taking several backtraces of several backends with GDB and doing some bash magic gives the following:
root@pgload01g ~ # grep '^#4 ' /tmp/bt | awk '{print $2, $4, $NF}' | sort | uniq -c | sort -rn
    126 0x000000000065db61 BufferAlloc bufmgr.c:591
     67 0x000000000065e03a BufferAlloc bufmgr.c:760
     43 0x00000000005c8c3b pq_getbyte pqcomm.c:899
     39 0x000000000065dd93 BufferAlloc bufmgr.c:765
      6 0x00000000004b52bb RecordTransactionCommit xact.c:1194
      4 0x000000000065da0e ReadBuffer_common bufmgr.c:476
      1 ReadBuffer_common relpersistence=112 bufmgr.c:340
      1 exec_eval_expr expr=0x166e908, pl_exec.c:4796
      1 0x00007f78b8cb217b ?? /usr/pgsql-9.4/lib/pg_stat_statements.so
      1 0x00000000005d4cbb _copyList copyfuncs.c:3849
root@pgload01g ~ #
For both scenarios on Linux we got approximately the same results - the version with timings was faster than the version with sampling (sampling was done every 10 ms). Vanilla PostgreSQL from REL9_4_STABLE gave ~15500 tps, the version with timings gave ~14500 tps, while the version with sampling gave ~13800 tps. In all cases the processor was 100% utilized. Comparing vanilla PostgreSQL and the version with timings on a constant workload (12000 tps) gave the following results in query latencies:
q'th   vanilla   timing
99%    79.0      97.0 (+18.0)
98%    64.0      76.0 (+12.0)
95%    38.0      47.0 (+9.0)
90%    16.0      21.0 (+5.0)
85%     7.0      11.0 (+4.0)
80%     5.0       7.0 (+2.0)
75%     4.0       5.0 (+1.0)
50%     2.0       3.0 (+1.0)
And in that test the version with timings consumed about 7% more CPU. Do these seem like the results of a workload where the lwlock system is stressed?
And when the data does not fit in RAM you really don't see much difference between all three versions, because the contention moves from the lwlock system to I/O, even with the newest NVMe SSDs, or at least is divided between lwlocks and other wait events.
Yes, but some competing products also provide comprehensive waits
monitoring. That makes me think it should be possible for us too.
I agree, but keep in mind that some of those products may use
techniques to reduce the overhead that we don't have available. I
have a strong suspicion that one of those products in particular has
done something clever to make measuring the time cheap on all
platforms. Whatever that clever thing is, we haven't done it. So
that matters.
I don't know about all the products, but Oracle didn't do something clever; they use exactly gettimeofday on Linux - strace proves that.
I think the reason for hooks could be not only disagreements about design,
but platform-dependent issues too.
The next step after we have a view with current wait events will be gathering
statistics on them. We can contrast at least two approaches here:
1) Periodic sampling of current wait events.
2) Measuring each wait event's duration. We could collect statistics locally
for a short period and update a shared memory structure periodically (using
some synchronization protocol).
In the previous attempt to gather lwlock statistics, you predicted that
sampling could have a significant overhead [1]. In contrast, on many systems
time measurements are cheap. We have implemented both approaches, and they
show that sampling every 1 millisecond produces higher overhead than
individual duration measurements for each wait event. We can share another
version of waits monitoring based on sampling to make these results
reproducible for everybody. However, cheap time measurements are not
available on every platform. For instance, ISTM that on Windows time
measurements are too expensive [2].
That makes me think that we need a pluggable solution, at least for
statistics: direct measurement of event durations for the majority of
systems, and sampling for the others as the lesser harm.
To me, those seem like arguments for making it configurable, but not
necessarily for having hooks.
We discussed this with Bruce at pgday.ru, and AFAIK Bruce said that it would be enough to have the ability to turn measuring timings off with a GUC, so as not to incur significant overhead on some platforms. Isn't that a good approach?
BTW, we have also tested a version with waits monitoring turned off, with pgbench and on the two workloads mentioned earlier, and we haven't seen any overhead in any of the tests.
I think it's reasonable to consider reporting this data in the PGPROC
using a 4-byte integer rather than reporting it through a single byte
in the backend status structure. I believe that addresses the
concerns about reporting from auxiliary processes, and it also allows
a little more data to be reported. For anything in excess of that, I
think we should think rather harder. Most likely, such additional
detail should be reported only for certain types of wait events, or on
a delay, or something like that, so that the core mechanism remains
really, really fast.
That sounds reasonable. There are many pending questions, but it seems like
a step forward to me.
Great, let's do it. I think we should probably do the work to
separate the non-individual lwlocks into tranches first, though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 18, 2015 at 4:08 AM, Vladimir Borodin <root@simply.name> wrote:
> For both scenarios on Linux we got approximately the same results - the
> version with timings was faster than the version with sampling (sampling
> was done every 10 ms). Vanilla PostgreSQL from REL9_4_STABLE gave ~15500
> tps, the version with timings gave ~14500 tps, while the version with
> sampling gave ~13800 tps. In all cases the processor was 100% utilized.
> Comparing vanilla PostgreSQL and the version with timings on a constant
> workload (12000 tps) gave the following results in query latencies:

If the timing is speeding things up, that's most likely a sign that the spinlock contention on that workload is so severe that you are spending a lot of time in s_lock. Adding more things for the system to do that don't require that lock will speed the system up by reducing the contention. Instead of inserting gettimeofday() calls, you could insert a for loop that counts to some large number without doing any useful work, and that would likely have a similar effect.

In any case, I think your experiment clearly proves that the presence or absence of this instrumentation *is* performance-relevant and that we *do* need to worry about what it costs. If the system gets 20% faster when you call gettimeofday() a lot, does that mean we should insert gettimeofday() calls all over the system in random places to speed it up?

I do agree that if we're going to include support for timings, having them be controlled by a GUC is a good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Vladimir Borodin
Date:
On 18 Sep 2015, at 20:16, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 18, 2015 at 4:08 AM, Vladimir Borodin <root@simply.name> wrote:

For both scenarios on Linux we got approximately the same results - the version with timings was faster than the version with sampling (sampling was done every 10 ms). Vanilla PostgreSQL from REL9_4_STABLE gave ~15500 tps, the version with timings gave ~14500 tps, while the version with sampling gave ~13800 tps. In all cases the processor was 100% utilized. Comparing vanilla PostgreSQL and the version with timings on a constant workload (12000 tps) gave the following results in query latencies:
If the timing is speeding things up, that's most likely a sign that
the spinlock contention on that workload is so severe that you are
spending a lot of time in s_lock. Adding more things for the system
to do that don't require that lock will speed the system up by
reducing the contention. Instead of inserting gettimeofday() calls,
you could insert a for loop that counts to some large number without
doing any useful work, and that would likely have a similar effect.
In any case, I think your experiment clearly proves that the presence
or absence of this instrumentation *is* performance-relevant and that
we *do* need to worry about what it costs. If the system gets 20%
faster when you call gettimeofday() a lot, does that mean we should
insert gettimeofday() calls all over the system in random places to
speed it up?
No, you probably misunderstood the results; let me explain one more time. Unpatched PostgreSQL from REL9_4_STABLE gave 15500 tps. The version with timings gave 14500 tps, which is 6.5% worse. The version with sampling wait events every 10 ms gave 13800 tps (11% worse than unpatched and 5% worse than with timings).
We also made a test with a stable workload of 12000 tps for the unpatched version and the version with timings. In that test we saw that response times are a bit worse in the version with timings, as shown in the table below. You should read the table as follows: 99% of all queries in the unpatched version fit in 79 ms, while in the version with timings 99% of all queries fit in 97 ms, which is 18 ms slower, and so on. That test also showed that the version with timings consumes an extra 7% of CPU to handle the same workload as the unpatched version.
So this is the cost of waits monitoring with timings on an lwlock-stress workload - 6.5% less throughput, somewhat worse latencies, and an extra 7% of CPU. If you insert gettimeofday() calls all over the system in random places, you expectedly will not speed it up; you will make it slower.
q'th   vanilla   timing
99%    79.0      97.0 (+18.0)
98%    64.0      76.0 (+12.0)
95%    38.0      47.0 (+9.0)
90%    16.0      21.0 (+5.0)
85%     7.0      11.0 (+4.0)
80%     5.0       7.0 (+2.0)
75%     4.0       5.0 (+1.0)
50%     2.0       3.0 (+1.0)
I do agree that if we're going to include support for timings, having
them be controlled by a GUC is a good idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 18, 2015 at 4:43 PM, Vladimir Borodin <root@simply.name> wrote:
> No, you probably misunderstood the results; let me explain one more time.

Yeah, I did, sorry.

> Unpatched PostgreSQL from REL9_4_STABLE gave 15500 tps. The version with
> timings gave 14500 tps, which is 6.5% worse. The version with sampling wait
> events every 10 ms gave 13800 tps (11% worse than unpatched and 5% worse
> than with timings).

OK. So, I don't really care about the timing stuff right now. I want to get the stuff without timings done first. To do that, we need to come to an agreement on how much information we're going to try to expose and from where, and we need to make sure that doing that doesn't cause a performance hit. Then, if we want to add timing as an optional feature for people who can tolerate the overhead, that's fine.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Sep 16, 2015 at 11:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
>
> >> I think it's reasonable to consider reporting this data in the PGPROC
> >> using a 4-byte integer rather than reporting it through a single byte
> >> in the backend status structure. I believe that addresses the
> >> concerns about reporting from auxiliary processes, and it also allows
> >> a little more data to be reported. For anything in excess of that, I
> >> think we should think rather harder. Most likely, such additional
> >> detail should be reported only for certain types of wait events, or on
> >> a delay, or something like that, so that the core mechanism remains
> >> really, really fast.
> >
> > That sounds reasonable. There are many pending questions, but it seems like
> > a step forward to me.
>
> Great, let's do it. I think we should probably do the work to
> separate the non-individual lwlocks into tranches first, though.
>
One thing that occurred to me in this context is that if we store the wait
event information in PGPROC, then can we think of providing the info
about wait events in a separate view pg_stat_waits (or pg_stat_wait_info or
any other better name) where we can display wait information about
all processes rather than only backends? This will avoid the confusion
about breaking the backward compatibility for the current 'waiting' column
in pg_stat_activity.
pg_stat_waits can have columns:
pid - Process Id
wait_class_name - Name of the wait class
wait_event_name - Name of the wait event
We can extend it later with the information about timing for wait event.
Also, if we follow this approach, I think we don't need to store this
information in PgBackendStatus.
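For illustration, output of such a view might look like this (hypothetical names and values, assuming the three columns above):

     pid  | wait_class_name | wait_event_name
    ------+-----------------+-----------------
    12034 | LWLock          | ProcArrayLock
    12036 | Lock            | relation
    12040 | I/O             | DataFileRead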
On 14 November 2015 at 15:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
One thing that occurred to me in this context is that if we store the wait event information in PGPROC, then can we think of providing the info about wait events in a separate view pg_stat_waits (or pg_stat_wait_info or any other better name) where we can display wait information about all processes rather than only backends?
Sounds good to me. Consider a logical decoding plugin in a walsender for example.
This will avoid the confusion about breaking the backward compatibility for the current 'waiting' column in pg_stat_activity.
I'm about -10^7 on changing the 'waiting' column. I am still seeing confused users from the 'procpid' to 'pid' renaming. If more info is needed, add a column with more detail or, as you suggest here, use a new view.
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Vladimir Borodin
Date:
On 14 Nov 2015, at 10:50, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 16, 2015 at 11:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
>
> >> I think it's reasonable to consider reporting this data in the PGPROC
> >> using a 4-byte integer rather than reporting it through a single byte
> >> in the backend status structure. I believe that addresses the
> >> concerns about reporting from auxiliary processes, and it also allows
> >> a little more data to be reported. For anything in excess of that, I
> >> think we should think rather harder. Most likely, such additional
> >> detail should be reported only for certain types of wait events, or on
> >> a delay, or something like that, so that the core mechanism remains
> >> really, really fast.
> >
> > That sounds reasonable. There are many pending questions, but it seems like
> > a step forward to me.
>
> Great, let's do it. I think we should probably do the work to
> separate the non-individual lwlocks into tranches first, though.
One thing that occurred to me in this context is that if we store the wait event information in PGPROC, then can we think of providing the info about wait events in a separate view pg_stat_waits (or pg_stat_wait_info or any other better name) where we can display wait information about all processes rather than only backends? This will avoid the confusion about breaking the backward compatibility for the current 'waiting' column in pg_stat_activity.

pg_stat_waits can have columns:
pid - Process Id
wait_class_name - Name of the wait class
wait_event_name - Name of the wait event

We can extend it later with information about timing for wait events.

Also, if we follow this approach, I think we don't need to store this information in PgBackendStatus.
Sounds like exactly the same thing that was proposed by Ildus in this thread [0]. Great to be thinking in the same direction. And by way of advertisement, I have described using all those views here [1].

[0] http://www.postgresql.org/message-id/559D4729.9080704@postgrespro.ru
[1] https://simply.name/pg-stat-wait.html
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Michael Paquier
Date:
On Tue, Nov 17, 2015 at 8:36 PM, Vladimir Borodin <root@simply.name> wrote:
>
> On 14 Nov 2015, at 10:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Sep 16, 2015 at 11:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov
>> <aekorotkov@gmail.com> wrote:
>>
>> >> I think it's reasonable to consider reporting this data in the PGPROC
>> >> using a 4-byte integer rather than reporting it through a single byte
>> >> in the backend status structure. I believe that addresses the
>> >> concerns about reporting from auxiliary processes, and it also allows
>> >> a little more data to be reported. For anything in excess of that, I
>> >> think we should think rather harder. Most likely, such additional
>> >> detail should be reported only for certain types of wait events, or on
>> >> a delay, or something like that, so that the core mechanism remains
>> >> really, really fast.
>> >
>> > That sounds reasonable. There are many pending questions, but it seems
>> > like a step forward to me.
>>
>> Great, let's do it. I think we should probably do the work to
>> separate the non-individual lwlocks into tranches first, though.
>>
>
> One thing that occurred to me in this context is that if we store the wait
> event information in PGPROC, then can we think of providing the info
> about wait events in a separate view pg_stat_waits (or pg_stat_wait_info or
> any other better name) where we can display wait information about
> all processes rather than only backends? This will avoid the confusion
> about breaking the backward compatibility for the current 'waiting' column
> in pg_stat_activity.
>
> pg_stat_waits can have columns:
> pid - Process Id
> wait_class_name - Name of the wait class
> wait_event_name - Name of the wait event
>
> We can extend it later with information about timing for wait events.
>
> Also, if we follow this approach, I think we don't need to store this
> information in PgBackendStatus.
>
>
> Sounds like exactly the same thing that was proposed by Ildus in this thread [0].
> Great to be thinking in the same direction. And by way of advertisement, I have
> described using all those views here [1].
>
> [0] http://www.postgresql.org/message-id/559D4729.9080704@postgrespro.ru
> [1] https://simply.name/pg-stat-wait.html

This thread has stalled a bit and is waiting for new patches for some time now, hence I have switched it as "returned with feedback" on the CF app.
--
Michael
On Thu, Dec 24, 2015 at 8:02 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Tue, Nov 17, 2015 at 8:36 PM, Vladimir Borodin <root@simply.name> wrote:
> >
> > On 14 Nov 2015, at 10:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Sep 16, 2015 at 11:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov
> >> <aekorotkov@gmail.com> wrote:
> >
> > One thing that occurred to me in this context is that if we store the wait
> > event information in PGPROC, then can we think of providing the info
> > about wait events in a separate view pg_stat_waits (or pg_stat_wait_info or
> > any other better name) where we can display wait information about
> > all-processes rather than only backends? This will avoid the confusion
> > about breaking the backward compatibility for the current 'waiting' column
> > in pg_stat_activity.
> >
> > pg_stat_waits can have columns:
> > pid - Process Id
> > wait_class_name - Name of the wait class
> > wait class_event - name of the wait event
> >
> > We can extend it later with the information about timing for wait event.
> >
> > Also, if we follow this approach, I think we don't need to store this
> > information in PgBackendStatus.
> >
> >
> > Sounds like exactly the same that was proposed by Ildus in this thead [0].
> > Great to be thinking in the same direction. And on the rights of
> > advertisements I’ve somehow described using all those views here [1].
> >
> > [0] http://www.postgresql.org/message-id/559D4729.9080704@postgrespro.ru
> > [1] https://simply.name/pg-stat-wait.html
>
> This thread has stalled a bit and is waiting for new patches for some
> time now, hence I have switched it as "returned with feedback" on the
> CF app.
>
The reason for not updating the patch related to this thread is that it is
dependent on the work for refactoring the tranches for LWLocks [1]
which is now coming towards an end, so I think it is quite reasonable
that the patch can be updated for this work during commit fest, so
I am moving it to upcoming CF.
Amit Kapila wrote:
> The reason for not updating the patch related to this thread is that it is
> dependent on the work for refactoring the tranches for LWLocks [1]
> which is now coming towards an end, so I think it is quite reasonable
> that the patch can be updated for this work during commit fest, so
> I am moving it to upcoming CF.

Thanks. I think the tranche reworks are mostly done now, so is anyone submitting an updated version of this patch?

Also, it would be very good if someone can provide insight on how this patch interacts with the other submitted patch for "waiting for replication" https://commitfest.postgresql.org/8/436/ Andres seems to think that the other patch is completely independent of this one, i.e. the "waiting for replication" column needs to exist separately and not as part of the "more descriptive" new 'waiting' column.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jan 18, 2016 at 11:09 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Amit Kapila wrote:
>
>> The reason for not updating the patch related to this thread is that it is
>> dependent on the work for refactoring the tranches for LWLocks [1]
>> which is now coming towards an end, so I think it is quite reasonable
>> that the patch can be updated for this work during commit fest, so
>> I am moving it to upcoming CF.
>
> Thanks. I think the tranche reworks are mostly done now, so is anyone
> submitting an updated version of this patch?
>
> Also, it would be very good if someone can provide insight on how this
> patch interacts with the other submitted patch for "waiting for
> replication" https://commitfest.postgresql.org/8/436/
> Andres seems to think that the other patch is completely independent of
> this one, i.e. the "waiting for replication" column needs to exist
> separately and not as part of the "more descriptive" new 'waiting'
> column.

Yeah, I really don't agree with that. I think that it's much better to have one column that says what you are waiting for than a bunch of separate columns that tell you whether you are waiting for individual things for which you might be waiting. I think this patch, which introduces the general mechanism, should win: and the other patch should then be one client of that mechanism.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jan 18, 2016 at 11:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 18, 2016 at 11:09 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Amit Kapila wrote:
> >
> >> The reason for not updating the patch related to this thread is that it is
> >> dependent on the work for refactoring the tranches for LWLocks [1]
> >> which is now coming towards an end, so I think it is quite reasonable
> >> that the patch can be updated for this work during commit fest, so
> >> I am moving it to upcoming CF.
> >
> > Thanks. I think the tranche reworks are mostly done now, so is anyone
> > submitting an updated version of this patch?
> >
Before updating the patch, it is better to clarify a few points as mentioned
below.
>
> > Also, it would be very good if someone can provide insight on how this
> > patch interacts with the other submitted patch for "waiting for
> > replication" https://commitfest.postgresql.org/8/436/
> > Andres seems to think that the other patch is completely independent of
> > this one, i.e. the "waiting for replication" column needs to exist
> > separately and not as part of the "more descriptive" new 'waiting'
> > column.
>
> Yeah, I really don't agree with that. I think that it's much better
> to have one column that says what you are waiting for than a bunch of
> separate columns that tell you whether you are waiting for individual
> things for which you might be waiting. I think this patch, which
> introduces the general mechanism, should win: and the other patch
> should then be one client of that mechanism.
>
I agree with what you have said, but I think the bigger question here is about
the UI and which is the more appropriate place to store wait information. I
will try to summarize the options discussed.
Initially, we started with extending the 'waiting' column in pg_stat_activity,
to which some people have raised concerns about backward
compatibility, so another option that came up during discussion was to
retain waiting as-is and have an additional column 'wait_event' in
pg_stat_activity; after that there was feedback that we should try to include
wait information about background processes as well, which raises a bigger
question: is it any good to expose this information via pg_stat_activity
(pg_stat_activity doesn't display information about background processes),
or is it better to have a new view as discussed here [1]?
Second important and somewhat related point is whether we should save
this information in PGPROC as 4 bytes or keep it in pgBackendStatus.
I think it is better to store in PGPROC, if we want to save wait information
for backend processes as well.
I am of opinion that we should save this information in PGPROC and
expose it via new view, but I am open to go other ways based on what
others think about this matter.
On Mon, Jan 18, 2016 at 10:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Initially, we started with extending the 'waiting' column in
> pg_stat_activity, to which some people have raised concerns about
> backward compatability, so another option that came-up during discussion
> was to retain waiting as it-is and have an additional column 'wait_event'
> in pg_stat_activity, after that there is feedback that we should try to
> include wait information about background processes as well which raises
> a bigger question whether it is any good to expose this information via
> pg_stat_activity (pg_stat_activity doesn't display information about
> background processes) or is it better to have a new view as discussed
> here [1].
>
> Second important and somewhat related point is whether we should save
> this information in PGPROC as 4 bytes or keep it in pgBackendStatus.
> I think it is better to store in PGPROC, if we want to save wait information
> for backend processes as well.
>
> I am of opinion that we should save this information in PGPROC and
> expose it via new view, but I am open to go other ways based on what
> others think about this matter.

My opinion is that storing the information in PGPROC is better because it seems like we can fairly painlessly expose 4 bytes of data that way instead of 1, which is nice.

On the topic of the UI, I understand that redefining pg_stat_activity.waiting might cause some short-term annoyance. But I think in the long term what we are proposing here is going to be a huge improvement, so I think it's worth the compatibility break. If we say that pg_stat_activity.waiting has to continue meaning "waiting for a heavyweight lock" even though we now also expose (in some other location) information on other kinds of waits, that's going to be confusing to users. It's better to force people to update their queries once than to have this confusing wart in the system forever. I predict that if we make backward compatibility the priority here, we'll still be explaining it to smart but confused people when 9.6 goes EOL.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jan 20, 2016 at 12:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 18, 2016 at 10:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Second important and somewhat related point is whether we should save
> > this information in PGPROC as 4 bytes or keep it in pgBackendStatus.
> > I think it is better to store in PGPROC, if we want to save wait information
> > for backend processes as well.
> >
> > I am of opinion that we should save this information in PGPROC and
> > expose it via new view, but I am open to go other ways based on what
> > others think about this matter.
>
> My opinion is that storing the information in PGPROC is better because
> it seems like we can fairly painlessly expose 4 bytes of data that way
> instead of 1, which is nice.
>
Okay, do you mean to say that we can place this new 4-byte variable
in PGPROC at a 4-byte aligned boundary, so both reads and writes will be
atomic?
> On the topic of the UI, I understand that redefining
> pg_stat_activity.waiting might cause some short-term annoyance. But I
> think in the long term what we are proposing here is going to be a
> huge improvement, so I think it's worth the compatibility break. If
> we say that pg_stat_activity.waiting has to continue meaning "waiting
> for a heavyweight lock" even though we now also expose (in some other
> location) information on other kinds of waits, that's going to be
> confusing to users.
>
If we want to go via this route, then the first thing which we need to
decide is whether we want to start displaying the information of
background processes like WALWriter and others in pg_stat_activity?

A second thing that needs some thought is that functions like
pg_stat_get_activity() need to rely both on PgBackendStatus and
PGPROC, and we might also need to do some special handling for
background processes if we want the information for those processes
in this view.
> It's better to force people to update their
> queries once than to have this confusing wart in the system forever.
> I predict that if we make backward compatibility the priority here,
> we'll still be explaining it to smart but confused people when 9.6
> goes EOL.
>
Valid point, OTOH we can update the docs to say that
pg_stat_activity.waiting parameter is deprecated and after a
release or two we can get rid of this parameter.
On Tue, Jan 19, 2016 at 11:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> My opinion is that storing the information in PGPROC is better because
>> it seems like we can fairly painlessly expose 4 bytes of data that way
>> instead of 1, which is nice.
>
> Okay, do you mean to say that we can place this new 4-byte variable
> in PGPROC at 4-byte aligned boundary, so both read and writes will be
> atomic?

Yes. However, note that you don't need to do anything special to get it 4-byte aligned. The compiler will do that automatically.

>> On the topic of the UI, I understand that redefining
>> pg_stat_activity.waiting might cause some short-term annoyance. But I
>> think in the long term what we are proposing here is going to be a
>> huge improvement, so I think it's worth the compatibility break. If
>> we say that pg_stat_activity.waiting has to continue meaning "waiting
>> for a heavyweight lock" even though we now also expose (in some other
>> location) information on other kinds of waits, that's going to be
>> confusing to users.
>
> If we want to go via this route, then the first thing which we need to
> decide is whether we want to start displaying the information of
> background processes like WALWriter and others in pg_stat_activity?

That doesn't seem like a particularly good fit - few of the fields are relevant to that case. We could provide some other way of getting at the information for background processes if people want, but personally I'd probably be inclined not to bother with it for right now.

>> It's better to force people to update their
>> queries once than to have this confusing wart in the system forever.
>> I predict that if we make backward compatibility the priority here,
>> we'll still be explaining it to smart but confused people when 9.6
>> goes EOL.
>
> Valid point, OTOH we can update the docs to say that
> pg_stat_activity.waiting parameter is deprecated and after a
> release or two we can get rid of this parameter.

My impression is that doesn't really ease the pain much. Half the time we never actually remove the deprecated column, or much later than predicted, and people don't read the docs and keep relying on it anyway. So in the end it just makes the process more complicated without really helping.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
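To make the alignment point concrete, here is a minimal standalone C sketch; the struct is only a stand-in for PGPROC, and none of the names come from a posted patch:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for a PGPROC-like shared-memory entry. The compiler
     * pads the struct so that the uint32_t member lands on a 4-byte
     * boundary; a plain assignment to it is then a single aligned
     * store, with no lock or atomic op required. */
    struct proc_stub
    {
        char     other_state[3];   /* deliberately awkward neighbour */
        uint32_t wait_event_info;  /* still placed at offset 4 */
    };

    int
    main(void)
    {
        struct proc_stub p;

        printf("offset of wait_event_info: %zu\n",
               offsetof(struct proc_stub, wait_event_info));
        p.wait_event_info = 42;    /* one aligned 4-byte store */
        return 0;
    }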
On Wed, Jan 20, 2016 at 6:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jan 19, 2016 at 11:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> On the topic of the UI, I understand that redefining
> >> pg_stat_activity.waiting might cause some short-term annoyance. But I
> >> think in the long term what we are proposing here is going to be a
> >> huge improvement, so I think it's worth the compatibility break. If
> >> we say that pg_stat_activity.waiting has to continue meaning "waiting
> >> for a heavyweight lock" even though we now also expose (in some other
> >> location) information on other kinds of waits, that's going to be
> >> confusing to users.
> >
> > If we want to go via this route, then the first thing which we need to
> > decide is whether we want to start displaying the information of
> > background processes like WALWriter and others in pg_stat_activity?
>
> That doesn't seem like a particularly good fit - few of the fields are
> relevant to that case. We could provide some other way of getting at
> the information for background processes if people want, but
> personally I'd probably be inclined not to bother with it for right
> now.
>
I have updated the patch accordingly. pg_stat_get_activity.waiting is
changed to a text column wait_event, and currently it will display the
heavy-weight and light-weight lock information for backends; certainly
it can be extended to report network wait or disk wait events, but I feel
that can be done as an add-on patch. For LWLocks, it returns the LWLock
name for individual locks and the tranche name for others.
Attachment
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
"andres@anarazel.de"
Date:
On 2016-01-26 13:22:09 +0530, Amit Kapila wrote:
> @@ -633,9 +633,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
>        <entry>Time when the <structfield>state</> was last changed</entry>
>       </row>
>       <row>
> -      <entry><structfield>waiting</></entry>
> -      <entry><type>boolean</></entry>
> -      <entry>True if this backend is currently waiting on a lock</entry>
> +      <entry><structfield>wait_event</></entry>
> +      <entry><type>text</></entry>
> +      <entry>Wait event name if backend is currently waiting, otherwise
> +       <literal>process not waiting</>
> +      </entry>
>       </row>
>       <row>

I still think this is a considerable regression in pg_stat_activity usability. There are lots of people out there that have queries that automatically monitor pg_stat_activity.waiting, and automatically go to pg_locks to understand what's going on, if there's one. With the above definition, that got much harder. Not only do I have to write WHERE wait_event <> 'process not waiting', but then also parse the wait event name, to know whether the process is waiting on a heavyweight lock, or something else!

I do think there's a considerable benefit in improving the instrumentation here, but this strikes me as making life more complex for more users than it makes it easier. At the very least this should be split into two fields (type & what we're actually waiting on). I also strongly suspect we shouldn't use in band signaling ("process not waiting"), but rather make the field NULL if we're not waiting on anything.

Greetings,

Andres Freund
On Tue, Jan 26, 2016 at 1:40 PM, andres@anarazel.de <andres@anarazel.de> wrote:
>
> On 2016-01-26 13:22:09 +0530, Amit Kapila wrote:
> > @@ -633,9 +633,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
> > <entry>Time when the <structfield>state</> was last changed</entry>
> > </row>
> > <row>
> > - <entry><structfield>waiting</></entry>
> > - <entry><type>boolean</></entry>
> > - <entry>True if this backend is currently waiting on a lock</entry>
> > + <entry><structfield>wait_event</></entry>
> > + <entry><type>text</></entry>
> > + <entry>Wait event name if backend is currently waiting, otherwise
> > + <literal>process not waiting</>
> > + </entry>
> > </row>
> > <row>
>
> I still think this is a considerable regression in pg_stat_activity
> usability. There are lots of people out there that have queries that
> automatically monitor pg_stat_activity.waiting, and automatically go to
> pg_locks to understand what's going on, if there's one. With the above
> definition, that got much harder. Not only do I have to write
> WHERE wait_event <> 'process not waiting', but then also parse the wait
> event name, to know whether the process is waiting on a heavyweight
> lock, or something else!
>
> I do think there's a considerable benefit in improving the
> instrumentation here, but his strikes me as making live more complex for
> more users than it makes it easier.
>
Here, we have two ways to expose this functionality to the user. The first
is that we expose this new set of information (wait_type, wait_event)
separately, either in a new view or in pg_stat_activity, ask users
to migrate to this new information, mark pg_stat_activity.waiting as
deprecated, and then remove it in future versions. The second is to remove
pg_stat_activity.waiting and expose the new set of information, which will
force users to move to it. I think both ways have their pros and cons,
and they are discussed upthread; based on that I have decided to move
forward with the second way.
> At the very least this should be
> split into two fields (type & what we're actually waiting on).
>
makes sense to me, so we can represent wait_type as:
wait_type text, values can be Lock (or HWLock), LWLock, Network, etc.
Let me know if that is okay or if you have something else in mind?
> I also
> strongly suspect we shouldn't use in band signaling ("process not
> waiting"), but rather make the field NULL if we're not waiting on
> anything.
>
Agree, will change in next version of patch.
On Tue, Jan 26, 2016 at 3:10 AM, andres@anarazel.de <andres@anarazel.de> wrote:
> I do think there's a considerable benefit in improving the
> instrumentation here, but his strikes me as making live more complex for
> more users than it makes it easier. At the very least this should be
> split into two fields (type & what we're actually waiting on). I also
> strongly suspect we shouldn't use in band signaling ("process not
> waiting"), but rather make the field NULL if we're not waiting on
> anything.

+1 for splitting it into two fields.

Regarding making the field NULL, someone (I think you) proposed previously that we should have one field indicating whether we are waiting, and a separate field (or two) indicating the current or most recent wait event. That would be similar to how pg_stat_activity.{query,state} work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 28, 2016 at 2:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jan 26, 2016 at 3:10 AM, andres@anarazel.de <andres@anarazel.de> wrote:
> > I do think there's a considerable benefit in improving the
> > instrumentation here, but his strikes me as making live more complex for
> > more users than it makes it easier. At the very least this should be
> > split into two fields (type & what we're actually waiting on). I also
> > strongly suspect we shouldn't use in band signaling ("process not
> > waiting"), but rather make the field NULL if we're not waiting on
> > anything.
>
> +1 for splitting it into two fields.
>
I will take care of this.
>
> Regarding making the field NULL, someone (I think you) proposed
> previously that we should have one field indicating whether we are
> waiting, and a separate field (or two) indicating the current or most
> recent wait event.
>
I think to do it that way we need to change the meaning of
pg_stat_activity.waiting from waiting on locks to waiting on HWLocks,
LWLocks, and other events which we add in future. That can again be
another source of confusion: if existing users of pg_stat_activity.waiting
are still relying on it, they can get wrong information, or if they are
aware of the recent change, then they need to add an additional check like
(waiting = true && wait_event_type = 'HWLock'). So I think it is better to
go with the suggestion where we can display NULL in the new field when
there is no wait.
On Thu, Jan 28, 2016 at 9:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 28, 2016 at 2:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Jan 26, 2016 at 3:10 AM, andres@anarazel.de <andres@anarazel.de> wrote:
> > > I do think there's a considerable benefit in improving the
> > > instrumentation here, but his strikes me as making live more complex for
> > > more users than it makes it easier. At the very least this should be
> > > split into two fields (type & what we're actually waiting on). I also
> > > strongly suspect we shouldn't use in band signaling ("process not
> > > waiting"), but rather make the field NULL if we're not waiting on
> > > anything.
> >
> > +1 for splitting it into two fields.
> >
>
> I will take care of this.
>
As discussed, I have added a new field wait_event_type along with
wait_event in pg_stat_activity. Changed the code to return NULL if the
backend is not waiting. Updated the docs as well.
Attachment
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Alexander Korotkov
Date:
On Sun, Jan 31, 2016 at 6:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jan 28, 2016 at 9:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 28, 2016 at 2:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Jan 26, 2016 at 3:10 AM, andres@anarazel.de <andres@anarazel.de> wrote:
> > > I do think there's a considerable benefit in improving the
> > > instrumentation here, but his strikes me as making live more complex for
> > > more users than it makes it easier. At the very least this should be
> > > split into two fields (type & what we're actually waiting on). I also
> > > strongly suspect we shouldn't use in band signaling ("process not
> > > waiting"), but rather make the field NULL if we're not waiting on
> > > anything.
> >
> > +1 for splitting it into two fields.
> >
>
> I will take care of this.
>
> As discussed, I have added a new field wait_event_type along with
> wait_event in pg_stat_activity. Changed the code return NULL, if
> backend is not waiting. Updated the docs as well.
I wonder if we can use 4-byte wait_event_info more efficiently.
LWLock number in the tranche would be also useful information to expose. Using lwlock number user can determine if there is high concurrency for single lwlock in tranche or it is spread over multiple lwlocks.
I think it would be enough to have 6 bits for event class id and 10 bit for event id. So, it would be maximum 64 event classes and maximum 1024 events per class. These limits seem to be fair enough for me.
And then we save 16 bits for lock number. It's certainly not enough for some tranches. For instance, number of buffers could be easily more than 2^16. However, we could expose at least lower 16 bits. It would be at least something. Using this information user at least can make a conclusion like "It MIGHT be a high concurrency for single buffer content. Other way it is coincidence that a lot of different buffers have the same 16 lower bits.".
Any thoughts?
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
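For concreteness, the packing Alexander proposes could be sketched as below; the function and mask names are illustrative only and do not come from any posted patch:

    #include <stdint.h>

    /* 6 bits of class id, 10 bits of event id, and the low-order
     * 16 bits of the lock number, packed into one 4-byte value. */
    static inline uint32_t
    pack_wait_event_info(uint32_t class_id, uint32_t event_id,
                         uint32_t lock_num)
    {
        return ((class_id & 0x3F) << 26) |    /* up to 64 classes  */
               ((event_id & 0x3FF) << 16) |   /* up to 1024 events */
               (lock_num & 0xFFFF);           /* truncated lock number */
    }

    static inline uint32_t
    unpack_class(uint32_t info) { return info >> 26; }

    static inline uint32_t
    unpack_event(uint32_t info) { return (info >> 16) & 0x3FF; }

    static inline uint32_t
    unpack_lock(uint32_t info)  { return info & 0xFFFF; }

As Alexander notes, the lock number is lossy for tranches with more than 2^16 members, which is exactly the caveat about buffers above.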
On Mon, Feb 1, 2016 at 7:10 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Sun, Jan 31, 2016 at 6:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jan 28, 2016 at 9:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 28, 2016 at 2:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Jan 26, 2016 at 3:10 AM, andres@anarazel.de <andres@anarazel.de> wrote:
> > > I do think there's a considerable benefit in improving the
> > > instrumentation here, but his strikes me as making live more complex for
> > > more users than it makes it easier. At the very least this should be
> > > split into two fields (type & what we're actually waiting on). I also
> > > strongly suspect we shouldn't use in band signaling ("process not
> > > waiting"), but rather make the field NULL if we're not waiting on
> > > anything.
> >
> > +1 for splitting it into two fields.
> >
>
> I will take care of this.
>
> As discussed, I have added a new field wait_event_type along with
> wait_event in pg_stat_activity. Changed the code return NULL, if
> backend is not waiting. Updated the docs as well.
>
> I wonder if we can use 4-byte wait_event_info more efficiently.
> LWLock number in the tranche would be also useful information to expose.
> Using lwlock number user can determine if there is high concurrency for
> single lwlock in tranche or it is spread over multiple lwlocks.
So what you are suggesting is that we have additional information,
like a sub_wait_event, which will display the lock number for the LWLocks
that belong to a tranche, and will otherwise be NULL for HWLocks
or individual LWLocks. I see the value in having one more field, but
just displaying some number, which is not even exact in some cases
(where there are many LWLocks), doesn't sound very informative.
Anybody else have an opinion on this matter?
> I think it would be enough to have 6 bits for event class id and 10 bit
> for event id. So, it would be maximum 64 event classes and maximum 1024
> events per class. These limits seem to be fair enough for me.
I also think those are fair limits; let me try to shrink those into the
number of bits suggested by you.
On Mon, Feb 1, 2016 at 8:40 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I wonder if we can use 4-byte wait_event_info more efficiently.
> LWLock number in the tranche would be also useful information to expose.
> Using lwlock number user can determine if there is high concurrency for
> single lwlock in tranche or it is spread over multiple lwlocks.
> I think it would be enough to have 6 bits for event class id and 10 bit for
> event id. So, it would be maximum 64 event classes and maximum 1024 events
> per class. These limits seem to be fair enough for me.
> And then we save 16 bits for lock number. It's certainly not enough for some
> tranches. For instance, number of buffers could be easily more than 2^16.
> However, we could expose at least lower 16 bits. It would be at least
> something. Using this information user at least can make a conclusion like
> "It MIGHT be a high concurrency for single buffer content. Other way it is
> coincidence that a lot of different buffers have the same 16 lower bits.".
>
> Any thoughts?

Meh. I think that's trying to be too clever. That seems hard to document, hard to explain, and likely incomprehensible to users.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 2, 2016 at 10:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Feb 1, 2016 at 8:40 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > I wonder if we can use 4-byte wait_event_info more efficiently.
> > LWLock number in the tranche would be also useful information to expose.
> > Using lwlock number user can determine if there is high concurrency for
> > single lwlock in tranche or it is spread over multiple lwlocks.
> > I think it would be enough to have 6 bits for event class id and 10 bit for
> > event id. So, it would be maximum 64 event classes and maximum 1024 events
> > per class. These limits seem to be fair enough for me.
> > And then we save 16 bits for lock number. It's certainly not enough for some
> > tranches. For instance, number of buffers could be easily more than 2^16.
> > However, we could expose at least lower 16 bits. It would be at least
> > something. Using this information user at least can make a conclusion like
> > "It MIGHT be a high concurrency for single buffer content. Other way it is
> > coincidence that a lot of different buffers have the same 16 lower bits.".
> >
> > Any thoughts?
>
> Meh. I think that's trying to be too clever. That seems hard to
> document, hard to explain, and likely incomprehensible to users.
>
So, let's leave adding any additional column, but Alexander has brought up
a good point about storing the wait_type and actual wait_event
information into four bytes. Currently I have stored wait_type (aka classId)
in the first byte and then two bytes for wait_event (eventId); the remaining
one byte can be used in future if required. However, Alexander is proposing
to combine both of these (classId and eventId) into two bytes, which sounds
reasonable to me apart from the fact that it might add an operation or two
extra in this path. Do you or anyone else have any preference over this point?
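Rendered as hypothetical macros (the posted patch may spell this quite differently), the byte layout just described would be:

    #include <stdint.h>

    /* classId in the top byte, eventId in the middle two bytes,
     * and the low byte left spare for future use. */
    #define WAIT_INFO(classId, eventId) \
        (((uint32_t) (classId) << 24) | ((uint32_t) (eventId) << 8))
    #define WAIT_CLASS(info)  ((uint8_t)  ((info) >> 24))
    #define WAIT_EVENT(info)  ((uint16_t) (((info) >> 8) & 0xFFFF))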
On Tue, Feb 2, 2016 at 10:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> So, let's leave adding any additional column, but Alexander has brought up
> a good point about storing the wait_type and actual wait_event
> information into four bytes. Currently I have stored wait_type (aka classId)
> in first byte and then two bytes for wait_event (eventId) and remaining
> one-byte can be used in future if required, however Alexandar is proposing
> to combine both these (classId and eventId) into two-bytes which sounds
> reasonable to me apart from the fact that it might add operation or two
> extra in this path. Do you or anyone else have any preference over this point?

I wouldn't bother tinkering with it at this point. The value isn't going to be recorded on disk anywhere, so it will be easy to change the way it's computed in the future if we ever need to do that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 3, 2016 at 8:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Feb 2, 2016 at 10:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > So, let's leave adding any additional column, but Alexander has brought up
> > a good point about storing the wait_type and actual wait_event
> > information into four bytes. Currently I have stored wait_type (aka
> > classId)
> > in first byte and then two bytes for wait_event (eventId) and remaining
> > one-byte can be used in future if required, however Alexandar is proposing
> > to
> > combine both these (classId and eventId) into two-bytes which sounds
> > reasonable to me apart from the fact that it might add operation or two
> > extra
> > in this path. Do you or anyone else have any preference over this point?
>
> I wouldn't bother tinkering with it at this point. The value isn't
> going to be recorded on disk anywhere, so it will be easy to change
> the way it's computed in the future if we ever need to do that.
>
Okay. Find the rebased patch attached with this mail. I have moved
this patch to upcoming CF.
Attachment
On Mon, Feb 22, 2016 at 10:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I wouldn't bother tinkering with it at this point. The value isn't
>> going to be recorded on disk anywhere, so it will be easy to change
>> the way it's computed in the future if we ever need to do that.
>
> Okay. Find the rebased patch attached with this mail. I have moved
> this patch to upcoming CF.

I would call the functions pgstat_report_wait_start() and pgstat_report_wait_end() instead of pgstat_report_start_waiting() and pgstat_report_end_waiting().

I think pgstat_get_wait_event_type should not return HWLock, a term that appears nowhere in our source tree at present. How about just "Lock"?

I think the wait event types should be documented - and the wait events too, perhaps.

Maybe it's worth having separate wait event type names for lwlocks and lwlock tranches. We could report LWLockNamed and LWLockTranche and document the difference: "LWLockNamed indicates that the backend is waiting for a specific, named LWLock. The event name is the name of that lock. LWLockTranche indicates that the backend is waiting for any one of a group of locks with similar function. The event name identifies the general purpose of locks in that group."

There's no requirement that every session have every tranche registered. I think we should consider displaying "extension" for any tranche that's not built-in, or at least for tranches that are not registered (rather than "unknown wait event").

+ if (lock->tranche == 0 && lockId < NUM_INDIVIDUAL_LWLOCKS)

Isn't the second part automatically true at this point?

The changes to LockBufferForCleanup() don't look right to me. Waiting for a buffer pin might be a reasonable thing to define as a wait event, but it shouldn't be reported as if we were waiting on the LWLock itself.

What happens if an error is thrown while we're in a wait?

Does this patch hurt performance?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: RFC: replace pg_stat_activity.waiting with something more descriptive
From
Peter Eisentraut
Date:
Could you enhance the documentation about the difference between "wait event type name" and "wait event name" (examples?)? This is likely to be quite confusing for users who are used to just the plain "waiting" column.
On Wed, Feb 24, 2016 at 7:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 22, 2016 at 10:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> I wouldn't bother tinkering with it at this point. The value isn't
> >> going to be recorded on disk anywhere, so it will be easy to change
> >> the way it's computed in the future if we ever need to do that.
> >
> > Okay. Find the rebased patch attached with this mail. I have moved
> > this patch to upcoming CF.
>
> I would call the functions pgstat_report_wait_start() and
> pgstat_report_wait_end() instead of pgstat_report_start_waiting() and
> pgstat_report_end_waiting().
>
> I think pgstat_get_wait_event_type should not return HWLock, a term
> that appears nowhere in our source tree at present. How about just
> "Lock"?
>
> I think the wait event types should be documented - and the wait
> events too, perhaps.
>
> Maybe it's worth having separate wait event type names for lwlocks and
> lwlock tranches. We could report LWLockNamed and LWLockTranche and
> document the difference: "LWLockNamed indicates that the backend is
> waiting for a specific, named LWLock. The event name is the name of
> that lock. LWLockTranche indicates that the backend is waiting for
> any one of a group of locks with similar function. The event name
> identifies the general purpose of locks in that group."
Agreed with all the above points and will change the patch accordingly.
> There's no requirement that every session have every tranche
> registered. I think we should consider displaying "extension" for any
> tranche that's not built-in, or at least for tranches that are not
> registered (rather than "unknown wait event").
I think it is better to display "extension" for unregistered tranches,
but do you see any case where we will have wait event information
for any unregistered tranche?

Another point to consider is that if it is not possible to have a wait event
for an unregistered tranche, then should we have an Assert or elog(ERROR)
instead of "unknown wait event"?
> + if (lock->tranche == 0 && lockId < NUM_INDIVIDUAL_LWLOCKS)
>
> Isn't the second part automatically true at this point?
Yes, that point is automatically true and I think we should change
the same check in PRINT_LWDEBUG and LOG_LWDEBUG, although
as separate patches.
> The changes to LockBufferForCleanup() don't look right to me. Waiting
> for a buffer pin might be a reasonable thing to define as a wait
> event, but it shouldn't reported as if we were waiting on the LWLock
> itself.
makes sense, how about having a new wait class, something like
WAIT_BUFFER and then have wait event type as Buffer and
wait event as BufferPin. At this moment, I think there will be
only one event in this class, but it seems to me waiting on buffer
has merit to be considered as a separate class.
> What happens if an error is thrown while we're in a wait?
For LWLocks, only a FATAL error is possible, which will anyway lead
to initialization of all backend states. For lock.c, if an error is
thrown, then the state is reset in the Catch block. In
LockBufferForCleanup(), after we set the wait event and before we
reset it, there is only a chance of a FATAL error, if any system call
fails. We do have one error in enable_timeouts, which is called from
ResolveRecoveryConflictWithBufferPin(), but that doesn't seem to be
possible. Now one question to answer is: what if tomorrow someone adds
a new error after we set the wait state? So maybe it is better to clear
the wait event in AbortTransaction()?
> Does this patch hurt performance?

This patch adds additional code in the path where we are going
to sleep/wait, but we have changed some shared memory structure, so
it is good to verify performance tests. I am planning to run read-only
and read-write pgbench tests (when data fits in shared), is that
sufficient or do you expect anything more?
On Thu, Feb 25, 2016 at 10:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> There's no requirement that every session have every tranche
>> registered. I think we should consider displaying "extension" for any
>> tranche that's not built-in, or at least for tranches that are not
>> registered (rather than "unknown wait event").
>
> I think it is better to display as an "extension" for unregistered tranches,
> but do you see any case where we will have wait event information
> for any unregistered tranche?

Sure. If backend A has the tranche registered - because it loaded some .so which registered it in PG_init() - and is waiting on the lock, and backend B does not have the tranche registered - because it didn't do that - and backend B selects from pg_stat_activity, then you'll see a wait in an unregistered tranche.

>> The changes to LockBufferForCleanup() don't look right to me. Waiting
>> for a buffer pin might be a reasonable thing to define as a wait
>> event, but it shouldn't reported as if we were waiting on the LWLock
>> itself.
>
> makes sense, how about having a new wait class, something like
> WAIT_BUFFER and then have wait event type as Buffer and
> wait event as BufferPin. At this moment, I think there will be
> only one event in this class, but it seems to me waiting on buffer
> has merit to be considered as a separate class.

I would just go with BufferPin/BufferPin for now. I can't think what else I'd want to group with BufferPins in the same class.

>> What happens if an error is thrown while we're in a wait?
>
> For LWLocks, only FATAL error is possible which will anyway lead
> to initialization of all backend states. For lock.c, if an error is
> thrown, then state is reset in Catch block. In
> LockBufferForCleanup(), after we set the wait event and before we
> reset it, there is only a chance of FATAL error, if any system call
> fails. We do have one error in enable_timeouts which is called from
> ResolveRecoveryConflictWithBufferPin(), but that doesn't seem to
> be possible. Now one question to answer is that what if tomorrow
> some one adds new error after we set the wait state, so may be
> it is better to clear wait event in AbortTransaction()?

Yeah, I think so. Probably everywhere that we do LWLockReleaseAll() we should also clear the wait event.

>> Does this patch hurt performance?
>
> This patch adds additional code in the path where we are going
> to sleep/wait, but we have changed some shared memory structure, so
> it is good to verify performance tests. I am planning to run read-only
> and read-write pgbench tests (when data fits in shared), is that
> sufficient or do you expect anything more?

I'm open to ideas from others on tests that would be good to run, but I don't have any great ideas myself right now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
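In code form, the suggestion amounts to pairing the two cleanups, roughly as below; both callees are stubs standing in for the backend routines of the same intent, so this is a sketch rather than the patch's code:

    /* Stubs standing in for the backend routines discussed above. */
    void LWLockReleaseAll(void) { /* drop any LWLocks still held */ }
    void pgstat_report_wait_end(void) { /* clear the advertised wait state */ }

    /* Wherever error cleanup releases all LWLocks, e.g. from
     * AbortTransaction(), also reset the advertised wait event. */
    void
    abort_cleanup(void)
    {
        LWLockReleaseAll();
        pgstat_report_wait_end();
    }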
On Wed, Feb 24, 2016 at 7:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Feb 22, 2016 at 10:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> I wouldn't bother tinkering with it at this point. The value isn't
> >> going to be recorded on disk anywhere, so it will be easy to change
> >> the way it's computed in the future if we ever need to do that.
> >>
> >
> > Okay. Find the rebased patch attached with this mail. I have moved
> > this patch to upcoming CF.
>
> I would call the functions pgstat_report_wait_start() and
> pgstat_report_wait_end() instead of pgstat_report_start_waiting() and
> pgstat_report_end_waiting().
>
> I think pgstat_get_wait_event_type should not return HWLock, a term
> that appears nowhere in our source tree at present. How about just
> "Lock"?
>
> I think the wait event types should be documented - and the wait
> events too, perhaps.
>
By the above, do you mean to say that we should document the name of each wait event type and wait event? Documenting wait event names is okay, but we have approximately 65~70 wait events (considering individual LWLocks, tranches, locks, etc.). If we want to document all the events, then I think we can have a separate table with columns (wait event name, description) just below pg_stat_activity, and have a link to that table in the wait_event row of the pg_stat_activity table. Does that match your thought, or do you have something else in mind?
On Thu, Feb 25, 2016 at 2:54 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>
> Could you enhance the documentation about the difference between "wait
> event type name" and "wait event name" (examples?)?
>
I am planning to add the possible values for each wait event type and wait event and will add a few examples as well. Let me know if you want to see something else with respect to documentation?
On Mon, Feb 29, 2016 at 8:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Feb 25, 2016 at 2:54 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>>
>> Could you enhance the documentation about the difference between "wait
>> event type name" and "wait event name" (examples?)?
>
> I am planning to add possible values for each of the wait event type and
> wait event and will add few examples as well. Let me know if you want to
> see something else with respect to documentation?

That's pretty much what I had in mind. I imagine that most of the list of wait events will be a list of the individual LWLocks, which I suppose then will each need a brief description.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 24, 2016 at 7:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Feb 22, 2016 at 10:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> I wouldn't bother tinkering with it at this point. The value isn't
> >> going to be recorded on disk anywhere, so it will be easy to change
> >> the way it's computed in the future if we ever need to do that.
> >>
> >
> > Okay. Find the rebased patch attached with this mail. I have moved
> > this patch to upcoming CF.
>
> I would call the functions pgstat_report_wait_start() and
> pgstat_report_wait_end() instead of pgstat_report_start_waiting() and
> pgstat_report_end_waiting().
>
Changed as per suggestion and made these functions inline.
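(For context, a minimal sketch of what such inline reporting functions can look like; this is an illustration rather than the committed code, and the field names are approximations:)

    static inline void
    pgstat_report_wait_start(uint8 classId, uint16 eventId)
    {
        volatile PGPROC *proc = MyProc;

        /* Do nothing if tracking is disabled or we have no PGPROC yet. */
        if (!pgstat_track_activities || !proc)
            return;

        /*
         * A single aligned four-byte store: readers of pg_stat_activity may
         * see a slightly stale value, but never a torn one.
         */
        proc->wait_event_info = ((uint32) classId << 24) | eventId;
    }

    static inline void
    pgstat_report_wait_end(void)
    {
        volatile PGPROC *proc = MyProc;

        if (!pgstat_track_activities || !proc)
            return;

        proc->wait_event_info = 0;
    }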
> I think pgstat_get_wait_event_type should not return HWLock, a term
> that appears nowhere in our source tree at present. How about just
> "Lock"?
>
Changed as per suggestion.
> I think the wait event types should be documented - and the wait
> events too, perhaps.
>
As discussed upthread, I have added documentation for all the possible wait events and an example. Some of the LWLocks like AsyncQueueLock and AsyncCtlLock are used for quite similar purposes, so I have kept their explanations the same.
> Maybe it's worth having separate wait event type names for lwlocks and
> lwlock tranches. We could report LWLockNamed and LWLockTranche and
> document the difference: "LWLockNamed indicates that the backend is
> waiting for a specific, named LWLock. The event name is the name of
> that lock. LWLockTranche indicates that the backend is waiting for
> any one of a group of locks with similar function. The event name
> identifies the general purpose of locks in that group."
>
Changed as per suggestion.
> There's no requirement that every session have every tranche
> registered. I think we should consider displaying "extension" for any
> tranche that's not built-in, or at least for tranches that are not
> registered (rather than "unknown wait event").
>
> + if (lock->tranche == 0 && lockId < NUM_INDIVIDUAL_LWLOCKS)
>
> Isn't the second part automatically true at this point?
>
Fixed.
> The changes to LockBufferForCleanup() don't look right to me. Waiting
> for a buffer pin might be a reasonable thing to define as a wait
> event, but it shouldn't be reported as if we were waiting on the LWLock
> itself.
>
As discussed upthread, added a new wait event BufferPin for this case.
> What happens if an error is thrown while we're in a wait?
>
As discussed upthread, I have added it in AbortTransaction and wherever LWLockReleaseAll() gets called; the point to note is that we can call this function only when we are sure there is no further possibility of a wait on an LWLock.
> Does this patch hurt performance?
>
Performance tests are underway.
On Fri, Mar 4, 2016 at 7:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think the wait event types should be documented - and the wait
> > events too, perhaps.
>
> As discussed upthread, I have added documentation for all the possible wait
> events and an example. Some of the LWLocks like AsyncQueueLock and
> AsyncCtlLock are used for quite similar purposes, so I have kept their
> explanations the same.

Do you think it worth grouping rows in the "wait_event Description" table by wait event type?
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Mar 4, 2016 at 4:01 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 7:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > I think the wait event types should be documented - and the wait
>> > events too, perhaps.
>>
>> As discussed upthread, I have added documentation for all the possible wait
>> events and an example.
>
> Do you think it worth grouping rows in the "wait_event Description" table by
> wait event type?

They are already grouped (implicitly); do you mean that we should add the wait event type name to that table as well? If yes, then the only slight worry is that there will be a lot of repetition in the wait_event_type column; otherwise it is okay.
On 4 March 2016 at 04:05, Amit Kapila <amit.kapila16@gmail.com> wrote:
> As discussed upthread, I have added documentation for all the possible wait
> events and an example.
> [...]
> Performance tests are underway.

I've attached a revised version of the patch with the following corrections:

+ <para>
+ <literal>LWLockTranche</>: The backend is waiting for any one of a
+ group of locks with similar function. The <literal>wait_event</>
+ name for this type of wait identifies the general purpose of locks
+ in that group.
+ </para>

s/with similar/with a similar/

+ <row>
+ <entry><literal>ControlFileLock</></entry>
+ <entry>A server process is waiting to read or update the control file
+ or creation of a new WAL log file.</entry>
+ </row>

As the L in WAL stands for "log" anyway, I think the extra "log" word can be removed.

+ <row>
+ <entry><literal>RelCacheInitLock</></entry>
+ <entry>A server process is waiting to read or write to relation cache
+ initialization file.</entry>
+ </row>

s/to relation/to the relation/

+ <row>
+ <entry><literal>BtreeVacuumLock</></entry>
+ <entry>A server process is waiting to read or update vacuum related
+ information for Btree index.</entry>
+ </row>

s/vacuum related/vacuum-related/
s/for Btree/for a Btree/

+ <row>
+ <entry><literal>AutovacuumLock</></entry>
+ <entry>A server process which could be autovacuum worker is waiting to
+ update or read current state of autovacuum workers.</entry>
+ </row>

s/could be autovacuum/could be that an autovacuum/
s/read current/read the current/

(discussed with Amit offline about other sources of wait, and he suggested autovacuum launcher, so I've added that in too)

+ <row>
+ <entry><literal>AutovacuumScheduleLock</></entry>
+ <entry>A server process is waiting on another process to ensure that
+ the table it has selected for vacuum still needs vacuum.
+ </entry>
+ </row>

s/for vacuum/for a vacuum/
s/still needs vacuum/still needs vacuuming/

+ <row>
+ <entry><literal>SyncScanLock</></entry>
+ <entry>A server process is waiting to get the start location of scan
+ on table for synchronized scans.</entry>
+ </row>

s/of scan/of a scan/
s/on table/on a table/

+ <row>
+ <entry><literal>SerializableFinishedListLock</></entry>
+ <entry>A server process is waiting to access list of finished
+ serializable transactions.</entry>
+ </row>

s/to access list/to access the list/

+ <row>
+ <entry><literal>SerializablePredicateLockListLock</></entry>
+ <entry>A server process is waiting to perform operation on list of
+ locks held by serializable transactions.</entry>
+ </row>

s/perform operation/perform an operation/
s/on list/on a list/

+ <row>
+ <entry><literal>AutoFileLock</></entry>
+ <entry>A server process is waiting to update <filename>postgresql.auto.conf</>
+ file.</entry>
+ </row>

s/to update/to update the/

+ <row>
+ <entry><literal>CommitTsLock</></entry>
+ <entry>A server process is waiting to read or update the last value
+ set for transaction timestamp.</entry>
+ </row>

s/for transaction/for the transaction/

+ <row>
+ <entry><literal>clog</></entry>
+ <entry>A server process is waiting on any one of the clog buffer locks
+ to read or write the clog page in pg_clog subdirectory.</entry>
+ </row>

s/page in/page in the/

The 6 rows that follow that one could apply this same correction.

+ <row>
+ <entry><literal>wal_insert</></entry>
+ <entry>A server process is waiting on any one of the wal_insert locks
+ to write data in wal buffers.</entry>
+ </row>

s/data in/data into/

+ <row>
+ <entry><literal>buffer_content</></entry>
+ <entry>A server process is waiting on any one of the buffer_content
+ locks to read or write data in shared buffers.</entry>
+ </row>

s/data in/data into/

+ <row>
+ <entry><literal>buffer_io</></entry>
+ <entry>A server process is waiting on any one of the buffer_io locks
+ to allow other process to complete the I/O on buffer.</entry>
+ </row>

s/allow other process/allow another process/

+ <row>
+ <entry><literal>buffer_mapping</></entry>
+ <entry>A server process is waiting on any one of the buffer_mapping
+ locks to associate a data block with buffer in buffer pool.</entry>
+ </row>

s/with buffer/with a buffer/
s/in buffer pool/in the buffer pool/

+ <row>
+ <entry><literal>lock_manager</></entry>
+ <entry>A server process is waiting on any one of the lock_manager locks
+ to add or examine locks for backends or waiting on joining/exiting
+ parallel group to perform parallel query.</entry>
+ </row>

s/exiting parallel/exiting a parallel/
s/perform parallel/perform a parallel/

+ <row>
+ <entry><literal>relation</></entry>
+ <entry>A server process is waiting to acquire lock on a relation.</entry>
+ </row>

s/acquire lock/acquire a lock/

+ <row>
+ <entry><literal>tuple</></entry>
+ <entry>A server process is waiting to acquire a lock on tuple.</entry>
+ </row>

s/on tuple/on a tuple/

+ <row>
+ <entry><literal>virtualxid</></entry>
+ <entry>A server process is waiting to acquire virtual xid lock.</entry>
+ </row>

s/acquire virtual/acquire a virtual/

The 5 rows that follow the above one can apply a similar substitution (i.e. s/acquire/acquire a/ )

+ <para>
+ For tranches registered by extensions, the name is specified by extension
+ and the same will be displayed as <structfield>wait_event</>. It is quite
+ possible that user has registered tranche in one of the backends (by
+ having allocation in dynamic shared memory) in which case other backends
+ won't have that information, so we display <literal>extension</> for such
+ cases.
+ </para>

s/and the same will/and this will/
s/has registered tranche/has registered the tranche/

Regards

Thom
On Fri, Mar 4, 2016 at 4:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 4:01 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> Do you think it worth grouping rows in the "wait_event Description" table
>> by wait event type?
>
> They are already grouped (implicitly); do you mean that we should add the
> wait event type name to that table as well?

Yes.

> If yes, then the only slight worry is that there will be a lot of repetition
> in the wait_event_type column; otherwise it is okay.

There is a morerows attribute of the entry tag in DocBook SGML; it behaves like rowspan in HTML.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
On 4 March 2016 at 13:35, Thom Brown <thom@linux.com> wrote:
> I've attached a revised version of the patch with the following corrections:
> [...]
> s/could be autovacuum/could be that an autovacuum/
> s/read current/read the current/

Reading one of the corrections back, I wasn't happy with it, so I've changed:

"A server process which could be that an autovacuum worker or autovacuum launcher is waiting to update or read the current state of autovacuum workers."

to

"A server process which could be an autovacuum worker or autovacuum launcher waiting to update or read the current state of autovacuum workers."

Thom
On 4 March 2016 at 13:41, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 4:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> If yes, then the only slight worry is that there will be a lot of repetition
>> in the wait_event_type column, otherwise it is okay.
>
> There is morerows attribute of entry tag in Docbook SGML, it behaves like
> rowspan in HTML.

+1

Yes, we do this elsewhere in the docs. And it is difficult to look through the wait event names at the moment.

I'm also not keen on all the "A server process is waiting" all the way down the list.

Thom
On Fri, Mar 4, 2016 at 7:23 PM, Thom Brown <thom@linux.com> wrote:
> On 4 March 2016 at 13:41, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> >>
> >> If yes, then the only slight worry is that there will be a lot of repetition in
> >> the wait_event_type column, otherwise it is okay.
> >
> >
> > There is morerows attribute of entry tag in Docbook SGML, it behaves like
> > rowspan in HTML.
>
> +1
>
I will try to use morerows in documentation.
> Yes, we do this elsewhere in the docs. And it is difficult to look
> through the wait event names at the moment.
>
> I'm also not keen on all the "A server process is waiting" all the way
> down the list.
>
How about giving the column name as "Wait For" instead of "Description", and then using text like "finding or allocating space in shared memory"?
On Fri, Mar 4, 2016 at 7:11 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
>>
>> If yes, then the only slight worry is that there will be a lot of repetition in the wait_event_type column, otherwise it is okay.
>
>
> There is morerows attribute of entry tag in Docbook SGML, it behaves like rowspan in HTML.
>
Thanks for the suggestion. I have updated the patch to include wait_event_type information in the wait_event table.
As asked above by Robert, below is performance data with the patch.
M/C Details
------------------
IBM POWER-8, 24 cores, 192 hardware threads
RAM = 492GB
Performance Data
----------------------------
min_wal_size=15GB
max_wal_size=20GB
checkpoint_timeout =15min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
pgbench read-only (median of 3, 5-min runs)
clients | BASE | PATCH | % |
1 | 19703.549206 | 19992.141542 | 1.4646718364 |
8 | 120105.542849 | 127717.835367 | 6.3380026745 |
64 | 487334.338764 | 495861.7211254 | 1.7498012521 |
The read-only data shows some improvement with the patch, but I think this is mostly attributable to run-to-run variation.
pgbench read-write (median of 3, 30-min runs)
clients | BASE | PATCH | % |
1 | 1703.275728 | 1696.568881 | -0.3937616729 |
8 | 8884.406185 | 9442.387472 | 6.2804567394 |
64 | 32648.82798 | 32113.002416 | -1.6411785572 |
In the above data, the read-write case shows a small regression (1.6%) at the higher client count, but when I ran that test individually, the difference was 0.5%. I think it is mostly attributable to run-to-run variation, as we see with the read-only tests.
Thanks to Mithun C Y for doing performance testing of this patch.
As this patch adds a 4-byte variable to the shared memory structure PGPROC, it is susceptible to memory alignment issues for shared buffers, as discussed in thread [1], but in general the performance data doesn't indicate any regression.
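(For reference, runs of this shape are typically driven along the following lines; the exact pgbench options behind the numbers above aren't stated in the thread, so the flags and database name here are illustrative:)

    $ pgbench -S -c 64 -j 64 -T 300 pgbench      # read-only (select-only), 5-minute run
    $ pgbench -c 64 -j 64 -T 1800 pgbench        # read-write (TPC-B-like), 30-minute run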
On Wed, Mar 9, 2016 at 8:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Thanks for the suggestion. I have updated the patch to include
> wait_event_type information in the wait_event table.

I think we should remove "a server process is" from all of these entries.

Also, I think this kind of thing should be tightened up:

+ <entry>A server process is waiting on any one of the commit_timestamp
+ buffer locks to read or write the commit_timestamp page in the
+ pg_commit_ts subdirectory.</entry>

I'd just write: Waiting to read or write a commit timestamp buffer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 9 March 2016 at 13:31, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 7:11 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> There is morerows attribute of entry tag in Docbook SGML, it behaves like
>> rowspan in HTML.
>
> Thanks for the suggestion. I have updated the patch to include
> wait_event_type information in the wait_event table.
> [...]
The new patch looks fine with regard to grammar and spelling.
However, the new row-spanning layout isn't declared correctly, as you've over-counted by 1 in each morerows attribute, possibly because you equated it to the rowspan attribute in HTML, which spans exactly the number of rows you specify. "morerows" isn't the total number of rows, but how many more rows, in addition to the current one, the entry will span.
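(To illustrate the semantics, here is a hypothetical fragment, not taken from the patch: with morerows="2", the wait_event_type entry spans three rows in total, the current row plus two more.)

    <row>
     <entry morerows="2"><literal>LWLockNamed</></entry>
     <entry><literal>ShmemIndexLock</></entry>
     <entry>Waiting to find or allocate space in shared memory.</entry>
    </row>
    <row>
     <entry><literal>OidGenLock</></entry>
     <entry>Waiting to allocate or assign an OID.</entry>
    </row>
    <row>
     <entry><literal>XidGenLock</></entry>
     <entry>Waiting to allocate or assign a transaction id.</entry>
    </row>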
And yes, as Robert mentioned, please can we remove the "A server process is" repetition?
Thom
On Wed, Mar 9, 2016 at 7:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Mar 9, 2016 at 8:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Thanks for the suggestion. I have updated the patch to include wait_event_type information in the wait_event table.
>
> I think we should remove "a server process is" from all of these entries.
>
> Also, I think this kind of thing should be tightened up:
>
> + <entry>A server process is waiting on any one of the commit_timestamp
> + buffer locks to read or write the commit_timestamp page in the
> + pg_commit_ts subdirectory.</entry>
>
> I'd just write: Waiting to read or write a commit timestamp buffer.
>
Okay, changed as per suggestion and fixed the morerows issue pointed out by Thom.
On Thu, Mar 10, 2016 at 12:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, changed as per suggestion and fixed the morerows issue pointed out
> by Thom.

Committed with some further editing. In particular, the way you determined whether we could safely access the tranche information for any given ID was wrong; please check over what I did and make sure that isn't also wrong.

Whew, this was a long process, but we got there. Some initial pgbench testing shows that by far the most common wait event observed on that workload is WALWriteLock, which is pretty interesting: perf -e cs and LWLOCK_STATS let you measure the most *frequent* wait events, but that ignores duration. Sampling pg_stat_activity tells you which things you're spending the most *time* waiting for, which is awfully neat.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
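(As an illustration of that sampling idea, not something shipped with the patch: repeatedly running a query along these lines and accumulating the counts over many samples approximates where backends spend their waiting time.)

    SELECT wait_event_type, wait_event, count(*) AS backends
    FROM pg_stat_activity
    WHERE wait_event IS NOT NULL
    GROUP BY wait_event_type, wait_event
    ORDER BY backends DESC;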
On Fri, Mar 11, 2016 at 12:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>
> Committed with some further editing. In particular, the way you
> determined whether we could safely access the tranche information for
> any given ID was wrong; please check over what I did and make sure
> that isn't also wrong.
There are a few typos which I have tried to fix with the attached patch. Can you tell me what was wrong with the way it was done in the patch?
@@ -4541,9 +4542,10 @@ AbortSubTransaction(void)
*/
LWLockReleaseAll();
+ pgstat_report_wait_end();
+ pgstat_progress_end_command();
AbortBufferIO();
UnlockBuffers();
- pgstat_progress_end_command();
/* Reset WAL record construction state */
XLogResetInsertion();
@@ -4653,6 +4655,9 @@ AbortSubTransaction(void)
*/
XactReadOnly = s->prevXactReadOnly;
+ /* Report wait end here, when there is no further possibility of wait */
+ pgstat_report_wait_end();
+
RESUME_INTERRUPTS();
}
AbortSubTransaction() does call pgstat_report_wait_end() twice; is this intentional? I have kept it at the end because there is a chance that in-between APIs can again set the state to wait, and also because by that time we have not released buffer pins and heavyweight locks, so I am not sure it makes sense to report the wait end at the earlier stage. I have noticed that in WaitOnLock(), the wait end is set on error, but thinking about it again, it seems it would be better to set it at the end of AbortTransaction/AbortSubTransaction. What do you think?
On Fri, Mar 11, 2016 at 1:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> There are a few typos which I have tried to fix with the attached patch.
> Can you tell me what was wrong with the way it was done in the patch?

LWLocks can be taken during transaction abort and you might have to wait, so keeping the wait state around from before you started to abort the transaction doesn't make any sense. You aren't waiting at that point, and if you get stuck in some part of the code that isn't instrumented to expose wait instrumentation, you certainly don't want pg_stat_activity to still reflect the way things were when you were previously waiting on something else. It's essential that we clear the wait state as early as possible - right after LWLockReleaseAll - because at that point we are no longer waiting. Then you no longer need to do it later on.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello,
Here's a small docpatch to fix two typos in the new documentation.
Regards,
Thomas
On 10 March 2016 at 18:58, Robert Haas <robertmhaas@gmail.com> wrote:
> Committed with some further editing. [...]
>
> Whew, this was a long process, but we got there. Some initial pgbench
> testing shows that by far the most common wait event observed on that
> workload is WALWriteLock, which is pretty interesting.

It turns out that I hate the fact that the Wait Event Name column is effectively in a random order. If a user sees a message, and goes to look up the value in the wait_event description table, they either have to search with their browser/PDF viewer, or scan down the list looking for the item they're looking for, not knowing how far down it will be. The same goes for wait event type.

I've attached a patch to sort the list by wait event type and then wait event name. It also corrects minor SGML indenting issues.

Thom
On 15 March 2016 at 14:00, Thom Brown <thom@linux.com> wrote:
> I've attached a patch to sort the list by wait event type and then
> wait event name. It also corrects minor SGML indenting issues.

Let's try that again, this time without duplicating a row, and omitting another.

Thom
On Tue, Mar 15, 2016 at 9:17 AM, Thomas Reiss <thomas.reiss@dalibo.com> wrote:
> Here's a small docpatch to fix two typos in the new documentation.

Thanks, committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 15, 2016 at 10:41 AM, Thom Brown <thom@linux.com> wrote:
>> It turns out that I hate the fact that the Wait Event Name column is
>> effectively in a random order. [...]
>>
>> I've attached a patch to sort the list by wait event type and then
>> wait event name. It also corrects minor SGML indenting issues.
>
> Let's try that again, this time without duplicating a row, and omitting another.

Hmm, I'm not sure this is a good idea. I don't think it's crazy to report the locks in the order they are defined in the source code; many people will be familiar with that order, and it might make the list easier to maintain. On the other hand, I'm also not sure this is a bad idea. Alphabetical order is a widely-used standard. So, I'm going to abstain from any strong position here and ask what other people think of Thom's proposed change.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Tue, Mar 15, 2016 at 10:41 AM, Thom Brown <thom@linux.com> wrote:
>> It turns out that I hate the fact that the Wait Event Name column is
>> effectively in a random order. [...]
>
> Hmm, I'm not sure this is a good idea. I don't think it's crazy to
> report the locks in the order they are defined in the source code;
> many people will be familiar with that order, and it might make the
> list easier to maintain. On the other hand, I'm also not sure this is
> a bad idea. Alphabetical order is a widely-used standard. So, I'm
> going to abstain from any strong position here and ask what other
> people think of Thom's proposed change.

I think using implementation order is crazy. +1 for alphabetical. If this really makes devs' lives more difficult (and I disagree that it does), let's reorder the items in the source code too.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 16, 2016 at 5:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 15, 2016 at 9:17 AM, Thomas Reiss <thomas.reiss@dalibo.com> wrote:
>> Here's a small docpatch to fix two typos in the new documentation.
>
> Thanks, committed.

I just had a quick look at the wait_event committed, and I got a little bit disappointed that we actually do not track latch waits yet, which is perhaps not that useful actually as long as an event name is not associated with a given latch wait when calling WaitLatch. I am not asking for that with this release, this is just for the archive's sake, and I don't mind coding that myself anyway if need be. The LWLock tracking facility looks rather cool btw :)

--
Michael
On Thu, Mar 17, 2016 at 7:33 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Wed, Mar 16, 2016 at 5:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Mar 15, 2016 at 9:17 AM, Thomas Reiss <thomas.reiss@dalibo.com> wrote:
> >> Here's a small docpatch to fix two typos in the new documentation.
> >
> > Thanks, committed.
>
> I just had a quick look at the wait_event committed, and I got a
> little bit disappointed that we actually do not track latch waits yet,
> which is perhaps not that useful actually as long as an event name is
> not associated with a given latch wait when calling WaitLatch.
>
You are right, and a few more, like I/O waits, are also left out of the initial patch intentionally, just to get the base functionality in. One of the reasons I have not kept them in the patch is that it needs much more thorough performance testing (even though theoretically the overhead shouldn't be there) with specific kinds of tests.
>
> I am not
> asking for that with this release, this is just for the archive's
> sake, and I don't mind coding that myself anyway if need be.
Thanks; feel free to pick it up in the next release (or for this release, if everybody feels strongly about having it in this release) if you don't see any patch for the same.
On Wed, Mar 16, 2016 at 10:03 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> I just had a quick look at the wait_event committed, and I got a
> little bit disappointed that we actually do not track latch waits yet,
> which is perhaps not that useful actually as long as an event name is
> not associated with a given latch wait when calling WaitLatch. I am not
> asking for that with this release, this is just for the archive's
> sake, and I don't mind coding that myself anyway if need be. The
> LWLock tracking facility looks rather cool btw :)

Yes, I'm quite excited about this. I think it's pretty darn awesome.

I doubt that it would be useful to treat a latch wait as an event. It's too generic. You'd want something more specific, like waiting for WAL to arrive or waiting for a tuple from a parallel worker or waiting to write to the client. It'll take some thought to figure out how to organize and categorize that stuff, but it'll also be wicked cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 17, 2016 at 11:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I doubt that it would be useful to treat a latch wait as an event.
> It's too generic. You'd want something more specific, like waiting
> for WAL to arrive or waiting for a tuple from a parallel worker or
> waiting to write to the client. It'll take some thought to figure out
> how to organize and categorize that stuff, but it'll also be wicked
> cool.

FWIW, my instinctive thought on the matter is to report the event directly in WaitLatch() via an event name the caller provides directly to it. The category of the event is then defined automatically, as we would know its origin. The code path defining the origin point from which an event type comes is the critical thing, I think, to define an event category. The LWLock events are doing that in lwlock.c.

--
Michael
On Thu, Mar 17, 2016 at 9:22 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> FWIW, my instinctive thought on the matter is to report the event
> directly in WaitLatch() via an event name the caller provides directly
> to it. The category of the event is then defined automatically, as we
> would know its origin. The code path defining the origin point from
> which an event type comes is the critical thing, I think, to define an
> event category. The LWLock events are doing that in lwlock.c.

I'm very skeptical of grouping everything that waits using latches as a latch wait, but maybe it's OK to do it that way. I was thinking more of adding categories like "client wait" with events like "client read" and "client write".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-03-18 11:01:04 -0400, Robert Haas wrote:
> I'm very skeptical of grouping everything that waits using latches as
> a latch wait, but maybe it's OK to do it that way. I was thinking
> more of adding categories like "client wait" with events like "client
> read" and "client write".

+1. I think categorizing latch waits together will be pretty much meaningless. We use the same latch for a lot of different things, and the context in which we're waiting is the important bit.

Andres
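(To make that concrete, a hypothetical sketch of a caller-supplied wait event; the extra argument and the WAIT_EVENT_CLIENT_READ name are invented for illustration and are not the 9.6 API:)

    /*
     * Hypothetical: the caller states what this latch wait means, so
     * pg_stat_activity can show a "Client"/"client read" wait instead of a
     * generic latch wait.
     */
    rc = WaitLatchOrSocket(MyLatch,
                           WL_LATCH_SET | WL_SOCKET_READABLE,
                           MyProcPort->sock, -1L,
                           WAIT_EVENT_CLIENT_READ);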
Hi!
Since the patch for exposing current wait event information in PGPROC was committed, it has become possible to collect wait event statistics using sampling. Though I'm not a fan of this approach, it is still useful and definitely better than nothing.

In PostgresPro, we actually already had it. Now it's too late to include something new in 9.6, which is why I've reworked it and published it at github as an extension for 9.6: https://github.com/postgrespro/pg_wait_sampling/

Hopefully, it could be considered as a contrib module for 9.7.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Mar 24, 2016 at 7:28 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> Since the patch for exposing current wait event information in PGPROC was
> committed, it has become possible to collect wait event statistics using
> sampling. Though I'm not a fan of this approach, it is still useful and
> definitely better than nothing.
> In PostgresPro, we actually already had it. Now it's too late to include
> something new in 9.6, which is why I've reworked it and published it at
> github as an extension for 9.6: https://github.com/postgrespro/pg_wait_sampling/
> Hopefully, it could be considered as a contrib module for 9.7.

Spiffy. That was fast.

I think the sampling approach is going to be best on very large systems under heavy load; I suspect counting every event is going to be too expensive - especially once we add more events for things like block read and client wait. It is quite possible that we can do other things when tracing individual sessions or in scenarios where some performance degradation is OK. But I like the idea of doing the sampling thing first - I think that will be very useful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello,
I noticed that a small optimization is possible in the flow of wait stat reporting for the LWLocks, when pgstat_track_activities is disabled.

If the check for pgstat_track_activities is done before invoking LWLockReportWaitStart() instead of inside pgstat_report_wait_start(), it can save executing some statements whose end result is a no-op because pgstat_track_activities = false.

I also see that there are other callers of pgstat_report_wait_start(), which would also have to add a check for pgstat_track_activities if the check were removed from pgstat_report_wait_start(). Is the pgstat_track_activities check inside pgstat_report_wait_start() because of some protocol being followed? Would it be worth making that change?
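(Concretely, the proposal amounts to something like this at the call sites; illustrative only:)

    /* Proposed: skip the call entirely when tracking is disabled. */
    if (pgstat_track_activities)
        LWLockReportWaitStart(lock);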
Regards,
Neha
On Fri, Aug 26, 2016 at 4:59 AM, neha khatri <nehakhatri5@gmail.com> wrote:
> I noticed that a small optimization is possible in the flow of wait stat
> reporting for the LWLocks, when pgstat_track_activities is disabled.
>
> If the check for pgstat_track_activities is done before invoking
> LWLockReportWaitStart() instead of inside pgstat_report_wait_start(), it can
> save executing some statements whose end result is a no-op because
> pgstat_track_activities = false.
This is only called in the slow path, i.e. when we have to wait or sleep, so saving a few instructions will not make much difference. Note that both the functions you have mentioned are inline functions. However, if you want, you can try it that way and see if it leads to any gain.
> I also see that there are other callers of pgstat_report_wait_start(),
> which would also have to add a check for pgstat_track_activities if the
> check were removed from pgstat_report_wait_start(). Is the
> pgstat_track_activities check inside pgstat_report_wait_start() because of
> some protocol being followed?
Not as such, but that variable is mainly used in pgstat.c/.h only.
> Would it be worth making that change?
-1 for the proposed change.