Thread: Rethinking stats communication mechanisms
In view of my oprofile results
http://archives.postgresql.org/pgsql-hackers/2006-06/msg00859.php
I'm thinking we need some major surgery on the way that the stats collection mechanism works.

It strikes me that we are using a single communication mechanism to handle what are really two distinct kinds of data:

* Current-state information, eg, what backends are alive and what commands they are currently working on. Ideally we'd like this type of info to be 100% up-to-date. But once a particular bit of information (eg a command string) is obsolete, it's not of interest anymore.

* Event counts. These accumulate and so past information is still important. On the other hand, it's not so critical that the info be completely up-to-date --- the central counters can lag behind a bit, so long as events eventually get counted.

I believe the stats code was designed with the second case in mind, but we've abused it to handle the first case, and that's why we've now got performance problems.

If we are willing to assume that the current-state information is of fixed maximum size, we could store it in shared memory. (This suggestion already came up in the recent thread about ps_status, and I think it's been mentioned before too --- but my point here is that we have to separate this case from the event-counting case.) The only real restriction we'd be making is that we can only show the first N characters of the current command string, but we're already accepting that limitation in the existing stats code. (And we could make N whatever we wanted, without worrying about UDP datagram limits.)

I'm envisioning either adding fields to the PGPROC array, or perhaps better using a separate array with an entry for each backend ID. Backends would write their status info into this array and any interested backend could read it out again. The stats collector process needn't be involved at all AFAICS.

This eliminates any process-dispatch overhead to report command start or command termination. Instead we'd have some locking overhead, but contention ought to be low enough that that's not a serious problem. I'm assuming a separate lock for each array entry so that backends don't contend with each other to update their entries; contention occurs only when someone is actively reading the information. We should probably use LWLocks not spinlocks because the time taken to copy a long command string into the shared area would be longer than we ought to hold a spinlock (but this seems a bit debatable given the expected low contention ... any thoughts?)

The existing stats collection mechanism seems OK for event counts, although I'd propose two changes: one, get rid of the separate buffer process, and two, find a way to emit event reports in a time-driven way rather than once per transaction commit. I'm a bit vague about how to do the latter at the moment.

Comments?

			regards, tom lane
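A minimal C sketch of what such a per-backend status array might look like. All names here are illustrative rather than actual PostgreSQL identifiers, plain pid_t/time_t stand in for the real backend ID and timestamp types, and the per-entry LWLock is represented only by a comment:

#include <sys/types.h>   /* pid_t */
#include <time.h>        /* time_t stands in for a real timestamp type */

#define STATUS_CMD_LEN 1024          /* "N" can be whatever we want */

typedef struct BackendStatusEntry
{
    /* In practice this entry would be guarded by a per-entry LWLock,
     * taken exclusively by the owning backend when it updates the entry
     * and shared by anyone reading pg_stat_activity. */
    pid_t   pid;                     /* owning backend, 0 if slot unused */
    time_t  backend_start;           /* when the backend started */
    time_t  command_start;           /* when the current command began */
    char    command[STATUS_CMD_LEN]; /* possibly truncated command text */
} BackendStatusEntry;

/* One entry per backend ID, allocated in shared memory at postmaster
 * start (e.g. BackendStatusEntry statusArray[MaxBackends]).  A backend
 * writes only its own entry; the stats collector is not involved. */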
> It strikes me that we are using a single communication mechanism to
> handle what are really two distinct kinds of data:

Interesting. I recently read a paper on how to get rid of locks for this kind of pattern.

* For the Command String

- Problem: need to display the currently executing command in the ps list.
- Will only be of use if the command is taking a long, long time. So, it need not be realtime; no problem if the data comes with a little delay, or not at all if the command executes quickly. People are only interested in the currently executing command to answer questions like "what query has my server grinding?"

Point: the currently executing query is only interesting to display if it's currently executing. If it's not, it's in the log (if enabled).

So, the proposal:

Each backend has a shared memory area in which to store:
- the currently executing command (like in your proposal)
- a timestamp
- a counter

On executing a command, Backend stores the command string, then overwrites the counter with (counter + 1) and with the timestamp of command start.

Periodically, like every N seconds, a separate process reads the counter, then reads the data, then reads the counter again. If the counter value changed, the process is repeated. If the counter value did not change, the command string did not change either, so it's valid and can be used.

Other thoughts:

If the backend process itself should update its process title, and this operation is costly, it should only be done if the current query has been running for more than T seconds. However, syscalls for getting the current time are costly. A separate process can update a counter in shared memory with the current time every N seconds, and the backend can check it.

The main point is that if this value is written to every few seconds but read often by only one process, or written often but read seldom, there will not be a lot of interprocessor cache thrashing on it.
PFC <lists@peufeu.com> writes:
> So, the proposal :
> On executing a command, Backend stores the command string, then
> overwrites the counter with (counter + 1) and with the timestamp of
> command start.
> Periodically, like every N seconds, a separate process reads the counter,
> then reads the data, then reads the counter again.

Well, it wouldn't be "periodic", it'd be "whenever someone reads pg_stat_activity". I was considering solutions like this, but I'm a little disturbed by the possibility that the would-be reader might loop indefinitely if the source backend is constantly changing its entry. Still, slow reads of pg_stat_activity might be a good tradeoff for taking overhead out of the update operation.

BTW, I think the writer would actually need to bump the counter twice, once before and once after it modifies its stats area. Else there's no way to detect that you've copied a partially-updated stats entry.

> If the backend process itself should update its process title, and this
> operation is costly, it should only be done if the current query has been
> running for more than T seconds.

Managing that would cost more than just doing it, I think.

			regards, tom lane
PFC <lists@peufeu.com> writes:

> - Will only be of use if the command is taking a long, long time.
> So, it need not be realtime; no problem if the data comes with a
> little delay, or not at all if the command executes quickly.

I would dispute this point. Picture a system running a very short query very very often. It may still be the main problem, may even be taking 90+% of the cpu time. If you got an accurate snapshot of all the currently executing queries you'll see it popping up suspiciously often.

--
greg
Greg Stark <gsstark@mit.edu> writes:

> PFC <lists@peufeu.com> writes:
>
>> - Will only be of use if the command is taking a long, long time.
>> So, it need not be realtime; no problem if the data comes with a
>> little delay, or not at all if the command executes quickly.
>
> I would dispute this point. Picture a system running a very short query
> very very often. It may still be the main problem, may even be taking
> 90+% of the cpu time. If you got an accurate snapshot of all the
> currently executing queries you'll see it popping up suspiciously often.

Yeah, but if you turn on query logging in that case you'll see the bajillions of short queries, so you don't need the accurate snapshot to diagnose that.

-Doug
Douglas McNaught <doug@mcnaught.org> writes:

> Yeah, but if you turn on query logging in that case you'll see the
> bajillions of short queries, so you don't need the accurate snapshot
> to diagnose that.

Query logging on a production OLTP machine? a) that would be a huge performance drain on the production system b) it would produce so much logging that it would take a significant amount of disk and i/o resources just to handle and c) you would need to implement special purpose tools just to make sense of these huge logs.

--
greg
Greg Stark <gsstark@mit.edu> writes:

> Douglas McNaught <doug@mcnaught.org> writes:
>
>> Yeah, but if you turn on query logging in that case you'll see the
>> bajillions of short queries, so you don't need the accurate snapshot
>> to diagnose that.
>
> Query logging on a production OLTP machine? a) that would be a huge
> performance drain on the production system b) it would produce so much
> logging that it would take a significant amount of disk and i/o resources
> just to handle and c) you would need to implement special purpose tools
> just to make sense of these huge logs.

(a) and (b): of course you would only do it on a temporary basis for problem diagnosis. We do that with our production apps where I work (when absolutely necessary).

(c): Perl. :)

Ideally, you'd find the query storm problem in load testing before you ever got to production. I hope to someday visit that planet--it must be nice.

-Doug
Douglas McNaught <doug@mcnaught.org> writes:

> (a) and (b): of course you would only do it on a temporary basis for
> problem diagnosis.

Temporary or not it isn't really an option when you're dealing with high volumes. You could imagine a setup where say, 1% of page requests randomly turn on debugging to get a random sample of database traffic. There are always solutions.

But my point is that there's no reason to think that only long queries are useful for snapshots. Short queries are equally capable of consuming resources in aggregate. Hiding some subset of queries assuming they're uninteresting is only going to produce deceptive results.

> Ideally, you'd find the query storm problem in load testing before you
> ever got to production. I hope to someday visit that planet--it must
> be nice.

Indeed, when you get there send me a postcard :)

--
greg
> The existing stats collection mechanism seems OK for event
> counts, although I'd propose two changes: one, get rid of the
> separate buffer process, and two, find a way to emit event
> reports in a time-driven way rather than once per transaction
> commit. I'm a bit vague about how to do the latter at the moment.

Might it not be a win to also store "per backend global values" in the shared memory segment? Things like "time of last command", "number of transactions executed in this backend", "backend start time" and other values that are fixed-size? You can obviously not do it for things like per-table values, since the size can't be predicted, but all per-backend counters that are fixed size should be able to do this, I think.

And if it's just a counter, it should be reasonably safe to just do the increment operation without locking, since there's only one writer for each process. That should have a much lower overhead than UDP or whatever to the stats process, no?

It might be worthwhile to add a section for things like bgwriter (and possibly the archiver?) to deliver statistics that we can add statistics views for. (They can obviously not use a standard backend "struct" for this since they'd have completely different values to report.)

//Magnus
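A rough sketch, purely illustrative, of the fixed-size per-backend counters idea; field names are hypothetical and plain C types stand in for the real PostgreSQL ones:

#include <stdint.h>
#include <time.h>

typedef struct BackendCounters
{
    time_t   backend_start_time;  /* backend start time */
    time_t   last_command_time;   /* time of last command */
    uint64_t xact_count;          /* transactions executed in this backend */
} BackendCounters;

/* The owning backend is the only writer, so at commit it can simply do
 *     my_counters->xact_count++;
 * without taking a lock; readers tolerate a slightly stale value, and
 * the counts are lost when the backend exits. */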
"Magnus Hagander" <mha@sollentuna.net> writes: > Might it not be a win to also store "per backend global values" in the > shared memory segment? Things like "time of last command", "number of > transactions executed in this backend", "backend start time" and other > values that are fixed-size? I'm including backend start time, command start time, etc under the heading of "current status" which'll be in the shared memory. However, I don't believe in trying to count events (like transaction commits) that way. If we do then we risk losing events whenever a backend quits and is replaced by another. I haven't yet looked through the stats in detail, but this approach basically presumes that we are only going to count events per-table and per-database --- I am thinking that the background stats collector process won't even keep track of individual backends anymore. (So, we'll fix the old problem of loss of backend-exit messages resulting in bogus displays.) regards, tom lane
On Sunday, 2006-06-18 at 15:09, Tom Lane wrote:
> "Magnus Hagander" <mha@sollentuna.net> writes:
> > Might it not be a win to also store "per backend global values" in the
> > shared memory segment? Things like "time of last command", "number of
> > transactions executed in this backend", "backend start time" and other
> > values that are fixed-size?

One thing that is doable in constant-size memory and would be enormously usable for us is counting top-level function calls and storing their total (and possibly also max) duration. The reason being that our production databases are accessed by clients using functions only (with some uninteresting exceptions of course), that is, calls of the form "SELECT x,y,z FROM myfunc(i,j,k)".

So reserving N*1.5 slots (N being the number of functions defined at database startup) would be ok. If more than N*0.5 functions are defined in the database lifetime, then the rest are simply ignored (not counted).

Or maybe a better approach would be to have a conf variable "number-of-functions-to-track" and a special boolean flag track_me in pg_functions. In this way you don't accidentally run out of shared mem by defining lots of new functions and then restarting the cluster.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
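A rough, purely illustrative sketch of the fixed-slot idea; the N*1.5 sizing comes from the message above, while the struct layout and field names are hypothetical:

#include <stdint.h>

typedef struct FunctionStatSlot
{
    uint32_t func_oid;       /* OID of the tracked function, 0 = free slot */
    uint64_t calls;          /* number of top-level calls */
    double   total_time_ms;  /* total duration of all calls */
    double   max_time_ms;    /* longest single call */
} FunctionStatSlot;

/* Reserve 1.5 * N slots in shared memory, N being the number of
 * functions defined at database startup; once the spare 0.5 * N slots
 * are used up, calls to newly defined functions are simply not counted.
 * Alternatively, size the array from a conf variable such as the
 * suggested "number-of-functions-to-track" and only track functions
 * whose track_me flag is set. */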
I wrote:
> PFC <lists@peufeu.com> writes:
>> So, the proposal :
>> On executing a command, Backend stores the command string, then
>> overwrites the counter with (counter + 1) and with the timestamp of
>> command start.
>> Periodically, like every N seconds, a separate process reads the counter,
>> then reads the data, then reads the counter again.

> BTW, I think the writer would actually need to bump the counter twice,
> once before and once after it modifies its stats area. Else there's
> no way to detect that you've copied a partially-updated stats entry.

Actually, neither of these ideas works: it's possible that the reader copies the entry between the two increments of the counter. Then, it won't see any reason to re-read, but nonetheless it has copied an inconsistent partially-modified entry.

Anyone know a variant of this that really works?

			regards, tom lane
> > BTW, I think the writer would actually need to bump the counter twice,
> > once before and once after it modifies its stats area. Else there's
> > no way to detect that you've copied a partially-updated stats entry.
>
> Actually, neither of these ideas works: it's possible that
> the reader copies the entry between the two increments of the
> counter. Then, it won't see any reason to re-read, but
> nonetheless it has copied an inconsistent partially-modified entry.
>
> Anyone know a variant of this that really works?

Here's a theory: If the counter is bumped to an odd number before modification, and an even number after it's done, then the reader will know it needs to re-read if the counter is an odd number.

This might be assuming too much about what the writer knows about the current contents of the counter, but since it's per-backend, I think it would work.

Regards,
Paul Bort
"Bort, Paul" <pbort@tmwsystems.com> writes: >> Anyone know a variant of this that really works? > Here's a theory: If the counter is bumped to an odd number before > modification, and an even number after it's done, then the reader will > know it needs to re-read if the counter is an odd number. Great minds think alike ;-) ... I just committed exactly that protocol. I believe it is correct, because AFAICS there are only four possible risk cases: * reader's read starts before and ends after writer's update: reader will certainly note a change in update counter. * reader's read starts before and ends within writer's update: reader will note a change in update counter. * reader's read starts within and ends after writer's update: reader will note a change in update counter. * reader's read starts within and ends within writer's update: reader will see update counter as odd. Am I missing anything? Note in particular that this protocol does not assume atomic update of the counter, so we don't need to worry about whether int is sig_atomic_t. If any of the bytes change, we have what we need. We could use a counter narrower than int, but then there's some tiny risk that the counter could wrap all the way around while the reader is blocked. regards, tom lane
> Great minds think alike ;-) ... I just committed exactly that protocol.
> I believe it is correct, because AFAICS there are only four possible
> risk cases:

Congrats!

For general culture you might be interested in reading this:

http://en.wikipedia.org/wiki/Software_transactional_memory
http://libcmt.sourceforge.net/
> > Might it not be a win to also store "per backend global values" in the
> > shared memory segment? Things like "time of last command", "number of
> > transactions executed in this backend", "backend start time" and other
> > values that are fixed-size?
>
> I'm including backend start time, command start time, etc
> under the heading of "current status" which'll be in the
> shared memory. However, I don't believe in trying to count
> events (like transaction commits) that way. If we do then we
> risk losing events whenever a backend quits and is replaced
> by another.

Well, in many cases that's not a problem. It might be interesting to for example know that a backend has run nnn transactions before ending up in the state where it is now (say, idle in transaction and idle for a long time). The part about this being "transient data" that can go away along with a backend quit would still hold true.

What were your thoughts about storing bgwriter and archiver statistics that way? Good or bad idea?

> I haven't yet looked through the stats in detail, but this
> approach basically presumes that we are only going to count
> events per-table and per-database --- I am thinking that the
> background stats collector process won't even keep track of
> individual backends anymore. (So, we'll fix the old problem
> of loss of backend-exit messages resulting in bogus displays.)

Right. As I see you have now implemented ;-)

/Magnus
> * reader's read starts before and ends after writer's update: reader
> will certainly note a change in update counter.
>
> * reader's read starts before and ends within writer's update: reader
> will note a change in update counter.
>
> * reader's read starts within and ends after writer's update: reader
> will note a change in update counter.
>
> * reader's read starts within and ends within writer's update: reader
> will see update counter as odd.
>
> Am I missing anything?

The only remaining concern would be the possibility of the reader thrashing because the writer is updating so often that the reader never gets the same counter twice. IIRC, the reader was only sampling, not trying to catch every entry, so that will help. But is it enough?

Regards,
Paul Bort
On Sun, Jun 18, 2006 at 05:26:16PM -0400, Tom Lane wrote:
> I wrote:
> > PFC <lists@peufeu.com> writes:
> >> So, the proposal :
> >> On executing a command, Backend stores the command string, then
> >> overwrites the counter with (counter + 1) and with the timestamp of
> >> command start.
> >> Periodically, like every N seconds, a separate process reads the counter,
> >> then reads the data, then reads the counter again.
>
> > BTW, I think the writer would actually need to bump the counter twice,
> > once before and once after it modifies its stats area. Else there's
> > no way to detect that you've copied a partially-updated stats entry.
>
> Actually, neither of these ideas works: it's possible that the reader
> copies the entry between the two increments of the counter. Then, it
> won't see any reason to re-read, but nonetheless it has copied an
> inconsistent partially-modified entry.
>
> Anyone know a variant of this that really works?
>
> 			regards, tom lane

For a single writer, the algorithm is the latch routine that I have described previously. If you call an odd value of the counter the latch, then it would be:

Update:
1. Set the counter to an odd number.
2. Update the content.
3. Increment the counter, which releases the latch.

Read:
1. Copy the value of the counter into a register.
2. Read the content.
3. Read the counter. If it is odd, go to 1.
4. If the counter is different from that in the register, go to 1.

For multiple writers, you will need an atomic action to set the "latch".

Ken
On Sun, Jun 18, 2006 at 11:07:41PM -0400, Tom Lane wrote:
> "Bort, Paul" <pbort@tmwsystems.com> writes:
> >> Anyone know a variant of this that really works?
>
> > Here's a theory: If the counter is bumped to an odd number before
> > modification, and an even number after it's done, then the reader will
> > know it needs to re-read if the counter is an odd number.
>
> Great minds think alike ;-) ... I just committed exactly that protocol.
> I believe it is correct, because AFAICS there are only four possible
> risk cases:
>
> * reader's read starts before and ends after writer's update: reader
> will certainly note a change in update counter.
>
> * reader's read starts before and ends within writer's update: reader
> will note a change in update counter.
>
> * reader's read starts within and ends after writer's update: reader
> will note a change in update counter.
>
> * reader's read starts within and ends within writer's update: reader
> will see update counter as odd.
>
> Am I missing anything?
>
> Note in particular that this protocol does not assume atomic update
> of the counter, so we don't need to worry about whether int is
> sig_atomic_t. If any of the bytes change, we have what we need.
> We could use a counter narrower than int, but then there's some tiny
> risk that the counter could wrap all the way around while the reader
> is blocked.
>
> 			regards, tom lane

The reader needs to read the counter/version at the beginning of the read and then again after the read. If the value is the same, it can assume that the read/data is okay. I am assuming that there is only one possible writer, otherwise the writer will need to lock the area from other writers. This can be done with an atomic instruction such as compare-and-swap or test-and-set.

Ken Marshall