Thread: Performance monitor signal handler
I was going to implement the signal handler like we do with Cancel, where
the signal sets a flag and we check the status of the flag in various
_safe_ places.

Can anyone think of a better way to get information out of a backend?

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us
* Bruce Momjian <pgman@candle.pha.pa.us> [010312 12:12] wrote:
> I was going to implement the signal handler like we do with Cancel,
> where the signal sets a flag and we check the status of the flag in
> various _safe_ places.
>
> Can anyone think of a better way to get information out of a backend?

Why not use a static area of the shared memory segment?  Is it possible
to have a spinlock over it so that an external utility can take a snapshot
of it with the spinlock held?

Also, this could work for other stuff as well; instead of overloading a
lot of signal handlers one could just periodically poll a region of the
shared segment.

just some ideas..

-- Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
At 13:34 12/03/01 -0800, Alfred Perlstein wrote:
> Is it possible to have a spinlock over it so that an external utility
> can take a snapshot of it with the spinlock held?

I'd suggest that locking the stats area might be a bad idea; there is only
one writer for each backend-specific chunk, and it won't matter a hell of a
lot if a reader gets inconsistent views (since I assume they will be
re-reading every second or so).  All the stats area should contain would be
a bunch of counters with timestamps, I think, and the cost of writing to it
should be kept to an absolute minimum.

> just some ideas..

Unfortunately, based on prior discussions, Bruce seems quite opposed to a
shared memory solution.

-- Philip Warner  |  Albatross Consulting Pty. Ltd.  |  http://www.rhyme.com.au
* Philip Warner <pjw@rhyme.com.au> [010312 18:56] wrote:
> At 13:34 12/03/01 -0800, Alfred Perlstein wrote:
> > Is it possible to have a spinlock over it so that an external utility
> > can take a snapshot of it with the spinlock held?
>
> I'd suggest that locking the stats area might be a bad idea; there is only
> one writer for each backend-specific chunk, and it won't matter a hell of a
> lot if a reader gets inconsistent views (since I assume they will be
> re-reading every second or so).  All the stats area should contain would be
> a bunch of counters with timestamps, I think, and the cost of writing to it
> should be kept to an absolute minimum.
>
> > just some ideas..
>
> Unfortunately, based on prior discussions, Bruce seems quite opposed to a
> shared memory solution.

Ok, here's another nifty idea.

On receipt of the info signal, the backends collaborate to piece
together a status file.  The status file is given a temporary name.
When complete, the status file is rename(2)'d over a well-known file.

This ought to always give a consistent snapshot of the file to
whomever opens it.

-- Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
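A minimal sketch of the rename(2) snapshot trick being proposed, assuming
invented file names and a made-up stats format:

    /* Minimal sketch of the rename(2) snapshot idea.  File names and the
     * stats format are invented for illustration only. */
    #include <stdio.h>

    int
    write_stats_snapshot(void)
    {
        const char *tmpname   = "pg_stats.tmp";  /* hypothetical temp name   */
        const char *wellknown = "pg_stats";      /* hypothetical public name */
        FILE       *fp;

        if ((fp = fopen(tmpname, "w")) == NULL)
            return -1;

        /* ... each backend appends its counters here ... */
        fprintf(fp, "backend %d: blocks_read %ld\n", 1, 42L);

        if (fclose(fp) != 0)
            return -1;

        /*
         * rename(2) is atomic: anyone opening "pg_stats" sees either the
         * old complete file or the new complete file, never a half-written
         * one.
         */
        if (rename(tmpname, wellknown) != 0)
            return -1;

        return 0;
    }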
> > I think Tom has previously stated that there are technical reasons not to
> > do IO in signal handlers, and I have philosophical problems with
> > performance monitors that ask 50 backends to do file IO.  I really do think
> > shared memory is TWTG.
>
> I wasn't really suggesting any of those courses of action; all I
> suggested was using rename(2) to give a separate application a
> consistent snapshot of the stats.
>
> Actually, what makes the most sense (although it may be a performance
> killer) is to have the backends update a system table that the external
> app can query.

Yes, it seems storing info in shared memory and having a system table to
access it is the way to go.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us
> > This ought to always give a consistent snapshot of the file to
> > whomever opens it.
>
> I think Tom has previously stated that there are technical reasons not to
> do IO in signal handlers, and I have philosophical problems with
> performance monitors that ask 50 backends to do file IO.  I really do think
> shared memory is TWTG.

The good news is that right now pgmonitor gets all its information from
'ps', and only shows the query when the user asks for it.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us
> This ought to always give a consistent snapshot of the file to
> whomever opens it.

I think Tom has previously stated that there are technical reasons not to
do IO in signal handlers, and I have philosophical problems with
performance monitors that ask 50 backends to do file IO.  I really do think
shared memory is TWTG.

-- Philip Warner  |  Albatross Consulting Pty. Ltd.  |  http://www.rhyme.com.au
* Philip Warner <pjw@rhyme.com.au> [010313 06:42] wrote:
> > This ought to always give a consistent snapshot of the file to
> > whomever opens it.
>
> I think Tom has previously stated that there are technical reasons not to
> do IO in signal handlers, and I have philosophical problems with
> performance monitors that ask 50 backends to do file IO.  I really do think
> shared memory is TWTG.

I wasn't really suggesting any of those courses of action; all I
suggested was using rename(2) to give a separate application a
consistent snapshot of the stats.

Actually, what makes the most sense (although it may be a performance
killer) is to have the backends update a system table that the external
app can query.

-- Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
> At 13:34 12/03/01 -0800, Alfred Perlstein wrote:
> > Is it possible to have a spinlock over it so that an external utility
> > can take a snapshot of it with the spinlock held?
>
> I'd suggest that locking the stats area might be a bad idea; there is only
> one writer for each backend-specific chunk, and it won't matter a hell of a
> lot if a reader gets inconsistent views (since I assume they will be
> re-reading every second or so).  All the stats area should contain would be
> a bunch of counters with timestamps, I think, and the cost of writing to it
> should be kept to an absolute minimum.
>
> > just some ideas..
>
> Unfortunately, based on prior discussions, Bruce seems quite opposed to a
> shared memory solution.

No, I like the shared memory idea.  It is just that, first, such an idea
will have to wait for 7.2, and second, there are limits to how much shared
memory I can use.  Eventually, I think shared memory will be the way to go.

-- Bruce Momjian <pgman@candle.pha.pa.us>  http://candle.pha.pa.us
> On receipt of the info signal, the backends collaborate to piece
> together a status file.  The status file is given a temporary name.
> When complete, the status file is rename(2)'d over a well-known
> file.

Reporting to files, particularly well-known ones, could lead to race
conditions.

All in all, I think you're better off passing messages through pipes or a
similar communication method.

I really liked the idea of a "server" that could parse/analyze data from
multiple backends.

My 2/100 worth...
* Thomas Swan <tswan-lst@ics.olemiss.edu> [010313 13:37] wrote:
> > On receipt of the info signal, the backends collaborate to piece
> > together a status file.  The status file is given a temporary name.
> > When complete, the status file is rename(2)'d over a well-known
> > file.
>
> Reporting to files, particularly well-known ones, could lead to race
> conditions.
>
> All in all, I think you're better off passing messages through pipes or a
> similar communication method.
>
> I really liked the idea of a "server" that could parse/analyze data from
> multiple backends.
>
> My 2/100 worth...

Take a few moments to think about the semantics of rename(2).

Yes, you would still need synchronization between the backend processes
to do this correctly, but not in any external app.  The external app can
just open the file; assuming it exists, it will always have a complete
and consistent snapshot of whatever the backends agreed on.

-- Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Bruce Momjian wrote:
> Yes, it seems storing info in shared memory and having a system table to
> access it is the way to go.

Depends; first of all we need to know WHAT we want to collect.

If we talk about block read/write statistics and such on a per-table
basis, which is IMHO the most accurate thing for tuning purposes, then
we're perhaps talking about an infinite size of shared memory.

And shared memory has all the interlocking problems we want to avoid.

What about a collector daemon, fired up by the postmaster and receiving
UDP packets from the backends?  Under heavy load, it might miss some
statistics messages, but that's not as bad as having locks causing
backends to lose performance.

The postmaster could already provide the UDP socket for the backends, so
the collector can know the peer address from which to accept statistics
messages only.  Any message from another peer address is simply ignored.
For getting the statistics out of it, the collector has his own server
socket, using TCP and providing some lookup protocol.

Now whatever the backend has to tell the collector, it simply throws a
UDP packet in his direction.  Whether the collector can catch it or not
is not the backend's problem.

Jan

-- Jan Wieck <JanWieck@Yahoo.com>
Jan Wieck <janwieck@Yahoo.com> writes:
> What about a collector daemon, fired up by the postmaster and
> receiving UDP packets from the backends?  Under heavy load, it
> might miss some statistics messages, but that's not as bad as
> having locks causing backends to lose performance.

Interesting thought, but we don't want UDP I think; that just opens
up a whole can of worms about checking access permissions and so forth.
Why not a simple pipe?  The postmaster creates the pipe and the
collector daemon inherits one end, while all the backends inherit the
other end.

regards, tom lane
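A minimal sketch of the pipe layout Tom describes, assuming the collector
is simply another child of the postmaster; error handling and the real
message format are omitted:

    /* Sketch of the pipe layout: the postmaster creates the pipe once,
     * the collector child keeps the read end, every backend child keeps
     * the write end. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    int
    main(void)
    {
        int     pipefd[2];
        pid_t   pid;

        if (pipe(pipefd) < 0)
            exit(1);

        if ((pid = fork()) == 0)
        {
            /* collector: read messages forever */
            char    buf[1024];
            ssize_t n;

            close(pipefd[1]);
            while ((n = read(pipefd[0], buf, sizeof(buf))) > 0)
                ;                       /* accumulate statistics here */
            exit(0);
        }

        /* postmaster / backends: keep only the write end */
        close(pipefd[0]);
        write(pipefd[1], "stats message\n", 14);

        return 0;
    }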
Tom Lane wrote:
> Jan Wieck <janwieck@Yahoo.com> writes:
> > What about a collector daemon, fired up by the postmaster and
> > receiving UDP packets from the backends?  Under heavy load, it
> > might miss some statistics messages, but that's not as bad as
> > having locks causing backends to lose performance.
>
> Interesting thought, but we don't want UDP I think; that just opens
> up a whole can of worms about checking access permissions and so forth.
> Why not a simple pipe?  The postmaster creates the pipe and the
> collector daemon inherits one end, while all the backends inherit the
> other end.

I don't think so - though I haven't tested the following yet, AFAIR it's
correct.

Have the postmaster create two UDP sockets before it forks off the
collector.  It can examine the peer addresses of both, so they don't need
well-known port numbers; they can be the random ones assigned by the
kernel.  Thus, we don't need SO_REUSE on them either.

Now, since the collector is forked off by the postmaster, it knows the
peer address of the other socket.  And since all backends get forked off
from the postmaster as well, they'll all use the same peer address, won't
they?

So all the collector has to look at is the sender address, including port
number, of the packets.  It needs to be what the postmaster examined;
anything else is from someone else and goes to bit heaven.  The same way,
the backends know where to send their statistics.

If I'm right that in the case of fork() all children share the same
socket with the same peer address, then it's even safe in the case the
collector dies.  The postmaster can still hold the collector's socket and
will notice that the collector died (due to a wait() returning its PID)
and can fire up another one.  Again some packets got lost (plus all the
statistics collected so far - hmmm, ain't that a cool way to reset
statistics counters, killing the collector?), but it did not disturb any
live backend in any way.  They will never get any signal and don't care
about what's done with their statistics and such.  They just do their
work...

Jan

-- Jan Wieck <JanWieck@Yahoo.com>
At 06:57 15/03/01 -0500, Jan Wieck wrote:
> And shared memory has all the interlocking problems we want
> to avoid.

I suspect that if we keep per-backend data in a separate area, then we
don't need locking, since there is only one writer.  It does not matter if
a reader gets an inconsistent view, the same as if you drop a few UDP
packets.

> What about a collector daemon, fired up by the postmaster and
> receiving UDP packets from the backends?

This does sound appealing; it means that individual backend data (IO etc)
will survive past the termination of the backend.  I'd like to see the
stats survive the death of the collector if possible, possibly even
survive a stop/start of the postmaster.

> Now whatever the backend has to tell the collector, it simply
> throws a UDP packet in his direction.  Whether the collector can
> catch it or not is not the backend's problem.

If we get the backends to keep the stats they are sending in local
counters as well, then they can send the counter value (not delta) each
time, which would mean that the collector would not 'miss' anything -
just its operations/sec might see a hiccough.  This could have a side
benefit that (if we wanted to?) we could allow a client to query their
own counters to get an idea of the costs of their queries.

When we need to reset the counters, that should be done explicitly, I
think.

-- Philip Warner  |  Albatross Consulting Pty. Ltd.  |  http://www.rhyme.com.au
* Philip Warner <pjw@rhyme.com.au> [010315 16:14] wrote:
> At 06:57 15/03/01 -0500, Jan Wieck wrote:
> > And shared memory has all the interlocking problems we want
> > to avoid.
>
> I suspect that if we keep per-backend data in a separate area, then we
> don't need locking, since there is only one writer.  It does not matter if
> a reader gets an inconsistent view, the same as if you drop a few UDP
> packets.

No, this is completely different.

Lost data is probably better than incorrect data.  Either use locks
or a copying mechanism.  People will depend on the data returned
making sense.

-- Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
At 16:17 15/03/01 -0800, Alfred Perlstein wrote:
> Lost data is probably better than incorrect data.  Either use locks
> or a copying mechanism.  People will depend on the data returned
> making sense.

But with per-backend data, there is only ever *one* writer to a given set
of counters.  Everyone else is a reader.

-- Philip Warner  |  Albatross Consulting Pty. Ltd.  |  http://www.rhyme.com.au
* Philip Warner <pjw@rhyme.com.au> [010315 16:46] wrote:
> At 16:17 15/03/01 -0800, Alfred Perlstein wrote:
> > Lost data is probably better than incorrect data.  Either use locks
> > or a copying mechanism.  People will depend on the data returned
> > making sense.
>
> But with per-backend data, there is only ever *one* writer to a given set
> of counters.  Everyone else is a reader.

This doesn't prevent a reader from getting an inconsistent view.

Think about a 64-bit counter on a 32-bit machine.  If you charged per
megabyte, wouldn't it upset you to have a small chance of losing
4 billion units of sale?

(i.e., doing a read after an addition that wraps the low 32 bits
but before the carry is done to the most significant 32 bits?)

OK, what if everything can be read atomically by itself?

You're still busted the minute you need to export any sort of
compound stat.  If A, B and C need to add up to 100, you have a read
race.

-- Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
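To make the torn-read hazard concrete, here is a small illustration,
assuming a 64-bit counter stored as two 32-bit words; the problematic
interleaving is described in comments rather than reproduced with real
concurrency:

    /* Illustration of the torn-read hazard: a 64-bit counter kept as two
     * 32-bit words, updated by one writer and read by another process
     * without locking. */
    #include <stdint.h>
    #include <stdio.h>

    struct counter64
    {
        volatile uint32_t lo;
        volatile uint32_t hi;
    };

    static void
    add_one(struct counter64 *c)
    {
        uint32_t old_lo = c->lo;

        c->lo = old_lo + 1;
        /* <-- a reader running here sees hi still un-carried:            */
        /*     old value 0x00000000FFFFFFFF has just become 0x...00000000, */
        /*     so the reader computes ~4 billion less than the truth.      */
        if (c->lo < old_lo)
            c->hi++;                /* carry into the high word */
    }

    static uint64_t
    read_counter(const struct counter64 *c)
    {
        /* Two separate loads: not atomic as a pair. */
        return ((uint64_t) c->hi << 32) | c->lo;
    }

    int
    main(void)
    {
        struct counter64 c = {0xFFFFFFFFu, 0};

        add_one(&c);
        printf("counter = %llu\n", (unsigned long long) read_counter(&c));
        return 0;
    }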
* Philip Warner <pjw@rhyme.com.au> [010315 17:08] wrote:
> At 16:55 15/03/01 -0800, Alfred Perlstein wrote:
> > * Philip Warner <pjw@rhyme.com.au> [010315 16:46] wrote:
> > > At 16:17 15/03/01 -0800, Alfred Perlstein wrote:
> > > > Lost data is probably better than incorrect data.  Either use locks
> > > > or a copying mechanism.  People will depend on the data returned
> > > > making sense.
> > >
> > > But with per-backend data, there is only ever *one* writer to a given set
> > > of counters.  Everyone else is a reader.
> >
> > This doesn't prevent a reader from getting an inconsistent view.
> >
> > Think about a 64-bit counter on a 32-bit machine.  If you charged per
> > megabyte, wouldn't it upset you to have a small chance of losing
> > 4 billion units of sale?
> >
> > (i.e., doing a read after an addition that wraps the low 32 bits
> > but before the carry is done to the most significant 32 bits?)
>
> I assume this means we can not rely on the existence of any kind of
> interlocked add on 64-bit machines?
>
> > OK, what if everything can be read atomically by itself?
> >
> > You're still busted the minute you need to export any sort of
> > compound stat.
>
> Which is why the backends should not do anything other than maintain the
> raw data.  If there is atomic data that can cause inconsistency, then a
> dropped UDP packet will do the same.

The UDP packet (a COPY) can contain a consistent snapshot of the data.
If you have dependencies, you fit a consistent snapshot into a single
packet.

-- Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
At 16:55 15/03/01 -0800, Alfred Perlstein wrote:
> * Philip Warner <pjw@rhyme.com.au> [010315 16:46] wrote:
> > At 16:17 15/03/01 -0800, Alfred Perlstein wrote:
> > > Lost data is probably better than incorrect data.  Either use locks
> > > or a copying mechanism.  People will depend on the data returned
> > > making sense.
> >
> > But with per-backend data, there is only ever *one* writer to a given set
> > of counters.  Everyone else is a reader.
>
> This doesn't prevent a reader from getting an inconsistent view.
>
> Think about a 64-bit counter on a 32-bit machine.  If you charged per
> megabyte, wouldn't it upset you to have a small chance of losing
> 4 billion units of sale?
>
> (i.e., doing a read after an addition that wraps the low 32 bits
> but before the carry is done to the most significant 32 bits?)

I assume this means we can not rely on the existence of any kind of
interlocked add on 64-bit machines?

> OK, what if everything can be read atomically by itself?
>
> You're still busted the minute you need to export any sort of
> compound stat.

Which is why the backends should not do anything other than maintain the
raw data.  If there is atomic data that can cause inconsistency, then a
dropped UDP packet will do the same.

-- Philip Warner  |  Albatross Consulting Pty. Ltd.  |  http://www.rhyme.com.au
Philip Warner wrote:
> But I prefer the UDP/Collector model anyway; it gives us greater
> flexibility + the ability to keep stats past backend termination, and, as
> you say, removes any possible locking requirements from the backends.

OK, did some tests...

The postmaster can create a SOCK_DGRAM socket at startup and bind(2) it
to "127.0.0.1:0", which causes the kernel to assign a non-privileged port
number that can then be read with getsockname(2).  No other process can
have a socket with the same port number for the lifetime of the
postmaster.

If the socket gets ready, it'll read one backend message from it with
recvfrom(2).  The fromaddr must be "127.0.0.1:xxx" where xxx is the port
number the kernel assigned to the above socket.  Yes, this is his own
one, shared with the postmaster and all backends.  So both the postmaster
and the backends can use this one UDP socket, which the backends inherit
on fork(2), to send messages to the collector.  If such a UDP packet
really came from a process other than the postmaster or a backend, well,
then the sysadmin has a more severe problem than manipulated DB runtime
statistics :-)

Running a 500MHz P-III, 192MB, RedHat 6.1, Linux 2.2.17 here, I've not
lost a single message during the parallel regression test, if each
backend sends one 1K-sized message per query executed and the collector
simply sucks them out of the socket.  Message losses start if the
collector does a per-message idle loop like this:

    for (i=0,sum=0;i<250000;i++,sum+=1);

Uh - not much time to spend if the statistics should at least be half
accurate.  And it would become worse on SMP systems.  So that was a nifty
idea, but I think it'd cause much more statistics loss than I assumed at
first.

Back to the drawing board.  Maybe a SYS-V message queue can serve?

Jan

-- Jan Wieck <JanWieck@Yahoo.com>
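A minimal sketch of the socket mechanics Jan describes (bind to 127.0.0.1
port 0, learn the kernel-assigned port with getsockname(2), and discard
datagrams whose sender is not that same address); the message text is
invented and error handling is omitted:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int
    main(void)
    {
        int                 sock;
        struct sockaddr_in  self,
                            from;
        socklen_t           len;
        char                buf[1024];
        const char          msg[] = "tblscan oid=1259 count=1";  /* made up */
        ssize_t             n;

        sock = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&self, 0, sizeof(self));
        self.sin_family = AF_INET;
        self.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        self.sin_port = 0;                       /* kernel assigns a port */
        bind(sock, (struct sockaddr *) &self, sizeof(self));

        len = sizeof(self);
        getsockname(sock, (struct sockaddr *) &self, &len);  /* learn port */

        /* a backend would do this on the same (inherited) socket */
        sendto(sock, msg, sizeof(msg), 0,
               (struct sockaddr *) &self, sizeof(self));

        /* the collector side: read one message and verify the sender */
        len = sizeof(from);
        n = recvfrom(sock, buf, sizeof(buf), 0,
                     (struct sockaddr *) &from, &len);
        if (n > 0 &&
            from.sin_addr.s_addr == self.sin_addr.s_addr &&
            from.sin_port == self.sin_port)
            printf("accepted %zd-byte message: %s\n", n, buf);

        close(sock);
        return 0;
    }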
* Jan Wieck <JanWieck@yahoo.com> [010316 08:08] wrote:
> Philip Warner wrote:
> > But I prefer the UDP/Collector model anyway; it gives us greater
> > flexibility + the ability to keep stats past backend termination, and, as
> > you say, removes any possible locking requirements from the backends.
>
> OK, did some tests...
>
> The postmaster can create a SOCK_DGRAM socket at startup and
> bind(2) it to "127.0.0.1:0", which causes the kernel to assign
> a non-privileged port number that can then be read with
> getsockname(2).  No other process can have a socket with the
> same port number for the lifetime of the postmaster.
>
> If the socket gets ready, it'll read one backend message
> from it with recvfrom(2).  The fromaddr must be
> "127.0.0.1:xxx" where xxx is the port number the kernel
> assigned to the above socket.  Yes, this is his own one,
> shared with the postmaster and all backends.  So both the
> postmaster and the backends can use this one UDP socket,
> which the backends inherit on fork(2), to send messages to
> the collector.  If such a UDP packet really came from a
> process other than the postmaster or a backend, well, then the
> sysadmin has a more severe problem than manipulated DB
> runtime statistics :-)

Doing this is a bad idea:

a) it allows any program to start spamming localhost:randport with
   messages and screw with the postmaster.

b) it may even allow remote people to mess with it (see recent
   bugtraq articles about this).

You should use a unix domain socket (at least when possible).

> Running a 500MHz P-III, 192MB, RedHat 6.1, Linux 2.2.17 here, I've not
> lost a single message during the parallel regression test, if each
> backend sends one 1K-sized message per query executed and the collector
> simply sucks them out of the socket.  Message losses start if the
> collector does a per-message idle loop like this:
>
>     for (i=0,sum=0;i<250000;i++,sum+=1);
>
> Uh - not much time to spend if the statistics should at least be half
> accurate.  And it would become worse on SMP systems.  So that was a
> nifty idea, but I think it'd cause much more statistics loss than I
> assumed at first.
>
> Back to the drawing board.  Maybe a SYS-V message queue can serve?

I wouldn't say back to the drawing board, I would say two steps back.

What about instead of sending deltas, you send totals?  This would
allow you to lose messages and still maintain accurate stats.

You can also enable SIGIO on the socket, then have a signal handler
buffer packets that arrive when not actively select()ing on the
UDP socket.  You can then use sigsetmask(2) to provide mutual
exclusion with your SIGIO handler and general select()ing on the
socket.

-- Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
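A sketch of the SIGIO buffering Alfred suggests, using the POSIX
sigprocmask(2) in place of the older sigsetmask(2); queue sizes, names and
O_ASYNC availability are assumptions of this sketch, not a tested design:

    #include <fcntl.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define QMAX 128

    static int      udp_sock;
    static char     queue[QMAX][1024];
    static ssize_t  queue_len[QMAX];
    static volatile sig_atomic_t queued = 0;

    static void
    sigio_handler(int signo)
    {
        (void) signo;
        /* drain everything currently waiting; O_NONBLOCK makes recv
         * return immediately once the socket is empty */
        while (queued < QMAX)
        {
            ssize_t n = recv(udp_sock, queue[queued], sizeof(queue[0]), 0);

            if (n <= 0)
                break;
            queue_len[queued++] = n;
        }
    }

    void
    setup_async_socket(int sock)
    {
        struct sigaction sa;
        int              flags;

        udp_sock = sock;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = sigio_handler;
        sigaction(SIGIO, &sa, NULL);

        fcntl(sock, F_SETOWN, getpid());            /* deliver SIGIO to us */
        flags = fcntl(sock, F_GETFL, 0);
        fcntl(sock, F_SETFL, flags | O_ASYNC | O_NONBLOCK);
    }

    void
    process_queued_messages(void)
    {
        sigset_t    block,
                    old;
        int         i;

        sigemptyset(&block);
        sigaddset(&block, SIGIO);
        sigprocmask(SIG_BLOCK, &block, &old);   /* exclusion vs. handler */

        for (i = 0; i < queued; i++)
            ;                               /* fold queue[i] into the stats */
        queued = 0;

        sigprocmask(SIG_SETMASK, &old, NULL);
    }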
Jan Wieck <JanWieck@Yahoo.com> writes:
> Uh - not much time to spend if the statistics should at least be half
> accurate.  And it would become worse on SMP systems.  So that was a
> nifty idea, but I think it'd cause much more statistics loss than I
> assumed at first.

> Back to the drawing board.  Maybe a SYS-V message queue can serve?

That would be the same as a pipe: backends would block if the collector
stopped accepting data.  I do like the "auto discard" aspect of this
UDP-socket approach.

I think Philip had the right idea: each backend should send totals,
not deltas, in its messages.  Then, it doesn't matter (much) if the
collector loses some messages --- that just means that sometimes it
has a slightly out-of-date idea about how much work some backends have
done.  It should be easy to design the software so that that just makes
a small, transient error in the currently displayed statistics.

regards, tom lane
At 17:10 15/03/01 -0800, Alfred Perlstein wrote:
> > Which is why the backends should not do anything other than maintain the
> > raw data.  If there is atomic data that can cause inconsistency, then a
> > dropped UDP packet will do the same.
>
> The UDP packet (a COPY) can contain a consistent snapshot of the data.
> If you have dependencies, you fit a consistent snapshot into a single
> packet.

If we were going to go the shared memory way, then yes, as soon as we
start collecting dependent data we would need locking, but IOs, locking
stats, flushes, cache hits/misses are not really in this category.

But I prefer the UDP/Collector model anyway; it gives us greater
flexibility + the ability to keep stats past backend termination, and, as
you say, removes any possible locking requirements from the backends.

-- Philip Warner  |  Albatross Consulting Pty. Ltd.  |  http://www.rhyme.com.au
Alfred Perlstein wrote:
> * Jan Wieck <JanWieck@yahoo.com> [010316 08:08] wrote:
> > [...]
> > If the socket gets ready, it'll read one backend message
> > from it with recvfrom(2).  The fromaddr must be
> > "127.0.0.1:xxx" where xxx is the port number the kernel
> > assigned to the above socket.  Yes, this is his own one,
> > shared with the postmaster and all backends.  So both the
> > postmaster and the backends can use this one UDP socket,
> > which the backends inherit on fork(2), to send messages to
> > the collector.  If such a UDP packet really came from a
> > process other than the postmaster or a backend, well, then the
> > sysadmin has a more severe problem than manipulated DB
> > runtime statistics :-)
>
> Doing this is a bad idea:
>
> a) it allows any program to start spamming localhost:randport with
>    messages and screw with the postmaster.
>
> b) it may even allow remote people to mess with it (see recent
>    bugtraq articles about this).

So it's possible for a UDP socket to recvfrom(2) and get packets with a
fromaddr of localhost:my_own_non_SO_REUSE_port that really came from
somewhere else?

If that's possible, the packets must be coming over the network.
Otherwise it's the local superuser sending them, and in that case it's
not worth any more discussion, because root on your system has more
powerful possibilities to muck around with your database.  And if someone
outside the local system is doing it, it's time for some filter rules,
isn't it?

> You should use a unix domain socket (at least when possible).

Unix domain UDP?

> [...]
> What about instead of sending deltas, you send totals?  This would
> allow you to lose messages and still maintain accurate stats.

Similar problem as with shared memory - size.  If a long-running backend
of a multi-thousand-table database needs to send access stats per table -
and had accessed them all up to now - it'll be a lot of wasted bandwidth.

> You can also enable SIGIO on the socket, then have a signal handler
> buffer packets that arrive when not actively select()ing on the
> UDP socket.  You can then use sigsetmask(2) to provide mutual
> exclusion with your SIGIO handler and general select()ing on the
> socket.

I had already thought of prioritizing the socket drain this way: there is
a fairly big receive buffer.  If the buffer is empty, the collector does
a blocking select(2).  If it's not, it does a non-blocking (0-timeout)
one, and only if that tells that there aren't new messages waiting will
it process one buffered message and try to receive again.

Will give it a shot.

Jan

-- Jan Wieck <JanWieck@Yahoo.com>
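A sketch of the drain-first loop Jan outlines here, with placeholder
buffer sizes and message handling; arrival order is not preserved in this
simplified version:

    #include <sys/select.h>
    #include <sys/socket.h>

    #define MSGMAX   1024
    #define BUFSLOTS 4096

    static char    msgbuf[BUFSLOTS][MSGMAX];
    static ssize_t msglen[BUFSLOTS];
    static int     nbuffered = 0;

    void
    collector_loop(int sock)
    {
        for (;;)
        {
            fd_set          rfds;
            struct timeval  zero = {0, 0};
            int             ready;

            FD_ZERO(&rfds);
            FD_SET(sock, &rfds);

            if (nbuffered == 0)
                ready = select(sock + 1, &rfds, NULL, NULL, NULL);  /* block */
            else
                ready = select(sock + 1, &rfds, NULL, NULL, &zero); /* poll  */

            if (ready > 0 && nbuffered < BUFSLOTS)
            {
                /* socket first: get the packet out of the kernel buffer */
                msglen[nbuffered] = recv(sock, msgbuf[nbuffered], MSGMAX, 0);
                if (msglen[nbuffered] > 0)
                    nbuffered++;
            }
            else if (nbuffered > 0)
            {
                /* nothing waiting: spend time on one buffered message */
                nbuffered--;
                /* ... fold msgbuf[nbuffered] into the statistics ... */
            }
        }
    }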
Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
> > Uh - not much time to spend if the statistics should at least be half
> > accurate.  And it would become worse on SMP systems.  So that was a
> > nifty idea, but I think it'd cause much more statistics loss than I
> > assumed at first.
>
> > Back to the drawing board.  Maybe a SYS-V message queue can serve?
>
> That would be the same as a pipe: backends would block if the collector
> stopped accepting data.  I do like the "auto discard" aspect of this
> UDP-socket approach.

Does a pipe guarantee that a buffer, written with one atomic write(2),
can never get intermixed with other data on the reader's end?

I know that you know what I mean, but for the broader audience: let's
define a message to the collector to be 4-byte-len,len-bytes.  Now
hundreds of backends hammer messages into the (shared) writing end of the
pipe, all with different sizes.  Is it GUARANTEED that a
read(4 bytes), read(n bytes) sequence will always return one complete
message and never intermixed parts of different write(2)s?

With message queues, this is guaranteed.  Also, message queues would make
it easy to query the collected statistics (see below).

> I think Philip had the right idea: each backend should send totals,
> not deltas, in its messages.  Then, it doesn't matter (much) if the
> collector loses some messages --- that just means that sometimes it
> has a slightly out-of-date idea about how much work some backends have
> done.  It should be easy to design the software so that that just makes
> a small, transient error in the currently displayed statistics.

If we use two message queues (IPC_PRIVATE is enough here), one in the
collector direction and one in the backend direction, this'd be an easy
way to collect and query statistics.

The backends send delta stats messages to the collector on one queue.
Message queues block by default, but the backend could use IPC_NOWAIT and
just go on and collect up, as long as it finally uses a blocking call
before exiting.  We'll lose statistics for backends that go down in
flames (coredump), but who cares about statistics then?

To query statistics, we have a set of new builtin functions.  All
functions share a global statistics snapshot in the backend.  If on a
function call the snapshot doesn't exist or was generated by another
XACT/command counter, the backend sends a statistics request for his
database ID to the collector and waits for the messages to arrive on the
second message queue.  It can pick up the messages meant for him via the
message type, which is equal to his backend number + 1, because the
collector will send 'em as such.  For table access stats, for example,
the snapshot will have slots identified by the table's OID, so a function
pg_get_tables_seqscan_count(oid) should be easy to implement.  And
setting up views that present access stats in readable format is a
no-brainer.

Now we have communication only between the backends and the collector.
And we're certain that only someone able to SELECT from a system view
will ever see this information.

Jan

-- Jan Wieck <JanWieck@Yahoo.com>
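A rough sketch of the two-queue scheme, with an invented message layout;
it only illustrates the msgsnd(2)/msgrcv(2) calls and the mtype filtering,
not a complete protocol:

    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>

    struct stats_msg
    {
        long    mtype;          /* 1 on the to-collector queue;
                                 * backend number + 1 on the reply queue */
        char    mtext[512];     /* whatever the stats payload looks like */
    };

    /* backend side: fire and forget */
    int
    send_delta(int to_collector, const char *payload)
    {
        struct stats_msg m;

        m.mtype = 1;
        strncpy(m.mtext, payload, sizeof(m.mtext) - 1);
        m.mtext[sizeof(m.mtext) - 1] = '\0';

        /* IPC_NOWAIT: if the queue is full we just skip this report */
        return msgsnd(to_collector, &m, sizeof(m.mtext), IPC_NOWAIT);
    }

    /* backend side: wait for the snapshot addressed to us */
    int
    receive_snapshot(int from_collector, int backend_no, struct stats_msg *out)
    {
        return (int) msgrcv(from_collector, out, sizeof(out->mtext),
                            backend_no + 1, 0);
    }

    /* postmaster side: both queues are private, inherited across fork */
    int
    create_queue(void)
    {
        return msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    }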
* Tom Lane <tgl@sss.pgh.pa.us> [010316 10:06] wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
> > Uh - not much time to spend if the statistics should at least be half
> > accurate.  And it would become worse on SMP systems.  So that was a
> > nifty idea, but I think it'd cause much more statistics loss than I
> > assumed at first.
>
> > Back to the drawing board.  Maybe a SYS-V message queue can serve?
>
> That would be the same as a pipe: backends would block if the collector
> stopped accepting data.  I do like the "auto discard" aspect of this
> UDP-socket approach.
>
> I think Philip had the right idea: each backend should send totals,
> not deltas, in its messages.  Then, it doesn't matter (much) if the
> collector loses some messages --- that just means that sometimes it
> has a slightly out-of-date idea about how much work some backends have
> done.  It should be easy to design the software so that that just makes
> a small, transient error in the currently displayed statistics.

MSGSND(3)           FreeBSD Library Functions Manual           MSGSND(3)

ERRORS
     msgsnd() will fail if:

     [EAGAIN]  There was no space for this message either on the queue,
               or in the whole system, and IPC_NOWAIT was set in msgflg.

-- Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Jan Wieck <JanWieck@Yahoo.com> writes:
> Does a pipe guarantee that a buffer, written with one atomic
> write(2), can never get intermixed with other data on the
> reader's end?

Yes.  The HPUX man page for write(2) sez:

     o  Write requests of {PIPE_BUF} bytes or less will not be
        interleaved with data from other processes doing writes on the
        same pipe.  Writes of greater than {PIPE_BUF} bytes may have
        data interleaved, on arbitrary boundaries, with writes by
        other processes, whether or not the O_NONBLOCK flag of the
        file status flags is set.

Stevens' _UNIX Network Programming_ (1990) states this is true for all
pipes (nameless or named) on all flavors of Unix, and furthermore states
that PIPE_BUF is at least 4K on all systems.  I don't have any relevant
Posix standards to look at, but I'm not worried about assuming this to
be true.

> With message queues, this is guaranteed.  Also, message queues
> would make it easy to query the collected statistics (see
> below).

I will STRONGLY object to any proposal that we use message queues.
We've already had enough problems with the ridiculously low kernel
limits that are commonly imposed on shmem and SysV semaphores.
We don't need to buy into that silliness yet again with message queues.
I don't believe they gain us anything over pipes anyway.

The real problem with either pipes or message queues is that backends
will block if the collector stops collecting data.  I don't think we
want that.  I suppose we could have the backends write a pipe with
O_NONBLOCK and ignore failure, however:

     o  If the O_NONBLOCK flag is set, write() requests will be
        handled differently, in the following ways:

        -  The write() function will not block the process.

        -  A write request for {PIPE_BUF} or fewer bytes will have
           the following effect: If there is sufficient space
           available in the pipe, write() will transfer all the data
           and return the number of bytes requested.  Otherwise,
           write() will transfer no data and return -1 with errno set
           to EAGAIN.

Since we already ignore SIGPIPE, we don't need to worry about losing the
collector entirely.

Now this would put a pretty tight time constraint on the collector:
fall more than 4K behind, you start losing data.  I am not sure if
a UDP socket would provide more buffering or not; anyone know?

regards, tom lane
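A minimal sketch of the non-blocking pipe write Tom describes, relying
only on the PIPE_BUF guarantee quoted above; the drop-on-EAGAIN policy is
an assumption of this sketch:

    #include <errno.h>
    #include <fcntl.h>
    #include <limits.h>
    #include <unistd.h>

    void
    make_write_end_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);

        fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* Returns 1 if the message was sent, 0 if it was dropped. */
    int
    send_stats(int fd, const void *msg, size_t len)
    {
        ssize_t rc;

        if (len > PIPE_BUF)
            return 0;               /* never risk an interleaved write */

        rc = write(fd, msg, len);
        if (rc < 0 && (errno == EAGAIN || errno == EPIPE))
            return 0;               /* collector busy or gone: just lose it */

        return rc == (ssize_t) len;
    }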
Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
> > Does a pipe guarantee that a buffer, written with one atomic
> > write(2), can never get intermixed with other data on the
> > reader's end?
>
> Yes.  The HPUX man page for write(2) sez:
>
>      o  Write requests of {PIPE_BUF} bytes or less will not be
>         interleaved with data from other processes doing writes on the
>         same pipe.  Writes of greater than {PIPE_BUF} bytes may have
>         data interleaved, on arbitrary boundaries, with writes by
>         other processes, whether or not the O_NONBLOCK flag of the
>         file status flags is set.
>
> Stevens' _UNIX Network Programming_ (1990) states this is true for all
> pipes (nameless or named) on all flavors of Unix, and furthermore states
> that PIPE_BUF is at least 4K on all systems.  I don't have any relevant
> Posix standards to look at, but I'm not worried about assuming this to
> be true.

That's good news - and maybe a Good Assumption (TM).

> > With message queues, this is guaranteed.  Also, message queues
> > would make it easy to query the collected statistics (see
> > below).
>
> I will STRONGLY object to any proposal that we use message queues.
> We've already had enough problems with the ridiculously low kernel
> limits that are commonly imposed on shmem and SysV semaphores.
> We don't need to buy into that silliness yet again with message queues.
> I don't believe they gain us anything over pipes anyway.

OK.

> The real problem with either pipes or message queues is that backends
> will block if the collector stops collecting data.  I don't think we
> want that.  I suppose we could have the backends write a pipe with
> O_NONBLOCK and ignore failure, however:
>
>      o  If the O_NONBLOCK flag is set, write() requests will be
>         handled differently, in the following ways:
>
>         -  The write() function will not block the process.
>
>         -  A write request for {PIPE_BUF} or fewer bytes will have
>            the following effect: If there is sufficient space
>            available in the pipe, write() will transfer all the data
>            and return the number of bytes requested.  Otherwise,
>            write() will transfer no data and return -1 with errno set
>            to EAGAIN.
>
> Since we already ignore SIGPIPE, we don't need to worry about losing the
> collector entirely.

That's not what the man page said.  It said that in the case you're
inside PIPE_BUF size and using O_NONBLOCK, you either send complete
messages or nothing, getting an EAGAIN then.  So we could do the same
here and write to the pipe.  In the case we cannot, just count up and try
again next year (or so).

> Now this would put a pretty tight time constraint on the collector:
> fall more than 4K behind, you start losing data.  I am not sure if
> a UDP socket would provide more buffering or not; anyone know?

Again, this ain't what the man page said.  "If there is sufficient space
available in the pipe", in combination with PIPE_BUF being at least 4K,
doesn't necessarily mean that the pipe's buffer space is only 4K.

Well, what I'm missing is the ability to filter out statistics reports
on the backend side via msgrcv(2)'s msgtype :-(

Jan

-- Jan Wieck <JanWieck@Yahoo.com>
Tom Lane wrote:
> Now this would put a pretty tight time constraint on the collector:
> fall more than 4K behind, you start losing data.  I am not sure if
> a UDP socket would provide more buffering or not; anyone know?

Looks like Linux has something around 16-32K of buffer space for UDP
sockets.  Just from eyeballing the fprintf(3) output of my destructively
hacked postleprechaun.

Jan

-- Jan Wieck <JanWieck@Yahoo.com>
Jan Wieck wrote:
> Tom Lane wrote:
> > Now this would put a pretty tight time constraint on the collector:
> > fall more than 4K behind, you start losing data.  I am not sure if
> > a UDP socket would provide more buffering or not; anyone know?
>
> Looks like Linux has something around 16-32K of buffer space
> for UDP sockets.  Just from eyeballing the fprintf(3) output
> of my destructively hacked postleprechaun.

Just to get some evidence at hand - could some owners of different
platforms compile and run the attached little C source please?

(The program tests how much data can be stuffed into a pipe or a Sys-V
message queue before the writer would block or get an EAGAIN error.)

My output on RedHat 6.1, Linux 2.2.17 is:

    Pipe buffer is 4096 bytes
    Sys-V message queue buffer is 16384 bytes

Seems Tom is (unfortunately) right.  The pipe blocks at 4K.

So a Sys-V message queue, with the ability to distribute messages from
the collector to individual backends with kernel support via "mtype", is
four times better here - at an as-yet-unestimated cost in complexity.
What does your system say?

I really never thought that Sys-V IPC is a good way to go at all.  I hate
its incompatibility with the select(2) system call and all these
OS/installation-dependent restrictions.  But I'm tempted to reevaluate it
"for this case".

Jan

-- Jan Wieck <JanWieck@Yahoo.com>
Attachment
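The attachment itself is not preserved in this archive.  Purely as an
illustration of what such a probe might look like (not Jan's actual
program), one could fill a pipe and a SysV queue with non-blocking writes
until EAGAIN; the byte counts are approximate since kernels account for
message overhead differently:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>
    #include <unistd.h>

    struct probe_msg
    {
        long mtype;
        char mtext[64];
    };

    int
    main(void)
    {
        int     pipefd[2];
        char    byte = 'x';
        long    total;
        int     qid;
        struct probe_msg m = {1, {0}};

        /* pipe: write single bytes until the kernel buffer is full */
        pipe(pipefd);
        fcntl(pipefd[1], F_SETFL, O_NONBLOCK);
        for (total = 0; write(pipefd[1], &byte, 1) == 1; total++)
            ;
        printf("Pipe buffer is %ld bytes\n", total);

        /* message queue: send small messages until EAGAIN */
        qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
        for (total = 0;
             msgsnd(qid, &m, sizeof(m.mtext), IPC_NOWAIT) == 0;
             total += sizeof(m.mtext))
            ;
        printf("Sys-V message queue buffer is %ld bytes\n", total);

        msgctl(qid, IPC_RMID, NULL);
        return 0;
    }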
Jan Wieck <JanWieck@yahoo.com> writes:
> Just to get some evidence at hand - could some owners of
> different platforms compile and run the attached little C
> source please?

HPUX 10.20:

    Pipe buffer is 8192 bytes
    Sys-V message queue buffer is 16384 bytes

regards, tom lane
> Just to get some evidence at hand - could some owners of
> different platforms compile and run the attached little C
> source please?

$ uname -srm
FreeBSD 4.1.1-STABLE
$ ./jan
Pipe buffer is 16384 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.5 alpha
$ ./jan
Pipe buffer is 4096 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.5_BETA2 i386
$ ./jan
Pipe buffer is 4096 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.4.2 i386
$ ./jan
Pipe buffer is 4096 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.4.1 sparc
$ ./jan
Pipe buffer is 4096 bytes
Bad system call (core dumped)      # no SysV IPC in running kernel

$ uname -srm
HP-UX B.11.11 9000/800
$ ./jan
Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes

$ uname -srm
HP-UX B.11.00 9000/813
$ ./jan
Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes

$ uname -srm
HP-UX B.10.20 9000/871
$ ./jan
Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes

HP-UX can also use STREAMS-based pipes if the kernel parameter
streampipes is set.  Using STREAMS-based pipes increases the pipe buffer
size by a lot:

# uname -srm
HP-UX B.11.11 9000/800
# ./jan
Pipe buffer is 131072 bytes
Sys-V message queue buffer is 16384 bytes

# uname -srm
HP-UX B.11.00 9000/800
# ./jan
Pipe buffer is 131072 bytes
Sys-V message queue buffer is 16384 bytes

Regards, Giles
* Jan Wieck <JanWieck@Yahoo.com> [010316 16:35]:
> Just to get some evidence at hand - could some owners of different
> platforms compile and run the attached little C source please?
>
> (The program tests how much data can be stuffed into a pipe or a Sys-V
> message queue before the writer would block or get an EAGAIN error.)
>
> My output on RedHat 6.1, Linux 2.2.17 is:
>
>     Pipe buffer is 4096 bytes
>     Sys-V message queue buffer is 16384 bytes
>
> Seems Tom is (unfortunately) right.  The pipe blocks at 4K.
>
> So a Sys-V message queue, with the ability to distribute messages from
> the collector to individual backends with kernel support via "mtype",
> is four times better here - at an as-yet-unestimated cost in
> complexity.  What does your system say?
>
> I really never thought that Sys-V IPC is a good way to go at all.  I
> hate its incompatibility with the select(2) system call and all these
> OS/installation-dependent restrictions.  But I'm tempted to reevaluate
> it "for this case".

$ ./queuetest
Pipe buffer is 32768 bytes
Sys-V message queue buffer is 4096 bytes
$ uname -a
UnixWare lerami 5 7.1.1 i386 x86at SCO UNIX_SVR5
$

I think some of these are configurable...

-- Larry Rosenman <ler@lerctr.org>  http://www.lerctr.org/~ler
* Larry Rosenman <ler@lerctr.org> [010316 20:47]:
> * Jan Wieck <JanWieck@Yahoo.com> [010316 16:35]:
> $ ./queuetest
> Pipe buffer is 32768 bytes
> Sys-V message queue buffer is 4096 bytes
> $ uname -a
> UnixWare lerami 5 7.1.1 i386 x86at SCO UNIX_SVR5
> $
>
> I think some of these are configurable...

They both are.  FIFOBLKSIZE and MSGMNB or some such kernel tunable.  I
can get more info if you need it.

-- Larry Rosenman <ler@lerctr.org>  http://www.lerctr.org/~ler
At 13:49 16/03/01 -0500, Jan Wieck wrote:
> Similar problem as with shared memory - size.  If a long-running backend
> of a multi-thousand-table database needs to send access stats per table -
> and had accessed them all up to now - it'll be a lot of wasted bandwidth.

Not if you only send totals for individual counters when they change; some
stats may never be resynced, but for the most part it will work.  Also, does
Unix allow interrupts to occur as a result of data arriving in a pipe?  If
so, how about:

- All backends do *blocking* IO to the collector.

- The collector receives an interrupt when a message arrives; while in the
  interrupt it reads the buffer into a local queue, and returns from the
  interrupt.

- Main-line code processes the queue and writes it to a memory-mapped file
  for durability.

- If the collector dies, the postmaster starts another immediately, which
  clears the backlog of data in the pipe and then remaps the file.

- Each backend has its own local copy of its counters which the collector
  can *possibly* ask for when it restarts.

-- Philip Warner  |  Albatross Consulting Pty. Ltd.  |  http://www.rhyme.com.au
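A sketch of the memory-mapped-file half of this proposal, with an invented
file name and record layout (the interrupt-driven draining was sketched
earlier in the thread); a restarted collector simply re-maps the same file
and picks up the old totals:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAX_BACKENDS 128

    struct backend_stats
    {
        long    queries;
        long    blocks_read;
        long    blocks_written;
    };

    struct backend_stats *
    map_stats_file(const char *path)
    {
        size_t  size = MAX_BACKENDS * sizeof(struct backend_stats);
        int     fd = open(path, O_RDWR | O_CREAT, 0600);
        void   *p;

        if (fd < 0)
            return NULL;
        if (ftruncate(fd, (off_t) size) != 0)   /* make sure it is big enough */
        {
            close(fd);
            return NULL;
        }

        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                              /* mapping stays valid */

        return (p == MAP_FAILED) ? NULL : (struct backend_stats *) p;
    }

    /* main-line code: fold one drained message into the mapped totals */
    void
    apply_message(struct backend_stats *stats, int backend, long nblocks_read)
    {
        stats[backend].queries++;
        stats[backend].blocks_read += nblocks_read;
        /* msync(2) could be called periodically if stricter durability
         * across crashes is wanted; MAP_SHARED already survives a
         * collector restart as long as the file remains. */
    }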
Philip Warner wrote:
> At 13:49 16/03/01 -0500, Jan Wieck wrote:
> > Similar problem as with shared memory - size.  If a long-running backend
> > of a multi-thousand-table database needs to send access stats per table -
> > and had accessed them all up to now - it'll be a lot of wasted bandwidth.
>
> Not if you only send totals for individual counters when they change; some
> stats may never be resynced, but for the most part it will work.  Also, does
> Unix allow interrupts to occur as a result of data arriving in a pipe?  If
> so, how about:
>
> - All backends do *blocking* IO to the collector.

The general problem remains.  We only have one central collector with a
limited receive capacity.  The more load is on the machine, the smaller
its capacity gets.  The more complex the DB schemas get and the more load
is on the system, the more interesting accurate statistics get.  Both
factors are counterproductive.  A more complex schema means more tables
and thus bigger messages.  More load means more messages.  Having good
statistics on a toy system while they get worse for a web backend server
that's really under pressure is braindead from the start.

We don't want the backends to block, so that they can do THEIR work.
That's to process queries, nothing else.

Pipes seem to be inappropriate because their buffer is limited to 4K on
Linux and most BSD flavours.  Message queues are too, because they are
limited to 2K on most BSDs.  So only sockets remain.

If we have multiple processes that try to receive from the UDP socket,
condense the received packets into summary messages and send them to the
central collector, this might solve the problem.

Jan

-- Jan Wieck <JanWieck@Yahoo.com>
On Sat, Mar 17, 2001 at 09:33:03AM -0500, Jan Wieck wrote:
> The general problem remains.  We only have one central collector with a
> limited receive capacity.  The more load is on the machine, the smaller
> its capacity gets.  The more complex the DB schemas get and the more load
> is on the system, the more interesting accurate statistics get.  Both
> factors are counterproductive.  A more complex schema means more tables
> and thus bigger messages.  More load means more messages.  Having good
> statistics on a toy system while they get worse for a web backend server
> that's really under pressure is braindead from the start.

Just as another suggestion, what about sending the data to a different
computer, so instead of tying up the database server with processing the
statistics, you have another computer that has some free time to do the
processing?

Some drawbacks are that you can't automatically start/restart it from the
postmaster and it will put a little more load on the network, but it seems
to mostly solve the issues of blocked pipes and using too much cpu time on
the database server.
Samuel Sieb <samuel@sieb.net> writes:
> Just as another suggestion, what about sending the data to a different
> computer, so instead of tying up the database server with processing the
> statistics, you have another computer that has some free time to do the
> processing?
> Some drawbacks are that you can't automatically start/restart it from the
> postmaster and it will put a little more load on the network, ...

... and a lot more load on the CPU.  Same-machine "network" connections
are much cheaper (on most kernels, anyway) than real network
connections.

I think all of this discussion is vast overkill.  No one has yet
demonstrated that it's not sufficient to have *one* collector process
and a lossy transmission method.  Let's try that first, and if it really
proves to be unworkable then we can get out the lily-gilding equipment.
But there is tons more stuff to do before we have useful stats at all,
and I don't think that this aspect is the most critical part of the
problem.

regards, tom lane
> ... and a lot more load on the CPU. Same-machine "network" connections > are much cheaper (on most kernels, anyway) than real network > connections. > > I think all of this discussion is vast overkill. No one has yet > demonstrated that it's not sufficient to have *one* collector process > and a lossy transmission method. Let's try that first, and if it really > proves to be unworkable then we can get out the lily-gilding equipment. > But there is tons more stuff to do before we have useful stats at all, > and I don't think that this aspect is the most critical part of the > problem. Agreed. Sounds like overkill. How about a per-backend shared memory area for stats, plus a global shared memory area that each backend can add to when it exits? That addresses most of the problem. The only open issue is per-table stuff, and I would like to see some circular buffer implemented to handle that, with a collection process that has access to shared memory. Even better, have an SQL table updated with the per-table stats periodically. How about a collector process that periodically reads through the shared memory and UPDATEs SQL tables with the information? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > The only open issue is per-table stuff, and I would like to see some > circular buffer implemented to handle that, with a collection process > that has access to shared memory. That will get us into locking/contention issues. OTOH, frequent trips to the kernel to send stats messages --- regardless of the transport mechanism chosen --- don't seem all that cheap either. > Even better, have an SQL table updated with the per-table stats > periodically. That will be horribly expensive, if it's a real table. I think you missed the point that somebody made a little while ago about waiting for functions that can return tuple sets. Once we have that, the stats tables can be *virtual* tables, ie tables that are computed on-demand by some function. That will be a lot less overhead than physically updating an actual table. regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes: > > The only open issue is per-table stuff, and I would like to see some > > circular buffer implemented to handle that, with a collection process > > that has access to shared memory. > > That will get us into locking/contention issues. OTOH, frequent trips > to the kernel to send stats messages --- regardless of the transport > mechanism chosen --- don't seem all that cheap either. I am confused. Reading/writing shared memory is not a kernel call, right? I agree on the locking contention problems of a circular buffer. > > > Even better, have an SQL table updated with the per-table stats > > periodically. > > That will be horribly expensive, if it's a real table. But per-table stats aren't something that people will look at often, right? They can sit in the collector's memory for quite a while. I see people wanting to look at per-backend stuff frequently, and that is why I thought shared memory would be good, plus a global area for aggregate stats for all backends. > I think you missed the point that somebody made a little while ago > about waiting for functions that can return tuple sets. Once we have > that, the stats tables can be *virtual* tables, ie tables that are > computed on-demand by some function. That will be a lot less overhead > than physically updating an actual table. Yes, but do we want to keep these stats between postmaster restarts? And what about writing them to tables when our storage of table stats gets too big? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Even better, have an SQL table updated with the per-table stats > periodically. >> >> That will be horribly expensive, if it's a real table. > But per-table stats aren't something that people will look at often, > right? They can sit in the collector's memory for quite a while. See > people wanting to look at per-backend stuff frequently, and that is why > I thought share memory should be good, and a global area for aggregate > stats for all backends. >> I think you missed the point that somebody made a little while ago >> about waiting for functions that can return tuple sets. Once we have >> that, the stats tables can be *virtual* tables, ie tables that are >> computed on-demand by some function. That will be a lot less overhead >> than physically updating an actual table. > Yes, but do we want to keep these stats between postmaster restarts? > And what about writing them to tables when our storage of table stats > gets too big? All those points seem to me to be arguments in *favor* of a virtual- table approach, not arguments against it. Or are you confusing the method of collecting stats with the method of making the collected stats available for use? regards, tom lane
> > But per-table stats aren't something that people will look at often, > > right? They can sit in the collector's memory for quite a while. I see > > people wanting to look at per-backend stuff frequently, and that is why > > I thought shared memory would be good, plus a global area for aggregate > > stats for all backends. > > >> I think you missed the point that somebody made a little while ago > >> about waiting for functions that can return tuple sets. Once we have > >> that, the stats tables can be *virtual* tables, ie tables that are > >> computed on-demand by some function. That will be a lot less overhead > >> than physically updating an actual table. > > > Yes, but do we want to keep these stats between postmaster restarts? > > And what about writing them to tables when our storage of table stats > > gets too big? > > All those points seem to me to be arguments in *favor* of a virtual- > table approach, not arguments against it. > > Or are you confusing the method of collecting stats with the method > of making the collected stats available for use? Maybe I am confusing them. I didn't see a distinction in the discussion. I assumed the UDP/message passing of information to the collector was the way statistics were collected, and I don't understand why a per-backend area and global area, with some kind of circular buffer for per-table stuff, isn't the cheapest, cleanest solution. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Tom Lane wrote: > Samuel Sieb <samuel@sieb.net> writes: > > Just as another suggestion, what about sending the data to a different > > computer, so instead of tying up the database server with processing the > > statistics, you have another computer that has some free time to do the > > processing. > > > Some drawbacks are that you can't automatically start/restart it from the > > postmaster and it will put a little more load on the network, > > ... and a lot more load on the CPU. Same-machine "network" connections > are much cheaper (on most kernels, anyway) than real network > connections. > > I think all of this discussion is vast overkill. No one has yet > demonstrated that it's not sufficient to have *one* collector process > and a lossy transmission method. Let's try that first, and if it really > proves to be unworkable then we can get out the lily-gilding equipment. > But there is tons more stuff to do before we have useful stats at all, > and I don't think that this aspect is the most critical part of the > problem. Well, back to my initial approach with the UDP socket collector. I now have a collector simply reading all messages from the socket. It doesn't do anything useful except for counting their number. Every backend sends a couple of 1K junk messages at the beginning of the main loop. Up to 16 messages, there is no time(1)-measurable delay in the execution of the "make runcheck". The dummy collector can keep up during the parallel regression test until the backends send 64 messages each time; at that point it lost 1.25% of the messages. That is >256MB of statistics data to be collected. Most of the test queries will never generate 1K of message, so there should be some headroom here. My plan now is to add some real functionality to the collector and the backend, to see if that has an impact. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
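For reference, a rough approximation of the dummy collector described here (Jan's actual test code is not shown; the port number and buffer size are arbitrary): it binds a UDP socket on the loopback interface and just counts whatever arrives.

    #include <stdio.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int
    main(void)
    {
        int                 sock;
        struct sockaddr_in  addr;
        char                buf[1024];
        long                nmsgs = 0;

        sock = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = htons(9999);        /* arbitrary test port */
        bind(sock, (struct sockaddr *) &addr, sizeof(addr));

        /* Count messages; do nothing useful with their contents. */
        for (;;)
        {
            if (recv(sock, buf, sizeof(buf), 0) > 0 && ++nmsgs % 1000 == 0)
                printf("received %ld messages\n", nmsgs);
        }
    }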
On Fri, Mar 16, 2001 at 05:25:24PM -0500, Jan Wieck wrote: > Jan Wieck wrote: ... > Just to get some evidence at hand - could some owners of > different platforms compile and run the attached little C > source please? ... > Seems Tom is (unfortunately) right. The pipe blocks at 4K. On NetBSD-1.5S/i386 with just the highly conservative shmem defaults: Pipe buffer is 4096 bytes; Sys-V message queue buffer is 2048 bytes. Cheers, Patrick
Jan Wieck <JanWieck@yahoo.com> writes: > Just to get some evidence at hand - could some owners of > different platforms compile and run the attached little C > source please? > (The program tests how much data can be stuffed into a pipe > or a Sys-V message queue before the writer would block or get > an EAGAIN error). One final followup on this --- I wasted a fair amount of time just now trying to figure out why Perl 5.6.0 was silently hanging up in its self-tests (at op/taint, which seems pretty unrelated...). The upshot: Jan's test program had left a 16k SysV message queue hanging about, and that queue was filling all available SysV message space on my machine. Seems Perl tries to test message-queue sending, and it was patiently waiting for some message space to come free. In short, the SysV message queue limits are so tiny that not only are you quite likely to get bollixed up if you use messages, but you're likely to bollix anything else that's using message queues too. regards, tom lane
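For anyone who wants to repeat the experiment, here is a rough approximation of such a probe (not Jan's actual attachment): it stuffs 1K blocks into a non-blocking pipe and into a private SysV message queue until the kernel refuses more, and, per the warning above, removes the queue afterwards instead of leaving it behind.

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>

    struct probemsg
    {
        long mtype;
        char mtext[1024];
    };

    int
    main(void)
    {
        int             pfd[2];
        char            block[1024];
        long            total;
        int             qid;
        struct probemsg msg;

        memset(block, 'x', sizeof(block));

        /* How much fits into a pipe before the writer would block? */
        pipe(pfd);
        fcntl(pfd[1], F_SETFL, O_NONBLOCK);
        for (total = 0; write(pfd[1], block, sizeof(block)) > 0; )
            total += sizeof(block);
        printf("Pipe buffer is at least %ld bytes\n", total);

        /* Same question for a private SysV message queue. */
        qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
        msg.mtype = 1;
        memset(msg.mtext, 'x', sizeof(msg.mtext));
        for (total = 0;
             msgsnd(qid, &msg, sizeof(msg.mtext), IPC_NOWAIT) == 0; )
            total += sizeof(msg.mtext);
        printf("Message queue buffer is at least %ld bytes\n", total);

        msgctl(qid, IPC_RMID, NULL);    /* don't leave the queue behind */
        return 0;
    }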
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Only shared memory gives us near-zero cost for write/read. 99% of > backends will not be using stats, so it has to be cheap. Not with a circular buffer it's not cheap, because you need interlocking on writes. Your claim that you can get away without that is simply false. You won't just get lost messages, you'll get corrupted messages. > The collector program can read the shared memory stats and keep hashed > values of accumulated stats. It uses the "Loops" variable to know if it > has read the current information in the buffer. And how does it sleep until the counter has been advanced? Seems to me it has to busy-wait (bad) or sleep (worse; if the minimum sleep delay is 10 ms then it's guaranteed to miss a lot of data under load). regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Only shared memory gives us near-zero cost for write/read. 99% of > > backends will not be using stats, so it has to be cheap. > > Not with a circular buffer it's not cheap, because you need interlocking > on writes. Your claim that you can get away without that is simply > false. You won't just get lost messages, you'll get corrupted messages. How do I get corrupt messages if they are all five bytes? If I write five bytes, and another does the same, I guess the assembler could intersperse the writes so the oid gets to be a corrupt value. Any cheap way around this, perhaps by skipping/clearing the write on a collision? > > > The collector program can read the shared memory stats and keep hashed > > values of accumulated stats. It uses the "Loops" variable to know if it > > has read the current information in the buffer. > > And how does it sleep until the counter has been advanced? Seems to me > it has to busy-wait (bad) or sleep (worse; if the minimum sleep delay > is 10 ms then it's guaranteed to miss a lot of data under load). I figured it could just wake up every few seconds and check. It will remember the loop counter and current pointer, and read any new information. I was thinking of a 20k buffer, which could cover about 4k events. Should we think about doing these writes into an OS file, and only enabling the writes when we know there is a collector reading them, perhaps using a /tmp file to activate recording? We could allocate 1MB and be sure not to miss anything, even with a circular setup. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
I have a new statistics collection proposal. I suggest three shared memory areas:
- One per backend to hold the query string and other per-backend stats
- One global area to hold accumulated stats for all backends
- One global circular buffer to hold per-table/object stats
The circular buffer will look like:
    (Loops)  Start---------------------------End
                            |
                     current pointer
Loops is incremented every time the pointer reaches "end". Each statistics record will have a length of five bytes, made up of oid (4) and action (1). By having the same length for all statistics records, we don't need to perform any locking of the buffer. A backend will grab the current pointer, add five to it, and write into the reserved 5-byte area. If two backends write at the same time, one overwrites the other, but this is just statistics information, so it is not a great loss. Only shared memory gives us near-zero cost for write/read. 99% of backends will not be using stats, so it has to be cheap. The collector program can read the shared memory stats and keep hashed values of accumulated stats. It uses the "Loops" variable to know if it has read the current information in the buffer. When it receives a signal, it can dump its stats to a file in standard COPY format of <oid><tab><action><tab><count>. It can also reset its counters with a signal. Comments? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
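A sketch of the shared-memory layout this proposal describes (names and sizes are illustrative, not an implementation). Note that without interlocking, two writers can still tear each other's five-byte slots, which is exactly the objection raised above.

    #include <stdint.h>

    #define PGSTAT_RING_RECORDS  4096          /* ~20 KB of 5-byte records */

    typedef struct PgStatRecord
    {
        uint32_t oid;                          /* table/object identifier */
        uint8_t  action;                       /* what happened to it */
    } __attribute__((packed)) PgStatRecord;    /* gcc-style packing, only to
                                                * match the 5-byte figure */

    typedef struct PgStatRing
    {
        uint32_t     loops;                    /* bumped on each wrap-around */
        uint32_t     current;                  /* next slot to write */
        PgStatRecord records[PGSTAT_RING_RECORDS];
    } PgStatRing;

    /* A backend appends one record; the collector polls loops/current. */
    static void
    pgstat_ring_put(PgStatRing *ring, uint32_t oid, uint8_t action)
    {
        uint32_t slot = ring->current;         /* unlocked: writes may race */

        ring->records[slot].oid = oid;
        ring->records[slot].action = action;

        if (++slot == PGSTAT_RING_RECORDS)
        {
            slot = 0;
            ring->loops++;                     /* tells the reader we wrapped */
        }
        ring->current = slot;
    }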
Bruce Momjian wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Only shared memory gives us near-zero cost for write/read. 99% of > > > backends will not be using stats, so it has to be cheap. > > > > Not with a circular buffer it's not cheap, because you need interlocking > > on writes. Your claim that you can get away without that is simply > > false. You won't just get lost messages, you'll get corrupted messages. > > How do I get corrupt messages if they are all five bytes? If I write > five bytes, and another does the same, I guess the assembler could > intersperse the writes so the oid gets to be a corrupt value. Any cheap > way around this, perhaps by skipping/clearing the write on a collision? > > > > > > The collector program can read the shared memory stats and keep hashed > > > values of accumulated stats. It uses the "Loops" variable to know if it > > > has read the current information in the buffer. > > > > And how does it sleep until the counter has been advanced? Seems to me > > it has to busy-wait (bad) or sleep (worse; if the minimum sleep delay > > is 10 ms then it's guaranteed to miss a lot of data under load). > > I figured it could just wake up every few seconds and check. It will > remember the loop counter and current pointer, and read any new > information. I was thinking of a 20k buffer, which could cover about 4k > events. Here I wonder what your EVENT is. With an Oid as identifier and a 1 byte (even if it'd be another 32-bit value), how many messages do you want to generate to get these statistics:
- Number of sequential scans done per table.
- Number of tuples returned via sequential scans per table.
- Number of buffer cache lookups done through sequential scans per table.
- Number of buffer cache hits for sequential scans per table.
- Number of tuples inserted per table.
- Number of tuples updated per table.
- Number of tuples deleted per table.
- Number of index scans done per index.
- Number of index tuples returned per index.
- Number of buffer cache lookups done due to scans per index.
- Number of buffer cache hits per index.
- Number of valid heap tuples returned via index scan per index.
- Number of buffer cache lookups done for heap fetches via index scan per index.
- Number of buffer cache hits for heap fetches via index scan per index.
- Number of buffer cache lookups not accountable for any of the above.
- Number of buffer cache hits not accountable for any of the above.
What I see is that there's a difference in what we two want to see in the statistics. You're talking about looking at the actual query string and such. That's information useful for someone actually looking at a server, to see what a particular backend is doing. On my notebook a parallel regression test (containing >4,000 queries) passes by in under 1:30; that's more than 40 queries per second. So that doesn't tell me much. What I'm after is to collect the above data over a week or so and then generate a report to identify the hot spots of the schema: which tables/indices cause the most disk I/O, what's the average percentage of tuples returned in scans (not from the query, I mean from the single scan inside of the joins). That's the information I need to know where to look for possibly better qualifications, useless indices that aren't worth maintaining, and the like. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. 
# #================================================== JanWieck@Yahoo.com #
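To make the shape of the data concrete, here is an illustrative set of per-table and per-index counter blocks matching the list above (field names are invented for the sketch; they are not the actual pgstat structures).

    #include <stdint.h>

    typedef struct PgStatTableCounts
    {
        uint64_t seq_scans;            /* sequential scans on the table */
        uint64_t seq_tuples_returned;  /* tuples returned by those scans */
        uint64_t tuples_inserted;
        uint64_t tuples_updated;
        uint64_t tuples_deleted;
        uint64_t blocks_fetched;       /* buffer cache lookups */
        uint64_t blocks_hit;           /* lookups satisfied from the cache */
    } PgStatTableCounts;

    typedef struct PgStatIndexCounts
    {
        uint64_t idx_scans;            /* index scans on the index */
        uint64_t idx_tuples_returned;  /* index entries returned */
        uint64_t heap_tuples_fetched;  /* valid heap tuples reached via it */
        uint64_t blocks_fetched;
        uint64_t blocks_hit;
    } PgStatIndexCounts;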
> > I figured it could just wake up every few seconds and check. It will > > remember the loop counter and current pointer, and read any new > > information. I was thinking of a 20k buffer, which could cover about 4k > > events. > > Here I wonder what your EVENT is. With an Oid as identifier > and a 1 byte (even if it'd be another 32-bit value), how many > messages do you want to generate to get these statistics: > > - Number of sequential scans done per table. > - Number of tuples returned via sequential scans per table. > - Number of buffer cache lookups done through sequential > scans per table. > - Number of buffer cache hits for sequential scans per > table. > - Number of tuples inserted per table. > - Number of tuples updated per table. > - Number of tuples deleted per table. > - Number of index scans done per index. > - Number of index tuples returned per index. > - Number of buffer cache lookups done due to scans per > index. > - Number of buffer cache hits per index. > - Number of valid heap tuples returned via index scan per > index. > - Number of buffer cache lookups done for heap fetches via > index scan per index. > - Number of buffer cache hits for heap fetches via index > scan per index. > - Number of buffer cache lookups not accountable for any of > the above. > - Number of buffer cache hits not accountable for any of > the above. > > What I see is that there's a difference in what we two want > to see in the statistics. You're talking about looking at the > actual querystring and such. That's information useful for > someone actually looking at a server, to see what a > particular backend is doing. On my notebook a parallel > regression test (containing >4,000 queries) passes by under > 1:30, that's more than 40 queries per second. So that doesn't > tell me much. > > What I'm after is to collect the above data over a week or so > and then generate a report to identify the hot spots of the > schema. Which tables/indices cause the most disk I/O, what's > the average percentage of tuples returned in scans (not from > the query, I mean from the single scan inside of the joins). > That's the information I need to know where to look for > possibly better qualifications, useless indices that aren't > worth maintaining and the like. > I was going to have the per-table stats insert a stat record every time it does a sequential scan, so it should be [oid][sequential_scan_value] and allow the collector to gather that and aggregate it. I didn't think we wanted each backend to do the aggregation per oid. Seems expensive. Maybe we would need a count for things like "number of rows returned" so it would be [oid][stat_type][value]. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
I have talked to Jan over the phone, and he has convinced me that UDP is the proper way to communicate stats to the collector, rather than my shared memory idea. The advantages of his UDP approach are that the collector can sleep on the UDP socket rather than having the collector poll the shared memory area. It also has the auto-discard option. He will make logging configurable on a per-database level, so it can be turned off when not in use. He has a trial UDP implementation that he will post soon. Also, I asked him to try DGRAM Unix-domain sockets for performance reasons. My Stevens book says they should be supported. He can put the socket file in /data. > > > I figured it could just wake up every few seconds and check. It will > > > remember the loop counter and current pointer, and read any new > > > information. I was thinking of a 20k buffer, which could cover about 4k > > > events. > > > > Here I wonder what your EVENT is. With an Oid as identifier > > and a 1 byte (even if it'd be another 32-bit value), how many > > messages do you want to generate to get these statistics: > > > > - Number of sequential scans done per table. > > - Number of tuples returned via sequential scans per table. > > - Number of buffer cache lookups done through sequential > > scans per table. > > - Number of buffer cache hits for sequential scans per > > table. > > - Number of tuples inserted per table. > > - Number of tuples updated per table. > > - Number of tuples deleted per table. > > - Number of index scans done per index. > > - Number of index tuples returned per index. > > - Number of buffer cache lookups done due to scans per > > index. > > - Number of buffer cache hits per index. > > - Number of valid heap tuples returned via index scan per > > index. > > - Number of buffer cache lookups done for heap fetches via > > index scan per index. > > - Number of buffer cache hits for heap fetches via index > > scan per index. > > - Number of buffer cache lookups not accountable for any of > > the above. > > - Number of buffer cache hits not accountable for any of > > the above. > > > > What I see is that there's a difference in what we two want > > to see in the statistics. You're talking about looking at the > > actual querystring and such. That's information useful for > > someone actually looking at a server, to see what a > > particular backend is doing. On my notebook a parallel > > regression test (containing >4,000 queries) passes by under > > 1:30, that's more than 40 queries per second. So that doesn't > > tell me much. > > > > What I'm after is to collect the above data over a week or so > > and then generate a report to identify the hot spots of the > > schema. Which tables/indices cause the most disk I/O, what's > > the average percentage of tuples returned in scans (not from > > the query, I mean from the single scan inside of the joins). > > That's the information I need to know where to look for > > possibly better qualifications, useless indices that aren't > > worth maintaining and the like. > > > > I was going to have the per-table stats insert a stat record every time > it does a sequential scan, so it should be [oid][sequential_scan_value] > and allow the collector to gather that and aggregate it. > > I didn't think we wanted each backend to do the aggregation per oid. > Seems expensive. Maybe we would need a count for things like "number of > rows returned" so it would be [oid][stat_type][value]. 
> > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
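A minimal sketch of the Unix-domain datagram collector suggested here (the socket file name under the data directory is invented for the example): the collector binds a PF_UNIX SOCK_DGRAM socket and then simply sleeps in recv() until a backend sends something.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    int
    main(void)
    {
        int                 sock;
        struct sockaddr_un  addr;
        char                buf[1024];
        ssize_t             n;

        /* Unix-domain datagram socket; the socket file lives in the data dir. */
        sock = socket(PF_UNIX, SOCK_DGRAM, 0);

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, "data/.s.PGSTAT", sizeof(addr.sun_path) - 1);

        unlink(addr.sun_path);              /* clean up a stale socket file */
        if (bind(sock, (struct sockaddr *) &addr, sizeof(addr)) < 0)
        {
            perror("bind");
            return 1;
        }

        /* The collector blocks here; no polling, no busy-wait. */
        while ((n = recv(sock, buf, sizeof(buf), 0)) >= 0)
            printf("got %zd-byte stats message\n", n);

        return 0;
    }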
Bruce Momjian wrote: > I have talked to Jan over the phone, and he has convinced me that UDP is > the proper way to communicate stats to the collector, rather than my > shared memory idea. > > The advantages of his UDP approach are that the collector can sleep on > the UDP socket rather than having the collector poll the shared memory > area. It also has the auto-discard option. He will make logging > configurable on a per-database level, so it can be turned off when not > in use. > > He has a trial UDP implementation that he will post soon. Also, I asked > him to try DGRAM Unix-domain sockets for performance reasons. My > Stevens book says they should be supported. He can put the socket > file in /data. "Trial" implementation attached :-) First attachment is a patch for various backend files plus generating two new source files. If your patch(1) doesn't put 'em automatically, they go to src/include/pgstat.h and src/backend/postmaster/pgstat.c. BTW: tgl on 2/99 was right, the hash_destroy() really crashes. Maybe we want to pull out the fix I've done (it includes a new feature for hash table memory allocation) and apply that to 7.1? Second attachment is a tarfile that should unpack to contrib/pgstat_tmp. I've placed the SQL-level functions into a shared module for now. The SQL script also creates a couple of views:
- pgstat_all_tables shows scan- and tuple-based statistics for all tables. pgstat_sys_tables and pgstat_user_tables filter out (you guess what) system or user tables.
- pgstatio_all_tables, pgstatio_sys_tables and pgstatio_user_tables show buffer IO statistics for tables.
- pgstat_*_indexes and pgstatio_*_indexes are similar to the above, except that they give detailed info about each single index.
- pgstatio_*_sequences shows buffer IO statistics about - right, sequences. Since sequences aren't scanned regularly, they have no scan- and tuple-related view.
- pgstat_activity shows information about all currently running backends of the entire instance. The underlying function for displaying the actual query always returns NULL for non-superusers.
- pgstat_database shows transaction commit/abort counts and cumulated buffer IO statistics for all existing databases.
The collector frequently writes a file, data/pgstat.stat (approximately every 500 milliseconds, as long as there is something to tell, so nothing is done if the entire installation sleeps). It also reads this file on startup, so collected statistics survive postmaster restarts. TODO:
- Are PF_UNIX SOCK_DGRAM sockets supported on all the platforms we support? If not, what's wrong with the current implementation?
- There is no way yet to tell the collector about objects (relations and databases) removed from the database. Basically that could be done with messages too, but who will send them, and how can we guarantee that they'll be generated even if somebody never queries the statistics? Thus, the current collector will grow, and grow, and grow until you remove the pgstat.stat file while the postmaster is down.
- Also, there aren't functions or messages implemented to explicitly reset statistics.
- Possible additions would be to remember when the backends started and to collect resource usage (rstat(2)) information as well.
- The entire thing needs an additional attribute in pg_database that tells the backends what to tell the collector at all. Just to make them quiet again.
So far for an actual snapshot. Comments? 
Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
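In the same spirit, here is a sketch of how the collector's periodic dump could be written so a reader never sees a half-written pgstat.stat (file names and record layout are illustrative only, not the format Jan's code actually uses): write the counters to a temporary file and rename(2) it into place.

    #include <stdio.h>

    typedef struct StatEntry
    {
        unsigned int oid;
        long         seq_scans;
        long         tuples_returned;
    } StatEntry;

    int
    pgstat_write_file(const StatEntry *entries, int nentries)
    {
        const char *tmpfile  = "data/pgstat.tmp";
        const char *statfile = "data/pgstat.stat";
        FILE       *fp = fopen(tmpfile, "w");
        int         i;

        if (fp == NULL)
            return -1;

        /* One tab-separated line per object; format is illustrative. */
        for (i = 0; i < nentries; i++)
            fprintf(fp, "%u\t%ld\t%ld\n",
                    entries[i].oid,
                    entries[i].seq_scans,
                    entries[i].tuples_returned);

        if (fclose(fp) != 0)
            return -1;

        /* Atomic replacement: readers see either the old or the new file. */
        return rename(tmpfile, statfile);
    }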