Thread: Performance monitor signal handler

Performance monitor signal handler

From
Bruce Momjian
Date:
I was going to implement the signal handler like we do with Cancel,
where the signal sets a flag and we check the status of the flag in
various _safe_ places.

Can anyone think of a better way to get information out of a backend?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Performance monitor signal handler

From
Alfred Perlstein
Date:
* Bruce Momjian <pgman@candle.pha.pa.us> [010312 12:12] wrote:
> I was going to implement the signal handler like we do with Cancel,
> where the signal sets a flag and we check the status of the flag in
> various _safe_ places.
> 
> Can anyone think of a better way to get information out of a backend?

Why not use a static area of the shared memory segment?  Is it possible
to have a spinlock over it so that an external utility can take a snapshot
of it with the spinlock held?
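
A rough sketch of the idea, assuming a per-backend slot guarded by a
simple spinlock; the struct and function names here are invented for
illustration, not actual PostgreSQL code:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical per-backend statistics slot in the shared memory
     * segment; the owning backend is the only writer. */
    typedef struct BackendStatsSlot
    {
        int       lock;             /* spinlock word, 0 = free */
        int       backend_pid;
        uint32_t  blocks_read;
        uint32_t  blocks_written;
        uint32_t  queries_run;
    } BackendStatsSlot;

    /* Reader side: copy the slot out while holding the spinlock, so
     * the external utility never sees a half-updated record.  GCC
     * atomic builtins stand in for a platform TAS() primitive. */
    static void
    snapshot_slot(BackendStatsSlot *slot, BackendStatsSlot *out)
    {
        while (__sync_lock_test_and_set(&slot->lock, 1))
            ;                       /* spin until the lock is free */
        memcpy(out, slot, sizeof(*out));
        __sync_lock_release(&slot->lock);
    }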

Also, this could work for other stuff as well: instead of overloading
a lot of signal handlers, one could just periodically poll a region of
the shared segment.

just some ideas..

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/



Re: Performance monitor signal handler

From
Philip Warner
Date:
At 13:34 12/03/01 -0800, Alfred Perlstein wrote:
>Is it possible
>to have a spinlock over it so that an external utility can take a snapshot
>of it with the spinlock held?

I'd suggest that locking the stats area might be a bad idea; there is only
one writer for each backend-specific chunk, and it won't matter a hell of a
lot if a reader gets inconsistent views (since I assume they will be
re-reading every second or so). All the stats area should contain, I think,
is a bunch of counters with timestamps, and the cost of writing to it
should be kept to an absolute minimum.


>
>just some ideas..
>

Unfortunately, based on prior discussions, Bruce seems quite opposed to a
shared memory solution.


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Performance monitor signal handler

From
Alfred Perlstein
Date:
* Philip Warner <pjw@rhyme.com.au> [010312 18:56] wrote:
> At 13:34 12/03/01 -0800, Alfred Perlstein wrote:
> >Is it possible
> >to have a spinlock over it so that an external utility can take a snapshot
> >of it with the spinlock held?
> 
> I'd suggest that locking the stats area might be a bad idea; there is only
> one writer for each backend-specific chunk, and it won't matter a hell of a
> lot if a reader gets inconsistent views (since I assume they will be
> re-reading every second or so). All the stats area should contain, I think,
> is a bunch of counters with timestamps, and the cost of writing to it
> should be kept to an absolute minimum.
> 
> 
> >
> >just some ideas..
> >
> 
> Unfortunately, based on prior discussions, Bruce seems quite opposed to a
> shared memory solution.

Ok, here's another nifty idea.

On receipt of the info signal, the backends collaborate to piece
together a status file.  The status file is given a temporary name.
When complete, the status file is rename(2)'d over a well-known
file.

This ought to always give a consistent snapshot of the file to
whomever opens it.
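
A minimal sketch of that rename(2) trick; the file names and the stats
payload are invented for illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Write the agreed-upon stats to a temporary file, then publish it
     * atomically: rename(2) is atomic on POSIX systems, so a reader
     * opening STATUS_FILE sees either the old or the new snapshot,
     * never a partially written one. */
    #define STATUS_FILE "/tmp/pgstats.status"   /* illustrative path */

    static int
    publish_stats(const char *stats_text)
    {
        char  tmpname[] = "/tmp/pgstats.XXXXXX";
        int   fd = mkstemp(tmpname);
        FILE *fp;

        if (fd < 0)
            return -1;
        if ((fp = fdopen(fd, "w")) == NULL)
        {
            close(fd);
            unlink(tmpname);
            return -1;
        }
        fputs(stats_text, fp);
        fclose(fp);                 /* flushes and closes fd too */
        return rename(tmpname, STATUS_FILE);
    }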

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/



Re: Performance monitor signal handler

From
Bruce Momjian
Date:
> > I think Tom has previously stated that there are technical reasons not to
> > do IO in signal handlers, and I have philosophical problems with
> > performance monitors that ask 50 backends to do file IO. I really do think
> > shared memory is TWTG.
> 
> I wasn't really suggesting any of those courses of action; all I
> suggested was using rename(2) to give a separate application a
> consistent snapshot of the stats.
> 
> Actually, what makes the most sense (although it may be a performance
> killer) is to have the backends update a system table that the external
> app can query.

Yes, it seems storing info in shared memory and having a system table to
access it is the way to go.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Performance monitor signal handler

From
Bruce Momjian
Date:
> >
> >This ought to always give a consistent snapshot of the file to
> >whomever opens it.
> >
> 
> I think Tom has previously stated that there are technical reasons not to
> do IO in signal handlers, and I have philosophical problems with
> performance monitors that ask 50 backends to do file IO. I really do think
> shared memory is TWTG.

The good news is that right now pgmonitor gets all its information from
'ps', and only shows the query when the user asks for it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Performance monitor signal handler

From
Philip Warner
Date:
>
>This ought to always give a consistent snapshot of the file to
>whomever opens it.
>

I think Tom has previously stated that there are technical reasons not to
do IO in signal handlers, and I have philosophical problems with
performance monitors that ask 50 backends to do file IO. I really do think
shared memory is TWTG.




----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Performance monitor signal handler

From
Alfred Perlstein
Date:
* Philip Warner <pjw@rhyme.com.au> [010313 06:42] wrote:
> >
> >This ought to always give a consistent snapshot of the file to
> >whomever opens it.
> >
> 
> I think Tom has previously stated that there are technical reasons not to
> do IO in signal handlers, and I have philosophical problems with
> performance monitors that ask 50 backends to do file IO. I really do think
> shared memory is TWTG.

I wasn't really suggesting any of those courses of action; all I
suggested was using rename(2) to give a separate application a
consistent snapshot of the stats.

Actually, what makes the most sense (although it may be a performance
killer) is to have the backends update a system table that the external
app can query.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/



Re: Performance monitor signal handler

From
Bruce Momjian
Date:
> At 13:34 12/03/01 -0800, Alfred Perlstein wrote:
> >Is it possible
> >to have a spinlock over it so that an external utility can take a snapshot
> >of it with the spinlock held?
> 
> I'd suggest that locking the stats area might be a bad idea; there is only
> one writer for each backend-specific chunk, and it won't matter a hell of a
> lot if a reader gets inconsistent views (since I assume they will be
> re-reading every second or so). All the stats area should contain, I think,
> is a bunch of counters with timestamps, and the cost of writing to it
> should be kept to an absolute minimum.
> 
> 
> >
> >just some ideas..
> >
> 
> Unfortunately, based on prior discussions, Bruce seems quite opposed to a
> shared memory solution.

No, I like the shared memory idea, but first, such an idea will have to
wait for 7.2, and second, there are limits to how much shared memory I
can use.

Eventually, I think shared memory will be the way to go.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Performance monitor signal handler

From
Thomas Swan
Date:
>On receipt of the info signal, the backends collaborate to piece
>together a status file.  The status file is given a temporary name.
>When complete, the status file is rename(2)'d over a well-known
>file.

Reporting to files, particularly well known ones, could lead to race
conditions.

All in all, I think you're better off passing messages through pipes or a
similar communication method.

I really liked the idea of a "server" that could parse/analyze data from 
multiple backends.

My 2/100 worth...





Re: Performance monitor signal handler

From
Alfred Perlstein
Date:
* Thomas Swan <tswan-lst@ics.olemiss.edu> [010313 13:37] wrote:
> 
> >On receipt of the info signal, the backends collaborate to piece
> >together a status file.  The status file is given a temporary name.
> >When complete, the status file is rename(2)'d over a well-known
> >file.
> 
> Reporting to files, particularly well known ones, could lead to race
> conditions.
> 
> All in all, I think you're better off passing messages through pipes or a
> similar communication method.
> 
> I really liked the idea of a "server" that could parse/analyze data from 
> multiple backends.
> 
> My 2/100 worth...

Take a few moments to think about the semantics of rename(2).

Yes, you would still need synchronization between the backend
processes to do this correctly, but not any external app.

The external app can just open the file; assuming it exists, it
will always have a complete and consistent snapshot of whatever
the backends agreed on.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/



Re: Performance monitor signal handler

From
Jan Wieck
Date:
Bruce Momjian wrote:
>
> Yes, it seems storing info in shared memory and having a system table to
> access it is the way to go.
    Depends,

    first of all we need to know WHAT we want to collect.  If we
    talk about block read/write statistics and such on a per
    table base, which is IMHO the most accurate thing for tuning
    purposes, then we're talking about an infinite size of shared
    memory perhaps.

    And shared memory has all the interlocking problems we want
    to avoid.

    What about a collector daemon, fired up by the postmaster and
    receiving UDP packets from the backends. Under heavy load, it
    might miss some statistic messages, well, but that's not as
    bad as having locks causing backends to lose performance.

    The postmaster could already provide the UDP socket for the
    backends, so the collector can know the peer address from
    which to accept statistics messages only. Any message from
    another peer address is simply ignored.  For getting the
    statistics out of it, the collector has his own server
    socket, using TCP and providing some lookup protocol.

    Now whatever the backend has to tell the collector, it simply
    throws a UDP packet into his direction. If the collector can
    catch it or not, not the backend's problem.
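
    A minimal sketch of that fire-and-forget send side, assuming the
    postmaster already handed each backend a connected UDP socket;
    the message layout is invented:

        #include <stdint.h>
        #include <sys/socket.h>

        /* Hypothetical statistics packet; a real one would carry
         * counters such as per-table block reads and writes. */
        typedef struct StatsMsg
        {
            int32_t   backend_pid;
            uint32_t  queries_run;
            uint32_t  blocks_read;
        } StatsMsg;

        /* Fire and forget: if the collector falls behind, datagrams
         * are simply dropped; the backend never waits for it. */
        static void
        send_stats(int stats_sock, const StatsMsg *msg)
        {
            (void) send(stats_sock, msg, sizeof(*msg), 0);
        }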
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #






Re: Performance monitor signal handler

From
Tom Lane
Date:
Jan Wieck <janwieck@Yahoo.com> writes:
>     What about a collector daemon, fired up by the postmaster and
>     receiving UDP packets from the backends. Under heavy load, it
>     might miss some statistic messages, well, but that's  not  as
>     bad as having locks causing backends to lose performance.

Interesting thought, but we don't want UDP I think; that just opens
up a whole can of worms about checking access permissions and so forth.
Why not a simple pipe?  The postmaster creates the pipe and the
collector daemon inherits one end, while all the backends inherit the
other end.
        regards, tom lane


Re: Performance monitor signal handler

From
Jan Wieck
Date:
Tom Lane wrote:
> Jan Wieck <janwieck@Yahoo.com> writes:
> >     What about a collector daemon, fired up by the postmaster and
> >     receiving UDP packets from the backends. Under heavy load, it
> >     might miss some statistic messages, well, but that's  not  as
> >     bad as having locks causing backends to lose performance.
>
> Interesting thought, but we don't want UDP I think; that just opens
> up a whole can of worms about checking access permissions and so forth.
> Why not a simple pipe?  The postmaster creates the pipe and the
> collector daemon inherits one end, while all the backends inherit the
> other end.
    I don't think so - though I haven't tested the following yet,
    but AFAIR it's correct.

    Have the postmaster creating two UDP sockets before it forks
    off the collector. It can examine the peer addresses of both,
    so they don't need well known port numbers; it can be the
    random ones assigned by the kernel.  Thus, we don't need
    SO_REUSE on them either.

    Now, since the collector is forked off by the postmaster, it
    knows the peer address of the other socket. And since all
    backends get forked off from the postmaster as well, they'll
    all use the same peer address, don't they? So all the
    collector has to look at is the sender address including port
    number of the packets.  It needs to be what the postmaster
    examined; anything else is from someone else and goes to bit
    heaven.  The same way the backends know where to send their
    statistics.

    If I'm right that in the case of fork() all children share
    the same socket with the same peer address, then it's even
    safe in the case the collector dies. The postmaster can still
    hold the collector's socket and will notice that the collector
    died (due to a wait() returning its PID) and can fire up
    another one. Again some packets got lost (plus all the so far
    collected statistics, hmmm - ain't that a cool way to reset
    statistic counters - killing the collector?), but it did not
    disturb any live backend in any way. They will never get any
    signal, don't care about what's done with their statistics
    and such. They just do their work...
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: Performance monitor signal handler

From
Philip Warner
Date:
At 06:57 15/03/01 -0500, Jan Wieck wrote:
>
>    And  shared  memory has all the interlocking problems we want
>    to avoid.

I suspect that if we keep per-backend data in a separate area, then we
don't need locking since there is only one writer. It does not matter if a
reader gets an inconsistent view, the same as if you drop a few UDP packets.


>    What about a collector daemon, fired up by the postmaster and
>    receiving UDP packets from the backends.

This does sound appealing; it means that individual backend data (IO etc)
will survive past the termination of the backend. I'd like to see the stats
survive the death of the collector if possible, possibly even survive a
stop/start of the postmaster.


>    Now whatever the backend has to tell the collector, it simply
>    throws  a UDP packet into his direction. If the collector can
>    catch it or not, not the backend's problem.

If we get the backends to keep the stats they are sending in local counters
as well, then they can send the counter value (not delta) each time, which
would mean that the collector would not 'miss' anything - just its
operations/sec might see a hiccough. This could have a side benefit that (if
we wanted to?) we could allow a client to query their own counters to get an
idea of the costs of their queries.

When we need to reset the counters, that should be done explicitly, I think.


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Performance monitor signal handler

From
Alfred Perlstein
Date:
* Philip Warner <pjw@rhyme.com.au> [010315 16:14] wrote:
> At 06:57 15/03/01 -0500, Jan Wieck wrote:
> >
> >    And  shared  memory has all the interlocking problems we want
> >    to avoid.
> 
> I suspect that if we keep per-backend data in a separate area, then we
> don't need locking since there is only one writer. It does not matter if a
> reader gets an inconsistent view, the same as if you drop a few UDP packets.

No, this is completely different.

Lost data is probably better than incorrect data.  Either use locks
or a copying mechanism.  People will depend on the data returned
making sense.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]



Re: Performance monitor signal handler

From
Philip Warner
Date:
At 16:17 15/03/01 -0800, Alfred Perlstein wrote:
>
>Lost data is probably better than incorrect data.  Either use locks
>or a copying mechanism.  People will depend on the data returned
>making sense.
>

But with per-backend data, there is only ever *one* writer to a given set
of counters. Everyone else is a reader.


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Performance monitor signal handler

From
Alfred Perlstein
Date:
* Philip Warner <pjw@rhyme.com.au> [010315 16:46] wrote:
> At 16:17 15/03/01 -0800, Alfred Perlstein wrote:
> >
> >Lost data is probably better than incorrect data.  Either use locks
> >or a copying mechanism.  People will depend on the data returned
> >making sense.
> >
> 
> But with per-backend data, there is only ever *one* writer to a given set
> of counters. Everyone else is a reader.

This doesn't prevent a reader from getting an inconsistent view.

Think about a 64bit counter on a 32bit machine.  If you charged per
megabyte, wouldn't it upset you to have a small chance of losing
4 billion units of sale?

(ie, doing a read after an addition that wraps the low 32 bits
but before the carry is done to the top most significant 32 bits?)

Ok, what if everything can be read atomically by itself?

You're still busted the minute you need to export any sort of
compound stat.

If A, B and C need to add up to 100 you have a read race.
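
A sketch of the torn read being described, modelling the 64-bit counter
explicitly as two 32-bit words (all names invented):

    #include <stdint.h>

    /* On a 32-bit machine a 64-bit counter is updated as two separate
     * 32-bit stores.  A reader sampling between the low-word wrap and
     * the carry into the high word sees a value off by 2^32. */
    typedef struct Counter64
    {
        uint32_t lo;
        uint32_t hi;
    } Counter64;

    static void
    counter_add(Counter64 *c, uint32_t n)
    {
        uint32_t old_lo = c->lo;

        c->lo += n;             /* store #1 */
        if (c->lo < old_lo)     /* low word wrapped ... */
            c->hi += 1;         /* ... store #2: a read between the two
                                 * stores is the 4-billion-unit race */
    }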

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]



Re: Performance monitor signal handler

From
Alfred Perlstein
Date:
* Philip Warner <pjw@rhyme.com.au> [010315 17:08] wrote:
> At 16:55 15/03/01 -0800, Alfred Perlstein wrote:
> >* Philip Warner <pjw@rhyme.com.au> [010315 16:46] wrote:
> >> At 16:17 15/03/01 -0800, Alfred Perlstein wrote:
> >> >
> >> >Lost data is probably better than incorrect data.  Either use locks
> >> >or a copying mechanism.  People will depend on the data returned
> >> >making sense.
> >> >
> >> 
> >> But with per-backend data, there is only ever *one* writer to a given set
> >> of counters. Everyone else is a reader.
> >
> >This doesn't prevent a reader from getting an inconsistent view.
> >
> >Think about a 64bit counter on a 32bit machine.  If you charged per
> >megabyte, wouldn't it upset you to have a small chance of losing
> >4 billion units of sale?
> >
> >(ie, doing a read after an addition that wraps the low 32 bits
> >but before the carry is done to the top most significant 32 bits?)
> 
> I assume this means we can not rely on the existence of any kind of
> interlocked add on 64 bit machines?
> 
> 
> >Ok, what if everything can be read atomically by itself?
> >
> >You're still busted the minute you need to export any sort of
> >compound stat.
> 
> Which is why the backends should not do anything other than maintain the
> raw data. If there is atomic data that can cause inconsistency, then a
> dropped UDP packet will do the same.

The UDP packet (a COPY) can contain a consistent snapshot of the data.
If you have dependencies, you fit a consistent snapshot into a single
packet.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]



Re: Performance monitor signal handler

From
Philip Warner
Date:
At 16:55 15/03/01 -0800, Alfred Perlstein wrote:
>* Philip Warner <pjw@rhyme.com.au> [010315 16:46] wrote:
>> At 16:17 15/03/01 -0800, Alfred Perlstein wrote:
>> >
>> >Lost data is probably better than incorrect data.  Either use locks
>> >or a copying mechanism.  People will depend on the data returned
>> >making sense.
>> >
>> 
>> But with per-backend data, there is only ever *one* writer to a given set
>> of counters. Everyone else is a reader.
>
>This doesn't prevent a reader from getting an inconsistent view.
>
>Think about a 64bit counter on a 32bit machine.  If you charged per
>megabyte, wouldn't it upset you to have a small chance of losing
>4 billion units of sale?
>
>(ie, doing a read after an addition that wraps the low 32 bits
>but before the carry is done to the top most significant 32 bits?)

I assume this means we can not rely on the existence of any kind of
interlocked add on 64 bit machines?


>Ok, what if everything can be read atomically by itself?
>
>You're still busted the minute you need to export any sort of
>compound stat.

Which is why the backends should not do anything other than maintain the
raw data. If there is atomic data that can cause inconsistency, then a
dropped UDP packet will do the same.




----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Performance monitor signal handler

From
Jan Wieck
Date:
Philip Warner wrote:
>
> But I prefer the UDP/Collector model anyway; it gives us greater
> flexibility + the ability to keep stats past backend termination, and, as
> you say, removes any possible locking requirements from the backends.
    OK, did some tests...

    The postmaster can create a SOCK_DGRAM socket at startup and
    bind(2) it to "127.0.0.1:0", which causes the kernel to assign
    a non-privileged port number that then can be read with
    getsockname(2). No other process can have a socket with the
    same port number for the lifetime of the postmaster.

    If the socket gets ready, it'll read one backend message
    from it with recvfrom(2).  The fromaddr must be
    "127.0.0.1:xxx" where xxx is the port number the kernel
    assigned to the above socket.  Yes, this is his own one,
    shared with postmaster and all backends.  So both the
    postmaster and the backends can use this one UDP socket,
    which the backends inherit on fork(2), to send messages to
    the collector. If such a UDP packet really came from a
    process other than the postmaster or a backend, well then the
    sysadmin has a more severe problem than manipulated DB
    runtime statistics :-)

    Running a 500MHz P-III, 192MB, RedHat 6.1 Linux 2.2.17 here,
    I've been able to lose no single message during the parallel
    regression test, if each backend sends one 1K sized message
    per query executed, and the collector simply sucks them out
    of the socket. Message losses start if the collector does a
    per message idle loop like this:

        for (i = 0, sum = 0; i < 250000; i++, sum += 1);

    Uh - not much time to spend if the statistics should at least
    be half accurate. And it would become worse in SMP systems.
    So that was a nifty idea, but I think it'd cause much more
    statistic losses than I assumed at first.

    Back to drawing board. Maybe a SYS-V message queue can serve?
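
    A minimal sketch of that socket setup, with error handling
    compressed and names invented:

        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <string.h>
        #include <sys/socket.h>

        /* Create the shared stats socket in the postmaster, before
         * any fork(): bind to 127.0.0.1 with port 0 so the kernel
         * picks a free non-privileged port, then read it back with
         * getsockname(2).  The one socket is inherited by collector
         * and backends alike. */
        static int
        create_stats_socket(struct sockaddr_in *addr_out)
        {
            int                sock = socket(AF_INET, SOCK_DGRAM, 0);
            struct sockaddr_in addr;
            socklen_t          len = sizeof(*addr_out);

            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
            addr.sin_port = 0;          /* kernel assigns the port */

            if (sock < 0 ||
                bind(sock, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
                getsockname(sock, (struct sockaddr *) addr_out, &len) < 0)
                return -1;
            return sock;
        }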


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #






Re: Performance monitor signal handler

From
Alfred Perlstein
Date:
* Jan Wieck <JanWieck@yahoo.com> [010316 08:08] wrote:
> Philip Warner wrote:
> >
> > But I prefer the UDP/Collector model anyway; it gives us greater
> > flexibility + the ability to keep stats past backend termination, and, as
> > you say, removes any possible locking requirements from the backends.
> 
>     OK, did some tests...
> 
>     The  postmaster can create a SOCK_DGRAM socket at startup and
>     bind(2) it to "127.0.0.1:0", what causes the kernel to assign
>     a  non-privileged  port  number  that  then  can be read with
>     getsockname(2). No other process can have a socket  with  the
>     same port number for the lifetime of the postmaster.
> 
>     If  the  socket  gets  ready, it'll read one backend message
>     from   it   with   recvfrom(2).   The   fromaddr   must    be
>     "127.0.0.1:xxx"  where  xxx  is  the  port  number the kernel
>     assigned to the above socket.  Yes,  this  is  his  own  one,
>     shared  with  postmaster  and  all  backends.  So  both,  the
>     postmaster and the backends can  use  this  one  UDP  socket,
>     which  the  backends  inherit on fork(2), to send messages to
>     the collector. If such  a  UDP  packet  really  came  from  a
>     process other than the postmaster or a backend, well then the
>     sysadmin has  a  more  severe  problem  than  manipulated  DB
>     runtime statistics :-)

Doing this is a bad idea:

a) it allows any program to start spamming localhost:randport with
messages and screw with the postmaster.

b) it may even allow remote people to mess with it (see recent
bugtraq articles about this).

You should use a unix domain socket (at least when possible).

>     Running  a 500MHz P-III, 192MB, RedHat 6.1 Linux 2.2.17 here,
>     I've been able to lose no single message during the parallel
>     regression  test,  if each backend sends one 1K sized message
>     per query executed, and the collector simply sucks  them  out
>     of  the  socket. Message losses start if the collector does a
>     per message idle loop like this:
> 
>         for (i=0,sum=0;i<250000;i++,sum+=1);
> 
>     Uh - not much time to spend if the statistics should at least
>     be  half  accurate. And it would become worse in SMP systems.
>     So that was a nifty idea, but I think it'd  cause  much  more
>     statistic losses than I assumed at first.
> 
>     Back to drawing board. Maybe a SYS-V message queue can serve?

I wouldn't say back to the drawing board, I would say two steps back.

What about instead of sending deltas, you send totals?  This would
allow you to lose messages and still maintain accurate stats.

You can also enable SIGIO on the socket, then have a signal handler
buffer packets that arrive when not actively select()ing on the
UDP socket.  You can then use sigsetmask(2) to provide mutual
exclusion with your SIGIO handler and general select()ing on the
socket.
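
A sketch of that SIGIO arrangement; the fcntl flags are the BSD/Linux
ones and the buffering is only hinted at in comments:

    #include <fcntl.h>
    #include <signal.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int stats_sock;      /* the collector's UDP socket */

    /* SIGIO handler: drain waiting datagrams into an application-level
     * queue so the kernel socket buffer never fills while the main
     * loop is busy crunching statistics. */
    static void
    sigio_handler(int signo)
    {
        char buf[1024];

        (void) signo;
        while (recv(stats_sock, buf, sizeof(buf), MSG_DONTWAIT) > 0)
        {
            /* append buf to a local queue here */
        }
    }

    /* Ask the kernel to deliver SIGIO when data arrives; the main
     * loop blocks the signal (sigprocmask/sigsetmask) while touching
     * the shared queue, giving the mutual exclusion described above. */
    static void
    enable_sigio(int sock)
    {
        signal(SIGIO, sigio_handler);
        fcntl(sock, F_SETOWN, getpid());
        fcntl(sock, F_SETFL, fcntl(sock, F_GETFL) | O_ASYNC);
    }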

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]



Re: Performance monitor signal handler

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
>     Uh - not much time to spend if the statistics should at least
>     be  half  accurate. And it would become worse in SMP systems.
>     So that was a nifty idea, but I think it'd  cause  much  more
>     statistic losses than I assumed at first.

>     Back to drawing board. Maybe a SYS-V message queue can serve?

That would be the same as a pipe: backends would block if the collector
stopped accepting data.  I do like the "auto discard" aspect of this
UDP-socket approach.

I think Philip had the right idea: each backend should send totals,
not deltas, in its messages.  Then, it doesn't matter (much) if the
collector loses some messages --- that just means that sometimes it
has a slightly out-of-date idea about how much work some backends have
done.  It should be easy to design the software so that that just makes
a small, transient error in the currently displayed statistics.
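
A sketch of why totals tolerate loss, with an invented collector-side
table:

    #include <stdint.h>

    /* One entry per backend slot.  Each message carries the backend's
     * running total, so the collector just keeps the newest value:
     * a lost or reordered packet only delays the update, it never
     * corrupts the count. */
    #define MAX_BACKENDS 128

    static uint64_t queries_total[MAX_BACKENDS];

    static void
    apply_stats_msg(int backend_slot, uint64_t reported_total)
    {
        if (reported_total > queries_total[backend_slot])
            queries_total[backend_slot] = reported_total;
    }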
        regards, tom lane


Re: Performance monitor signal handler

From
Philip Warner
Date:
At 17:10 15/03/01 -0800, Alfred Perlstein wrote:
>> 
>> Which is why the backends should not do anything other than maintain the
>> raw data. If there is atomic data that can cause inconsistency, then a
>> dropped UDP packet will do the same.
>
>The UDP packet (a COPY) can contain a consistent snapshot of the data.
>If you have dependencies, you fit a consistent snapshot into a single
>packet.

If we were going to go the shared memory way, then yes, as soon as we start
collecting dependent data we would need locking, but IOs, locking stats,
flushes, cache hits/misses are not really in this category.

But I prefer the UDP/Collector model anyway; it gives us greater
flexibility + the ability to keep stats past backend termination, and, as
you say, removes any possible locking requirements from the backends.



----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Performance monitor signal handler

From
Jan Wieck
Date:
Alfred Perlstein wrote:
> * Jan Wieck <JanWieck@yahoo.com> [010316 08:08] wrote:
> > Philip Warner wrote:
> > >
> > > But I prefer the UDP/Collector model anyway; it gives us greater
> > > flexibility + the ability to keep stats past backend termination, and, as
> > > you say, removes any possible locking requirements from the backends.
> >
> >     OK, did some tests...
> >
> >     The  postmaster can create a SOCK_DGRAM socket at startup and
> >     bind(2) it to "127.0.0.1:0", what causes the kernel to assign
> >     a  non-privileged  port  number  that  then  can be read with
> >     getsockname(2). No other process can have a socket  with  the
> >     same port number for the lifetime of the postmaster.
> >
> >     If  the  socket  gets  ready, it'll read one backend message
> >     from   it   with   recvfrom(2).   The   fromaddr   must    be
> >     "127.0.0.1:xxx"  where  xxx  is  the  port  number the kernel
> >     assigned to the above socket.  Yes,  this  is  his  own  one,
> >     shared  with  postmaster  and  all  backends.  So  both,  the
> >     postmaster and the backends can  use  this  one  UDP  socket,
> >     which  the  backends  inherit on fork(2), to send messages to
> >     the collector. If such  a  UDP  packet  really  came  from  a
> >     process other than the postmaster or a backend, well then the
> >     sysadmin has  a  more  severe  problem  than  manipulated  DB
> >     runtime statistics :-)
>
> Doing this is a bad idea:
>
> a) it allows any program to start spamming localhost:randport with
> messages and screw with the postmaster.
>
> b) it may even allow remote people to mess with it, (see recent
> bugtraq articles about this)
    So it's possible for a UDP socket to recvfrom(2) and get
    packets with a fromaddr localhost:my_own_non_SO_REUSE_port
    that really came from somewhere else?

    If that's possible, the packets must be coming over the
    network.  Otherwise it's the local superuser sending them, and
    in that case it's not worth any more discussion because root
    on your system has more powerful possibilities to muck around
    with your database. And if someone outside the local system
    is doing it, it's time for some filter rules, isn't it?
 

> You should use a unix domain socket (at least when possible).
   Unix domain UDP?

>
> >     Running  a 500MHz P-III, 192MB, RedHat 6.1 Linux 2.2.17 here,
> >     I've been able to lose no single message during the parallel
> >     regression  test,  if each backend sends one 1K sized message
> >     per query executed, and the collector simply sucks  them  out
> >     of  the  socket. Message losses start if the collector does a
> >     per message idle loop like this:
> >
> >         for (i=0,sum=0;i<250000;i++,sum+=1);
> >
> >     Uh - not much time to spend if the statistics should at least
> >     be  half  accurate. And it would become worse in SMP systems.
> >     So that was a nifty idea, but I think it'd  cause  much  more
> >     statistic losses than I assumed at first.
> >
> >     Back to drawing board. Maybe a SYS-V message queue can serve?
>
> I wouldn't say back to the drawing board, I would say two steps back.
>
> What about instead of sending deltas, you send totals?  This would
> allow you to lose messages and still maintain accurate stats.
    Similar problem as with shared memory - size.  If a long
    running backend of a multithousand table database needs to
    send access stats per table - and had accessed them all up to
    now - it'll be a lot of wasted bandwidth.
 

>
> You can also enable SIGIO on the socket, then have a signal handler
> buffer packets that arrive when not actively select()ing on the
> UDP socket.  You can then use sigsetmask(2) to provide mutual
> exclusion with your SIGIO handler and general select()ing on the
> socket.
    I already thought about prioritizing the socket-drain this
    way: there is a fairly big receive buffer. If the buffer is
    empty, it does a blocking select(2). If it's not, it does a
    non-blocking (0-timeout) one, and only if the non-blocking one
    tells that there aren't new messages waiting, it'll process
    one buffered message and try to receive again.

    Will give it a shot.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #






Re: Performance monitor signal handler

From
Jan Wieck
Date:
Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
> >     Uh - not much time to spend if the statistics should at least
> >     be  half  accurate. And it would become worse in SMP systems.
> >     So that was a nifty idea, but I think it'd  cause  much  more
> >     statistic losses than I assumed at first.
>
> >     Back to drawing board. Maybe a SYS-V message queue can serve?
>
> That would be the same as a pipe: backends would block if the collector
> stopped accepting data.  I do like the "auto discard" aspect of this
> UDP-socket approach.
    Does a pipe guarantee that a buffer, written with one atomic
    write(2), never can get intermixed with other data on the
    reader's end?  I know that you know what I mean, but for the
    broader audience: Let's define a message to the collector to
    be 4byte-len,len-bytes.  Now hundreds of backends hammer
    messages into the (shared) writing end of the pipe, all with
    different sizes.  Is it GUARANTEED that a
    read(4bytes),read(nbytes) sequence will always return one
    complete message and never intermixed parts of different
    write(2)s?

    With message queues, this is guaranteed. Also, message queues
    would make it easy to query the collected statistics (see
    below).
 

> I think Philip had the right idea: each backend should send totals,
> not deltas, in its messages.  Then, it doesn't matter (much) if the
> collector loses some messages --- that just means that sometimes it
> has a slightly out-of-date idea about how much work some backends have
> done.  It should be easy to design the software so that that just makes
> a small, transient error in the currently displayed statistics.
    If we use two message queues (IPC_PRIVATE is enough here),
    one into collector and one into backend direction, this'd be
    an easy way to collect and query statistics.

    The backends send delta stats messages to the collector on
    one queue. Message queues block, by default, but the backend
    could use IPC_NOWAIT and just go on and collect up, as long
    as it finally will use a blocking call before exiting. We'll
    lose statistics for backends that go down in flames
    (coredump), but who cares for statistics then?

    To query statistics, we have a set of new builtin functions.
    All functions share a global statistics snapshot in the
    backend. If on function call the snapshot doesn't exist or
    was generated by another XACT/command counter, the backend
    sends a statistics request for his database ID to the
    collector and waits for the messages to arrive on the second
    message queue. It can pick up the messages meant for him via
    message type, which is equal to his backend number + 1,
    because the collector will send 'em as such.  For table
    access stats, for example, the snapshot will have slots
    identified by the table's OID, so a function
    pg_get_tables_seqscan_count(oid) should be easy to implement.
    And setting up views that present access stats in readable
    format is a no-brainer.

    Now we have communication only between the backends and the
    collector.  And we're certain that only someone able to
    SELECT from a system view will ever see this information.
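
    A sketch of that mtype-based pickup; queue ids and the payload
    are invented:

        #include <sys/msg.h>
        #include <sys/types.h>

        /* The collector tags each answer with mtype = backend number
         * + 1, so a backend fetches only the messages meant for it
         * from the shared reply queue. */
        struct stats_reply
        {
            long mtype;
            char payload[512];
        };

        static ssize_t
        fetch_my_reply(int reply_qid, int backend_no,
                       struct stats_reply *out)
        {
            return msgrcv(reply_qid, out, sizeof(out->payload),
                          (long) backend_no + 1, 0);
        }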
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #





Re: Performance monitor signal handler

From
Alfred Perlstein
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [010316 10:06] wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
> >     Uh - not much time to spend if the statistics should at least
> >     be  half  accurate. And it would become worse in SMP systems.
> >     So that was a nifty idea, but I think it'd  cause  much  more
> >     statistic losses than I assumed at first.
> 
> >     Back to drawing board. Maybe a SYS-V message queue can serve?
> 
> That would be the same as a pipe: backends would block if the collector
> stopped accepting data.  I do like the "auto discard" aspect of this
> UDP-socket approach.
> 
> I think Philip had the right idea: each backend should send totals,
> not deltas, in its messages.  Then, it doesn't matter (much) if the
> collector loses some messages --- that just means that sometimes it
> has a slightly out-of-date idea about how much work some backends have
> done.  It should be easy to design the software so that that just makes
> a small, transient error in the currently displayed statistics.

MSGSND(3)              FreeBSD Library Functions Manual              MSGSND(3)

ERRORS
     msgsnd() will fail if:

     [EAGAIN]           There was no space for this message either on the
                        queue, or in the whole system, and IPC_NOWAIT was set
                        in msgflg.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]



Re: Performance monitor signal handler

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
>     Does  a pipe guarantee that a buffer, written with one atomic
>     write(2), never can get intermixed with  other  data  on  the
>     readers  end?

Yes.  The HPUX man page for write(2) sez:

     o  Write requests of {PIPE_BUF} bytes or less will not be
        interleaved with data from other processes doing writes on the
        same pipe.  Writes of greater than {PIPE_BUF} bytes may have
        data interleaved, on arbitrary boundaries, with writes by
        other processes, whether or not the O_NONBLOCK flag of the
        file status flags is set.

Stevens' _UNIX Network Programming_ (1990) states this is true for all
pipes (nameless or named) on all flavors of Unix, and furthermore states
that PIPE_BUF is at least 4K on all systems.  I don't have any relevant
Posix standards to look at, but I'm not worried about assuming this to
be true.

>     With message queues, this is guaranteed. Also, message queues
>     would  make  it  easy  to query the collected statistics (see
>     below).

I will STRONGLY object to any proposal that we use message queues.
We've already had enough problems with the ridiculously low kernel
limits that are commonly imposed on shmem and SysV semaphores.
We don't need to buy into that silliness yet again with message queues.
I don't believe they gain us anything over pipes anyway.

The real problem with either pipes or message queues is that backends
will block if the collector stops collecting data.  I don't think we
want that.  I suppose we could have the backends write a pipe with
O_NONBLOCK and ignore failure, however:
     o  If the O_NONBLOCK flag is set, write() requests will be
        handled differently, in the following ways:

        -  The write() function will not block the process.

        -  A write request for {PIPE_BUF} or fewer bytes will have
           the following effect:  If there is sufficient space
           available in the pipe, write() will transfer all the data
           and return the number of bytes requested.  Otherwise,
           write() will transfer no data and return -1 with errno set
           to EAGAIN.

Since we already ignore SIGPIPE, we don't need to worry about losing the
collector entirely.
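
A sketch of that write-and-ignore-EAGAIN discipline; pipe_fd is assumed
to be in O_NONBLOCK mode and the record layout is up to the caller:

    #include <errno.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Backend side: a record of PIPE_BUF bytes or less is written
     * atomically or not at all, so we either deliver a whole message
     * or drop this report rather than block behind a slow collector. */
    static void
    report_stats(int pipe_fd, const void *msg, size_t len)
    {
        if (write(pipe_fd, msg, len) < 0 && errno == EAGAIN)
        {
            /* collector is behind; drop the report and carry on */
        }
    }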

Now this would put a pretty tight time constraint on the collector:
fall more than 4K behind, you start losing data.  I am not sure if
a UDP socket would provide more buffering or not; anyone know?
        regards, tom lane


Re: Performance monitor signal handler

From
Jan Wieck
Date:
Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
> >     Does  a pipe guarantee that a buffer, written with one atomic
> >     write(2), never can get intermixed with  other  data  on  the
> >     readers  end?
>
> Yes.  The HPUX man page for write(2) sez:
>
>           o  Write requests of {PIPE_BUF} bytes or less will not be
>              interleaved with data from other processes doing writes on the
>              same pipe.  Writes of greater than {PIPE_BUF} bytes may have
>              data interleaved, on arbitrary boundaries, with writes by
>              other processes, whether or not the O_NONBLOCK flag of the
>              file status flags is set.
>
> Stevens' _UNIX Network Programming_ (1990) states this is true for all
> pipes (nameless or named) on all flavors of Unix, and furthermore states
> that PIPE_BUF is at least 4K on all systems.  I don't have any relevant
> Posix standards to look at, but I'm not worried about assuming this to
> be true.
   That's good news - and maybe a Good Assumption (TM).

> >     With message queues, this is guaranteed. Also, message queues
> >     would  make  it  easy  to query the collected statistics (see
> >     below).
>
> I will STRONGLY object to any proposal that we use message queues.
> We've already had enough problems with the ridiculously low kernel
> limits that are commonly imposed on shmem and SysV semaphores.
> We don't need to buy into that silliness yet again with message queues.
> I don't believe they gain us anything over pipes anyway.
  OK.

> The real problem with either pipes or message queues is that backends
> will block if the collector stops collecting data.  I don't think we
> want that.  I suppose we could have the backends write a pipe with
> O_NONBLOCK and ignore failure, however:
>
>           o  If the O_NONBLOCK flag is set, write() requests will  be
>              handled differently, in the following ways:
>
>              -  The write() function will not block the process.
>
>              -  A write request for {PIPE_BUF} or fewer bytes  will have
>                 the following effect:  If there is sufficient space
>                 available in the pipe, write() will transfer all the data
>                 and return the number of bytes  requested.  Otherwise,
>                 write() will transfer no data and return -1 with errno set
>                 to EAGAIN.
>
> Since we already ignore SIGPIPE, we don't need to worry about losing the
> collector entirely.
    That's not what the manpage said. It said that in the case
    you're inside PIPE_BUF size and using O_NONBLOCK, you either
    send complete messages or nothing, getting an EAGAIN then.

    So we could do the same here and write to the pipe. In the
    case we cannot, just count up and try again next year (or
    so).
 

>
> Now this would put a pretty tight time constraint on the collector:
> fall more than 4K behind, you start losing data.  I am not sure if
> a UDP socket would provide more buffering or not; anyone know?
    Again, this ain't what the manpage said. That there's
    sufficient space available in the pipe, in combination with
    PIPE_BUF being at least 4K, doesn't necessarily mean that
    the pipe's buffer space is 4K.

    Well, what I'm missing is the ability to filter out
    statistics reports on the backend side via msgrcv(2)'s msgtype
    :-(
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #






Re: Performance monitor signal handler

From
Jan Wieck
Date:
Tom Lane wrote:
> Now this would put a pretty tight time constraint on the collector:
> fall more than 4K behind, you start losing data.  I am not sure if
> a UDP socket would provide more buffering or not; anyone know?
    Looks like Linux has something around 16-32K of buffer space
    for UDP sockets. Just from eyeballing the fprintf(3) output
    of my destructively hacked postleprechaun.
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #






Re: Performance monitor signal handler

From
Jan Wieck
Date:
Jan Wieck wrote:
> Tom Lane wrote:
> > Now this would put a pretty tight time constraint on the collector:
> > fall more than 4K behind, you start losing data.  I am not sure if
> > a UDP socket would provide more buffering or not; anyone know?
>
>     Looks  like Linux has something around 16-32K of buffer space
>     for UDP sockets. Just from eyeballing the  fprintf(3)  output
>     of my destructively hacked postleprechaun.

    Just  to  get  some  evidence  at hand - could some owners of
    different platforms compile and run  the  attached  little  C
    source please?

    (The  program  tests how much data can be stuffed into a pipe
    or a Sys-V message queue before the writer would block or get
    an EAGAIN error).

    My output on RedHat6.1 Linux 2.2.17 is:

        Pipe buffer is 4096 bytes
        Sys-V message queue buffer is 16384 bytes

    Seems Tom is (unfortunately) right. The pipe blocks at 4K.

    So a Sys-V message queue, with the ability to distribute
    messages from the collector to individual backends with
    kernel support via "mtype", is four times better here, at
    some as yet unestimated cost in complexity.  What does your
    system say?

    I really never thought that Sys-V IPC is a good way to go at
    all.  I hate its incompatibility to the select(2) system
    call and all these OS/installation dependent restrictions.
    But I'm tempted to reevaluate it "for this case".
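
    The attachment itself isn't reproduced in this archive; a rough
    reconstruction of such a probe, measuring in 64-byte steps, might
    look like this:

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/ipc.h>
        #include <sys/msg.h>
        #include <sys/types.h>
        #include <unistd.h>

        /* How many bytes fit into a pipe before a writer blocks? */
        static int
        pipe_capacity(void)
        {
            int  fds[2], total = 0;
            char buf[64] = {0};

            pipe(fds);
            fcntl(fds[1], F_SETFL, O_NONBLOCK);
            while (write(fds[1], buf, sizeof(buf)) > 0)
                total += sizeof(buf);
            close(fds[0]);
            close(fds[1]);
            return total;
        }

        /* Same question for a Sys-V message queue. */
        static int
        msgq_capacity(void)
        {
            struct { long mtype; char mtext[64]; } msg = {1, {0}};
            int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
            int total = 0;

            while (msgsnd(qid, &msg, sizeof(msg.mtext), IPC_NOWAIT) == 0)
                total += sizeof(msg.mtext);
            msgctl(qid, IPC_RMID, NULL);
            return total;
        }

        int
        main(void)
        {
            printf("Pipe buffer is %d bytes\n", pipe_capacity());
            printf("Sys-V message queue buffer is %d bytes\n",
                   msgq_capacity());
            return 0;
        }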


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Attachment

Re: Performance monitor signal handler

From
Tom Lane
Date:
Jan Wieck <JanWieck@yahoo.com> writes:
>     Just  to  get  some  evidence  at hand - could some owners of
>     different platforms compile and run  the  attached  little  C
>     source please?

HPUX 10.20:

Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes
        regards, tom lane


Re: Performance monitor signal handler

From
Giles Lean
Date:
>     Just  to  get  some  evidence  at hand - could some owners of
>     different platforms compile and run  the  attached  little  C
>     source please?

$ uname -srm
FreeBSD 4.1.1-STABLE
$ ./jan
Pipe buffer is 16384 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.5 alpha
$ ./jan
Pipe buffer is 4096 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.5_BETA2 i386
$ ./jan
Pipe buffer is 4096 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.4.2 i386
$ ./jan
Pipe buffer is 4096 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.4.1 sparc
$ ./jan
Pipe buffer is 4096 bytes
Bad system call (core dumped)    # no SysV IPC in running kernel

$ uname -srm
HP-UX B.11.11 9000/800
$ ./jan
Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes

$ uname -srm
HP-UX B.11.00 9000/813
$ ./jan
Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes

$ uname -srm
HP-UX B.10.20 9000/871
$ ./jan
Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes

HP-UX can also use STREAMS based pipes if the kernel parameter
streampipes is set.  Using STREAMS based pipes increases the pipe
buffer size by a lot:

# uname -srm 
HP-UX B.11.11 9000/800
# ./jan
Pipe buffer is 131072 bytes
Sys-V message queue buffer is 16384 bytes

# uname -srm
HP-UX B.11.00 9000/800
# ./jan
Pipe buffer is 131072 bytes
Sys-V message queue buffer is 16384 bytes

Regards,

Giles


Re: Performance monitor signal handler

From
Larry Rosenman
Date:
* Jan Wieck <JanWieck@Yahoo.com> [010316 16:35]:
> Jan Wieck wrote:
> > Tom Lane wrote:
> > > Now this would put a pretty tight time constraint on the collector:
> > > fall more than 4K behind, you start losing data.  I am not sure if
> > > a UDP socket would provide more buffering or not; anyone know?
> >
> >     Looks  like Linux has something around 16-32K of buffer space
> >     for UDP sockets. Just from eyeballing the  fprintf(3)  output
> >     of my destructively hacked postleprechaun.
> 
>     Just  to  get  some  evidence  at hand - could some owners of
>     different platforms compile and run  the  attached  little  C
>     source please?
> 
>     (The  program  tests how much data can be stuffed into a pipe
>     or a Sys-V message queue before the writer would block or get
>     an EAGAIN error).
> 
>     My output on RedHat6.1 Linux 2.2.17 is:
> 
>         Pipe buffer is 4096 bytes
>         Sys-V message queue buffer is 16384 bytes
> 
>     Seems Tom is (unfortunately) right. The pipe blocks at 4K.
> 
>     So  a  Sys-V  message  queue,  with the ability to distribute
>     messages from  the  collector  to  individual  backends  with
>     kernel support via "mtype", is four times better here, at
>     the cost of some unestimated complexity.  What does your
>     system say?
> 
>     I really never thought that Sys-V IPC is a good way to go
>     at all.  I hate its incompatibility with the select(2)
>     system call and all these OS/installation-dependent
>     restrictions.  But I'm tempted to reevaluate it "for this
>     case".
> 
> 
> Jan
$ ./queuetest
Pipe buffer is 32768 bytes
Sys-V message queue buffer is 4096 bytes
$ uname -a
UnixWare lerami 5 7.1.1 i386 x86at SCO UNIX_SVR5
$ 

I think some of these are configurable...

LER

-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 972-414-9812                 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749


Re: Performance monitor signal handler

From
Larry Rosenman
Date:
* Larry Rosenman <ler@lerctr.org> [010316 20:47]:
> * Jan Wieck <JanWieck@Yahoo.com> [010316 16:35]:
> $ ./queuetest
> Pipe buffer is 32768 bytes
> Sys-V message queue buffer is 4096 bytes
> $ uname -a
> UnixWare lerami 5 7.1.1 i386 x86at SCO UNIX_SVR5
> $ 
> 
> I think some of these are configurable...
They both are.  FIFOBLKSIZE and MSGMNB or some such kernel tunable.

I can get more info if you need it.

LER

-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 972-414-9812                 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749


Re: Performance monitor signal handler

From
Philip Warner
Date:
At 13:49 16/03/01 -0500, Jan Wieck wrote:
>
>    Similar problem as with shared  memory  -  size.  If  a  long
>    running  backend  of  a multithousand table database needs to
>    send access stats per table - and had accessed them all up to
> >    now - it'll be a lot of wasted bandwidth.

Not if you only send totals for individual counters when they change; some
stats may never be resynced, but for the most part it will work. Also, does
Unix allow interrupts to occur as a result of data arriving in a pipe (see
the sketch after this list)? If so, how about:

- All backends to do *blocking* IO to collector.

- Collector to receive an interrupt when a message arrives; while in the
interrupt it reads the buffer into a local queue, and returns from the
interrupt.

- Main line code processes the queue and writes it to a memory mapped file
for durability.

- If collector dies, postmaster starts another immediately, which clears
the backlog of data in the pipe and then remaps the file.

- Each backend has its own local copy of its counters, which the collector
can *possibly* ask for when it restarts.
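
To the interrupt question above: many Unixes can deliver SIGIO when data
arrives on a descriptor, via fcntl() with F_SETOWN and O_ASYNC; whether
that works for pipes is platform-dependent, so treat this as a sketch of
the idea rather than a portable answer.  The collector side might look
like:

    /*
     * Sketch only: arrange for SIGIO on data arrival, then drain the
     * descriptor from the main line.  O_ASYNC on pipes is assumed to
     * work here, which is not true everywhere.
     */
    #include <unistd.h>
    #include <fcntl.h>
    #include <signal.h>

    static volatile sig_atomic_t got_data = 0;

    static void
    sigio_handler(int sig)
    {
        (void) sig;
        got_data = 1;           /* only set a flag inside the handler */
    }

    int
    main(void)
    {
        int     pfd[2];
        char    buf[512];

        pipe(pfd);
        signal(SIGIO, sigio_handler);
        fcntl(pfd[0], F_SETOWN, getpid());      /* deliver SIGIO to us */
        fcntl(pfd[0], F_SETFL, O_ASYNC | O_NONBLOCK);

        for (;;)
        {
            pause();            /* sleep until some signal arrives */
            if (got_data)
            {
                got_data = 0;
                /* main line: move the pipe contents to a local queue */
                while (read(pfd[0], buf, sizeof(buf)) > 0)
                    ;
            }
        }
    }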




----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Performance monitor signal handler

From
Jan Wieck
Date:
Philip Warner wrote:
> At 13:49 16/03/01 -0500, Jan Wieck wrote:
> >
> >    Similar problem as with shared  memory  -  size.  If  a  long
> >    running  backend  of  a multithousand table database needs to
> >    send access stats per table - and had accessed them all up to
> >    now - it'll be a lot of wasted bandwidth.
>
> Not if you only send totals for individual counters when they change; some
> stats may never be resynced, but for the most part it will work. Also, does
> Unix allow interrupts to occur as a result of data arriving in a pipe? If
> so, how about:
>
> - All backends to do *blocking* IO to collector.
    The general problem remains. We only have one central
    collector with a limited receive capacity. The more load is
    on the machine, the smaller its capacity gets. The more
    complex the DB schemas get and the more load is on the
    system, the more interesting accurate statistics get. Both
    factors are counterproductive. More complex schema means more
    tables and thus bigger messages. More load means more
    messages. Having good statistics on a toy system while they
    get worse for a web backend server that's really under
    pressure is braindead from the start.

    We don't want the backends to block, so that they can do
    THEIR work. That's to process queries, nothing else.

    Pipes seem to be inappropriate because their buffer is
    limited to 4K on Linux and most BSD flavours. Message queues
    are too because they are limited to 2K on most BSDs. So only
    sockets remain.

    If we have multiple processes that try to receive from the
    UDP socket, condense the received packets into summary
    messages and send them to the central collector, this might
    solve the problem.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #






Re: Performance monitor signal handler

From
Samuel Sieb
Date:
On Sat, Mar 17, 2001 at 09:33:03AM -0500, Jan Wieck wrote:
> 
>     The  general  problem  remains.  We  only  have  one  central
>     collector with a limited receive capacity.  The more load  is
>     on  the  machine,  the  smaller its capacity gets.  The more
>     complex the DB schemas get  and  the  more  load  is  on  the
>     system,  the  more interesting accurate statistics get.  Both
>     factors are counterproductive. More complex schema means  more
>     tables  and  thus  bigger  messages.  More  load  means  more
>     messages.  Having good statistics on a toy system while  they
>     get  worse  for  a  web  backend  server  that's really under
>     pressure is braindead from the start.
> 
Just as another suggestion, what about sending the data to a different
computer, so instead of tying up the database server with processing the
statistics, you have another computer that has some free time to do the
processing.

Some drawbacks are that you can't automatically start/restart it from the
postmaster and it will put a little more load on the network, but it seems
to mostly solve the issues of blocked pipes and using too much cpu time
on the database server.



Re: Performance monitor signal handler

From
Tom Lane
Date:
Samuel Sieb <samuel@sieb.net> writes:
> Just as another suggestion, what about sending the data to a different
> computer, so instead of tying up the database server with processing the
> statistics, you have another computer that has some free time to do the
> processing.

> Some drawbacks are that you can't automatically start/restart it from the
> postmaster and it will put a little more load on the network,

... and a lot more load on the CPU.  Same-machine "network" connections
are much cheaper (on most kernels, anyway) than real network
connections.

I think all of this discussion is vast overkill.  No one has yet
demonstrated that it's not sufficient to have *one* collector process
and a lossy transmission method.  Let's try that first, and if it really
proves to be unworkable then we can get out the lily-gilding equipment.
But there is tons more stuff to do before we have useful stats at all,
and I don't think that this aspect is the most critical part of the
problem.
        regards, tom lane


Re: Performance monitor signal handler

From
Bruce Momjian
Date:
> ... and a lot more load on the CPU.  Same-machine "network" connections
> are much cheaper (on most kernels, anyway) than real network
> connections.
> 
> I think all of this discussion is vast overkill.  No one has yet
> demonstrated that it's not sufficient to have *one* collector process
> and a lossy transmission method.  Let's try that first, and if it really
> proves to be unworkable then we can get out the lily-gilding equipment.
> But there is tons more stuff to do before we have useful stats at all,
> and I don't think that this aspect is the most critical part of the
> problem.

Agreed.  Sounds like overkill.

How about a per-backend shared memory area for stats, plus a global
shared memory area that each backend can add to when it exits?  That
solves most of our problem.

The only open issue is per-table stuff, and I would like to see some
circular buffer implemented to handle that, with a collection process
that has access to shared memory.  Even better, have an SQL table
updated with the per-table stats periodically.  How about a collector
process that periodically reads through the shared memory and UPDATEs
SQL tables with the information.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Performance monitor signal handler

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> The only open issue is per-table stuff, and I would like to see some
> circular buffer implemented to handle that, with a collection process
> that has access to shared memory.

That will get us into locking/contention issues.  OTOH, frequent trips
to the kernel to send stats messages --- regardless of the transport
mechanism chosen --- don't seem all that cheap either.

> Even better, have an SQL table updated with the per-table stats
> periodically.

That will be horribly expensive, if it's a real table.

I think you missed the point that somebody made a little while ago
about waiting for functions that can return tuple sets.  Once we have
that, the stats tables can be *virtual* tables, ie tables that are
computed on-demand by some function.  That will be a lot less overhead
than physically updating an actual table.
        regards, tom lane


Re: Performance monitor signal handler

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > The only open issue is per-table stuff, and I would like to see some
> > circular buffer implemented to handle that, with a collection process
> > that has access to shared memory.
> 
> That will get us into locking/contention issues.  OTOH, frequent trips
> to the kernel to send stats messages --- regardless of the transport
> mechanism chosen --- don't seem all that cheap either.

I am confused.  Reading/writing shared memory is not a kernel call,
right?

I agree on the locking contention problems of a circular buffer.

> 
> > Even better, have an SQL table updated with the per-table stats
> > periodically.
> 
> That will be horribly expensive, if it's a real table.

But per-table stats aren't something that people will look at often,
right?  They can sit in the collector's memory for quite a while.  I see
people wanting to look at per-backend stuff frequently, and that is why
I thought shared memory would be good, plus a global area for aggregate
stats for all backends.

> I think you missed the point that somebody made a little while ago
> about waiting for functions that can return tuple sets.  Once we have
> that, the stats tables can be *virtual* tables, ie tables that are
> computed on-demand by some function.  That will be a lot less overhead
> than physically updating an actual table.

Yes, but do we want to keep these stats between postmaster restarts? 
And what about writing them to tables when our storage of table stats
gets too big?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Performance monitor signal handler

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Even better, have an SQL table updated with the per-table stats
> periodically.
>> 
>> That will be horribly expensive, if it's a real table.

> But per-table stats aren't something that people will look at often,
> right?  They can sit in the collector's memory for quite a while.  I see
> people wanting to look at per-backend stuff frequently, and that is why
> I thought shared memory would be good, plus a global area for aggregate
> stats for all backends.

>> I think you missed the point that somebody made a little while ago
>> about waiting for functions that can return tuple sets.  Once we have
>> that, the stats tables can be *virtual* tables, ie tables that are
>> computed on-demand by some function.  That will be a lot less overhead
>> than physically updating an actual table.

> Yes, but do we want to keep these stats between postmaster restarts? 
> And what about writing them to tables when our storage of table stats
> gets too big?

All those points seem to me to be arguments in *favor* of a virtual-
table approach, not arguments against it.

Or are you confusing the method of collecting stats with the method
of making the collected stats available for use?
        regards, tom lane


Re: Performance monitor signal handler

From
Bruce Momjian
Date:
> > But per-table stats aren't something that people will look at often,
> > right?  They can sit in the collector's memory for quite a while.  I see
> > people wanting to look at per-backend stuff frequently, and that is why
> > I thought shared memory would be good, plus a global area for aggregate
> > stats for all backends.
> 
> >> I think you missed the point that somebody made a little while ago
> >> about waiting for functions that can return tuple sets.  Once we have
> >> that, the stats tables can be *virtual* tables, ie tables that are
> >> computed on-demand by some function.  That will be a lot less overhead
> >> than physically updating an actual table.
> 
> > Yes, but do we want to keep these stats between postmaster restarts? 
> > And what about writing them to tables when our storage of table stats
> > gets too big?
> 
> All those points seem to me to be arguments in *favor* of a virtual-
> table approach, not arguments against it.
> 
> Or are you confusing the method of collecting stats with the method
> of making the collected stats available for use?

Maybe I am confusing them.  I didn't see a distinction in the
discussion.

I assumed the UDP/message passing of information to the collector was
the way statistics were collected, and I don't understand why a
per-backend area and global area, with some kind of circular buffer for
per-table stuff isn't the cheapest, cleanest solution.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Performance monitor signal handler

From
Jan Wieck
Date:
Tom Lane wrote:
> Samuel Sieb <samuel@sieb.net> writes:
> > Just as another suggestion, what about sending the data to a different
> > computer, so instead of tying up the database server with processing the
> > statistics, you have another computer that has some free time to do the
> > processing.
>
> > Some drawbacks are that you can't automatically start/restart it from the
> > postmaster and it will put a little more load on the network,
>
> ... and a lot more load on the CPU.  Same-machine "network" connections
> are much cheaper (on most kernels, anyway) than real network
> connections.
>
> I think all of this discussion is vast overkill.  No one has yet
> demonstrated that it's not sufficient to have *one* collector process
> and a lossy transmission method.  Let's try that first, and if it really
> proves to be unworkable then we can get out the lily-gilding equipment.
> But there is tons more stuff to do before we have useful stats at all,
> and I don't think that this aspect is the most critical part of the
> problem.
    Well,

    back to my initial approach with the UDP socket collector. I
    now have a collector simply reading all messages from the
    socket. It doesn't do anything useful except for counting
    their number.

    Every backend sends a couple of 1K junk messages at the
    beginning of the main loop. Up to 16 messages, there is no
    time(1)-measurable delay in the execution of the "make
    runcheck".

    The dummy collector can keep up during the parallel
    regression test until the backends send 64 messages each
    time; at that number it lost 1.25% of the messages. That is
    an amount of statistics data of >256MB to be collected. Most
    of the test queries will never generate 1K of messages, so
    there should be some headroom here.

    My plan now is to add some real functionality to the
    collector and the backend, to see if that has an impact.
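
The dummy collector itself isn't shown in the mail; from the description
it is nothing more than a datagram socket drained in a loop.  A
reconstruction for illustration only (the loopback bind and the reporting
interval are invented details):

    /*
     * Reconstruction of the described dummy collector: receive UDP
     * datagrams and merely count them.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int
    main(void)
    {
        int                 sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in  addr;
        socklen_t           len = sizeof(addr);
        char                buf[1024];
        long                count = 0;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* local only */
        addr.sin_port = 0;                  /* let the kernel pick */
        bind(sock, (struct sockaddr *) &addr, sizeof(addr));
        getsockname(sock, (struct sockaddr *) &addr, &len);
        printf("collector on UDP port %d\n", ntohs(addr.sin_port));

        for (;;)
        {
            if (recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL) > 0)
                if (++count % 100000 == 0)
                    printf("%ld messages received\n", count);
        }
    }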


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #






Re: Performance monitor signal handler

From
Patrick Welche
Date:
On Fri, Mar 16, 2001 at 05:25:24PM -0500, Jan Wieck wrote:
> Jan Wieck wrote:
...
>     Just  to  get  some  evidence  at hand - could some owners of
>     different platforms compile and run  the  attached  little  C
>     source please?
... 
>     Seems Tom is (unfortunately) right. The pipe blocks at 4K.

On NetBSD-1.5S/i386 with just the highly conservative shmem defaults:

Pipe buffer is 4096 bytes
Sys-V message queue buffer is 2048 bytes

Cheers,

Patrick


Re: Performance monitor signal handler

From
Tom Lane
Date:
Jan Wieck <JanWieck@yahoo.com> writes:
>     Just  to  get  some  evidence  at hand - could some owners of
>     different platforms compile and run  the  attached  little  C
>     source please?
>     (The  program  tests how much data can be stuffed into a pipe
>     or a Sys-V message queue before the writer would block or get
>     an EAGAIN error).

One final followup on this --- I wasted a fair amount of time just
now trying to figure out why Perl 5.6.0 was silently hanging up
in its self-tests (at op/taint, which seems pretty unrelated...).

The upshot: Jan's test program had left a 16k SysV message queue
hanging about, and that queue was filling all available SysV message
space on my machine.  Seems Perl tries to test message-queue sending,
and it was patiently waiting for some message space to come free.

In short, the SysV message queue limits are so tiny that not only
are you quite likely to get bollixed up if you use messages, but
you're likely to bollix anything else that's using message queues too.
        regards, tom lane


Re: Performance monitor signal handler

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Only shared memory gives us near-zero cost for write/read.  99% of
> backends will not be using stats, so it has to be cheap.

Not with a circular buffer it's not cheap, because you need interlocking
on writes.  Your claim that you can get away without that is simply
false.  You won't just get lost messages, you'll get corrupted messages.

> The collector program can read the shared memory stats and keep hashed
> values of accumulated stats.  It uses the "Loops" variable to know if it
> has read the current information in the buffer.

And how does it sleep until the counter has been advanced?  Seems to me
it has to busy-wait (bad) or sleep (worse; if the minimum sleep delay
is 10 ms then it's guaranteed to miss a lot of data under load).
        regards, tom lane


Re: Performance monitor signal handler

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Only shared memory gives us near-zero cost for write/read.  99% of
> > backends will not be using stats, so it has to be cheap.
> 
> Not with a circular buffer it's not cheap, because you need interlocking
> on writes.  Your claim that you can get away without that is simply
> false.  You won't just get lost messages, you'll get corrupted messages.

How do I get corrupt messages if they are all five bytes?  If I write
five bytes, and another does the same, I guess the assembler could
intersperse the writes so the oid gets to be a corrupt value.  Any cheap
way around this, perhaps by skipping/clearing the write on a collision?

> 
> > The collector program can read the shared memory stats and keep hashed
> > values of accumulated stats.  It uses the "Loops" variable to know if it
> > has read the current information in the buffer.
> 
> And how does it sleep until the counter has been advanced?  Seems to me
> it has to busy-wait (bad) or sleep (worse; if the minimum sleep delay
> is 10 ms then it's guaranteed to miss a lot of data under load).

I figured it could just wake up every few seconds and check.  It will
remember the loop counter and current pointer, and read any new
information.  I was thinking of a 20k buffer, which could cover about 4k
events.

Should we think about doing these writes into an OS file, and only
enabling the writes when we know there is a collector reading them,
perhaps using a /tmp file to activate recording?  We could allocate
1MB and be sure not to miss anything, even with a circular setup.


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Performance monitor signal handler

From
Bruce Momjian
Date:
I have a new statistics collection proposal.

I suggest three shared memory areas:
    One per backend to hold the query string and other per-backend stats
    One global area to hold accumulated stats for all backends
    One global circular buffer to hold per-table/object stats
 

The circular buffer will look like:
(Loops) Start---------------------------End
                            |
                            current pointer

Loops is incremented every time the pointer reaches "end".

Each statistics record will have a length of five bytes made up of
oid(4) and action(1).  By having the same length for all statistics
records, we don't need to perform any locking of the buffer.  A backend
will grab the current pointer, add five to it, and write into the
reserved 5-byte area.  If two backends write at the same time, one
overwrites the other, but this is just statistics information, so it is
not a great loss.
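
A sketch of that ring under exactly the stated rules (fixed five-byte
records, no interlocking) may make the trade-off concrete; the names and
sizes here are illustrative only, and the deliberate lack of locking is
the point Tom attacks above:

    /*
     * Illustrative sketch of the proposed shared-memory ring: 5-byte
     * records (4-byte oid + 1-byte action) written without any lock.
     * Two backends racing here can clobber each other's record -- the
     * proposal accepts that as "just statistics".
     */
    #include <string.h>

    typedef unsigned int Oid;           /* stand-in for the real typedef */

    #define STAT_BUF_SIZE 20480         /* ~4k events, as proposed */
    #define STAT_REC_SIZE 5

    typedef struct StatBuffer
    {
        volatile unsigned int loops;    /* bumped on each wraparound */
        volatile unsigned int current;  /* next write offset */
        char        data[STAT_BUF_SIZE];
    } StatBuffer;                       /* lives in shared memory */

    static void
    stat_write(StatBuffer *sb, Oid oid, unsigned char action)
    {
        unsigned int pos = sb->current; /* unlocked read: races possible */

        if (pos + STAT_REC_SIZE > STAT_BUF_SIZE)
        {
            pos = 0;                    /* wrap around */
            sb->loops++;
        }
        sb->current = pos + STAT_REC_SIZE;

        /* unlocked write into the reserved 5-byte slot */
        memcpy(sb->data + pos, &oid, sizeof(oid));
        sb->data[pos + sizeof(oid)] = (char) action;
    }

    /* The collector remembers (loops, current) from its last visit and
     * reads anything written since, waking every few seconds. */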

Only shared memory gives us near-zero cost for write/read.  99% of
backends will not be using stats, so it has to be cheap.

The collector program can read the shared memory stats and keep hashed
values of accumulated stats.  It uses the "Loops" variable to know if it
has read the current information in the buffer.  When it receives a
signal, it can dump its stats to a file in standard COPY format of
<oid><tab><action><tab><count>.  It can also reset its counters with a
signal.

Comments?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Performance monitor signal handler

From
Jan Wieck
Date:
Bruce Momjian wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Only shared memory gives us near-zero cost for write/read.  99% of
> > > backends will not be using stats, so it has to be cheap.
> >
> > Not with a circular buffer it's not cheap, because you need interlocking
> > on writes.  Your claim that you can get away without that is simply
> > false.  You won't just get lost messages, you'll get corrupted messages.
>
> How do I get corrupt messages if they are all five bytes?  If I write
> five bytes, and another does the same, I guess the assembler could
> intersperse the writes so the oid gets to be a corrupt value.  Any cheap
> way around this, perhaps by skipping/clearing the write on a collision?
>
> >
> > > The collector program can read the shared memory stats and keep hashed
> > > values of accumulated stats.  It uses the "Loops" variable to know if it
> > > has read the current information in the buffer.
> >
> > And how does it sleep until the counter has been advanced?  Seems to me
> > it has to busy-wait (bad) or sleep (worse; if the minimum sleep delay
> > is 10 ms then it's guaranteed to miss a lot of data under load).
>
> I figured it could just wake up every few seconds and check.  It will
> remember the loop counter and current pointer, and read any new
> information.  I was thinking of a 20k buffer, which could cover about 4k
> events.
    Here I wonder what your EVENT is. With an Oid as identifier
    and a 1 byte (even if it'd be another 32-bit value), how many
    messages do you want to generate to get these statistics:

    -   Number of sequential scans done per table.
    -   Number of tuples returned via sequential scans per table.
    -   Number of buffer cache lookups done through sequential
        scans per table.
    -   Number of buffer cache hits for sequential scans per
        table.
    -   Number of tuples inserted per table.
    -   Number of tuples updated per table.
    -   Number of tuples deleted per table.
    -   Number of index scans done per index.
    -   Number of index tuples returned per index.
    -   Number of buffer cache lookups done due to scans per
        index.
    -   Number of buffer cache hits per index.
    -   Number of valid heap tuples returned via index scan per
        index.
    -   Number of buffer cache lookups done for heap fetches via
        index scan per index.
    -   Number of buffer cache hits for heap fetches via index
        scan per index.
    -   Number of buffer cache lookups not accountable for any of
        the above.
    -   Number of buffer cache hits not accountable for any of
        the above.

    What I see is that there's a difference in what we two want
    to see in the statistics. You're talking about looking at the
    actual query string and such. That's information useful for
    someone actually looking at a server, to see what a
    particular backend is doing. On my notebook a parallel
    regression test (containing >4,000 queries) passes by in
    under 1:30; that's more than 40 queries per second. So that
    doesn't tell me much.

    What I'm after is to collect the above data over a week or so
    and then generate a report to identify the hot spots of the
    schema. Which tables/indices cause the most disk I/O, what's
    the average percentage of tuples returned in scans (not from
    the query, I mean from the single scan inside of the joins).
    That's the information I need to know where to look for
    possibly better qualifications, useless indices that aren't
    worth maintaining, and the like.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #






Re: Performance monitor signal handler

From
Bruce Momjian
Date:
> > I figured it could just wake up every few seconds and check.  It will
> > remember the loop counter and current pointer, and read any new
> > information.  I was thinking of a 20k buffer, which could cover about 4k
> > events.
> 
>     Here  I  wonder what your EVENT is. With an Oid as identifier
>     and a 1 byte (even if it'd be another 32-bit value), how  many
>     messages do you want to generate to get these statistics:
> 
>     -   Number of sequential scans done per table.
>     -   Number of tuples returned via sequential scans per table.
>     -   Number of buffer cache lookups  done  through  sequential
>         scans per table.
>     -   Number  of  buffer  cache  hits  for sequential scans per
>         table.
>     -   Number of tuples inserted per table.
>     -   Number of tuples updated per table.
>     -   Number of tuples deleted per table.
>     -   Number of index scans done per index.
>     -   Number of index tuples returned per index.
>     -   Number of buffer cache lookups  done  due  to  scans  per
>         index.
>     -   Number of buffer cache hits per index.
>     -   Number  of  valid heap tuples returned via index scan per
>         index.
>     -   Number of buffer cache lookups done for heap fetches  via
>         index scan per index.
>     -   Number  of  buffer  cache hits for heap fetches via index
>         scan per index.
>     -   Number of buffer cache lookups not accountable for any of
>         the above.
>     -   Number  of  buffer  cache hits not accountable for any of
>         the above.
> 
>     What I see is that there's a difference in what we  two  want
>     to see in the statistics. You're talking about looking at the
>     actual querystring and such. That's  information  useful  for
>     someone   actually  looking  at  a  server,  to  see  what  a
>     particular backend  is  doing.  On  my  notebook  a  parallel
>     regression  test  (containing >4,000 queries) passes by under
>     1:30, that's more than 40 queries per second. So that doesn't
>     tell me much.
> 
>     What I'm after is to collect the above data over a week or so
>     and then generate a report to identify the hot spots  of  the
>     schema.  Which tables/indices cause the most disk I/O, what's
>     the average percentage of tuples returned in scans (not  from
>     the  query, I mean from the single scan inside of the joins).
>     That's the information I need  to  know  where  to  look  for
>     possibly  better  qualifications, useless indices that aren't
>     worth to maintain and the like.
> 

I was going to have the per-table stats insert a stat record every time
it does a sequential scan, so it would be [oid][sequential_scan_value],
and allow the collector to gather that and aggregate it.

I didn't think we wanted each backend to do the aggregation per oid. 
Seems expensive. Maybe we would need a count for things like "number of
rows returned" so it would be [oid][stat_type][value].

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Performance monitor signal handler

From
Bruce Momjian
Date:
I have talked to Jan over the phone, and he has convinced me that UDP is
the proper way to communicate stats to the collector, rather than my
shared memory idea.

The advantages of his UDP approach are that the collector can sleep on
the UDP socket rather than having to poll the shared memory area, and
that it has the auto-discard option.  He will make logging configurable
on a per-database level, so it can be turned off when not in use.

He has a trial UDP implementation that he will post soon.  Also, I asked
him to try DGRAM Unix-domain sockets for performance reasons.  My Stevens
book says they should be supported.  He can put the socket file in /data.



> > > I figured it could just wake up every few seconds and check.  It will
> > > remember the loop counter and current pointer, and read any new
> > > information.  I was thinking of a 20k buffer, which could cover about 4k
> > > events.
> > 
> >     Here  I  wonder what your EVENT is. With an Oid as identifier
> >     and a 1 byte (even if it'd be another 32-bit value), how  many
> >     messages do you want to generate to get these statistics:
> > 
> >     -   Number of sequential scans done per table.
> >     -   Number of tuples returned via sequential scans per table.
> >     -   Number of buffer cache lookups  done  through  sequential
> >         scans per table.
> >     -   Number  of  buffer  cache  hits  for sequential scans per
> >         table.
> >     -   Number of tuples inserted per table.
> >     -   Number of tuples updated per table.
> >     -   Number of tuples deleted per table.
> >     -   Number of index scans done per index.
> >     -   Number of index tuples returned per index.
> >     -   Number of buffer cache lookups  done  due  to  scans  per
> >         index.
> >     -   Number of buffer cache hits per index.
> >     -   Number  of  valid heap tuples returned via index scan per
> >         index.
> >     -   Number of buffer cache lookups done for heap fetches  via
> >         index scan per index.
> >     -   Number  of  buffer  cache hits for heap fetches via index
> >         scan per index.
> >     -   Number of buffer cache lookups not accountable for any of
> >         the above.
> >     -   Number  of  buffer  cache hits not accountable for any of
> >         the above.
> > 
> >     What I see is that there's a difference in what we  two  want
> >     to see in the statistics. You're talking about looking at the
> >     actual querystring and such. That's  information  useful  for
> >     someone   actually  looking  at  a  server,  to  see  what  a
> >     particular backend  is  doing.  On  my  notebook  a  parallel
> >     regression  test  (containing >4,000 queries) passes by under
> >     1:30, that's more than 40 queries per second. So that doesn't
> >     tell me much.
> > 
> >     What I'm after is to collect the above data over a week or so
> >     and then generate a report to identify the hot spots  of  the
> >     schema.  Which tables/indices cause the most disk I/O, what's
> >     the average percentage of tuples returned in scans (not  from
> >     the  query, I mean from the single scan inside of the joins).
> >     That's the information I need  to  know  where  to  look  for
> >     possibly  better  qualifications, useless indices that aren't
> >     worth to maintain and the like.
> > 
> 
> I was going to have the per-table stats insert a stat record every time
> it does a sequential scan, so it sould be [oid][sequential_scan_value]
> and allow the collector to gather that and aggregate it.
> 
> I didn't think we wanted each backend to do the aggregation per oid. 
> Seems expensive. Maybe we would need a count for things like "number of
> rows returned" so it would be [oid][stat_type][value].
> 
> -- 
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> 


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Performance monitor signal handler

From
Jan Wieck
Date:
Bruce Momjian wrote:
> I have talked to Jan over the phone, and he has convinced me that UDP is
> the proper way to communicate stats to the collector, rather than my
> shared memory idea.
>
> The advantages of his UDP approach are that the collector can sleep on
> the UDP socket rather than having to poll the shared memory area, and
> that it has the auto-discard option.  He will make logging configurable
> on a per-database level, so it can be turned off when not in use.
>
> He has a trial UDP implementation that he will post soon.  Also, I asked
> him to try DGRAM Unix-domain sockets for performance reasons.  My Stevens
> book says they should be supported.  He can put the socket file in /data.

"Trial" implementation attached :-)

    First attachment is a patch for various backend files, plus
    it generates two new source files. If your patch(1) doesn't
    put 'em there automatically, they go to src/include/pgstat.h
    and src/backend/postmaster/pgstat.c.

    BTW:  tgl  on  2/99  was  right,  the  hash_destroy()  really
    crashes.  Maybe  we  want  to  pull  out  the  fix  I've done
    (includes some new feature for hash table memory  allocation)
    and apply that to 7.1?

    Second attachment is a tarfile that should unpack to
    contrib/pgstat_tmp.  I've placed the SQL-level functions into
    a shared module for now.  The SQL script also creates a
    couple of views.

    -   pgstat_all_tables shows scan- and tuple-based statistics
        for all tables.  pgstat_sys_tables and pgstat_user_tables
        filter out (you guess what) system or user tables.

    -   pgstatio_all_tables,       pgstatio_sys_tables        and
        pgstatio_user_tables   show   buffer  IO  statistics  for
        tables.

    -   pgstat_*_indexes and pgstatio_*_indexes are similar to
        the above, except that they give detailed info about each
        single index.

    -   pgstatio_*_sequences shows buffer IO statistics about -
        right, sequences.  Since sequences aren't scanned
        regularly, they have no scan- and tuple-related view.

    -   pgstat_activity shows information about all currently
        running backends of the entire instance.  The underlying
        function for displaying the actual query always returns
        NULL for non-superusers.

    -   pgstat_database shows transaction commit/abort counts and
        cumulated  buffer  IO   statistics   for   all   existing
        databases.

    The collector frequently writes a file, data/pgstat.stat
    (approx. every 500 milliseconds, as long as there is
    something to tell; nothing is done if the entire installation
    sleeps).  It also reads this file on startup, so collected
    statistics survive postmaster restarts.

    TODO:

    -   Are PF_UNIX SOCK_DGRAM sockets supported on all the
        platforms we do?  If not, what's wrong with the current
        implementation?  (A quick test sketch follows below.)

    -   There  is  no way yet to tell the collector about objects
        (relations and  databases)  removed  from  the  database.
        Basically  that  could be done with messages too, but who
        will send them and how can we guarantee that  they'll  be
        generated  even if somebody never queries the statistics?
        Thus, the current collector will grow, and grow, and grow
        until   you   remove   the  pgstat.stat  file  while  the
        postmaster is down.

    -   Also there aren't functions or  messages  implemented  to
        explicitly reset statistics.

    -   Possible additions would be to remember when the backends
        started and collect resource usage (rstat(2)) information
        as well.

    -   The   entire  thing  needs  an  additional  attribute  in
        pg_database that tells the  backends  what  to  tell  the
        collector at all. Just to make them quiet again.

    So far for an actual snapshot. Comments?
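
On the first TODO item, a platform can be probed with a few lines:
create a PF_UNIX SOCK_DGRAM pair and push one datagram through it.  A
quick test sketch (no socket file is needed when socketpair() is used):

    /*
     * Quick portability probe for PF_UNIX SOCK_DGRAM; sketch only.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>

    int
    main(void)
    {
        int     sv[2];
        char    buf[32];

        if (socketpair(PF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        {
            perror("socketpair(PF_UNIX, SOCK_DGRAM)");
            return 1;
        }
        write(sv[1], "ping", 5);
        read(sv[0], buf, sizeof(buf));
        printf("PF_UNIX SOCK_DGRAM works: got \"%s\"\n", buf);
        close(sv[0]);
        close(sv[1]);
        return 0;
    }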


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Attachment