Thread: Rethinking stats communication mechanisms

Rethinking stats communication mechanisms

From
Tom Lane
Date:
In view of my oprofile results
http://archives.postgresql.org/pgsql-hackers/2006-06/msg00859.php
I'm thinking we need some major surgery on the way that the stats
collection mechanism works.

It strikes me that we are using a single communication mechanism to
handle what are really two distinct kinds of data:

* Current-state information, eg, what backends are alive and what
commands they are currently working on.  Ideally we'd like this type of
info to be 100% up-to-date.  But once a particular bit of information
(eg a command string) is obsolete, it's not of interest anymore.

* Event counts.  These accumulate and so past information is still
important.  On the other hand, it's not so critical that the info be
completely up-to-date --- the central counters can lag behind a bit,
so long as events eventually get counted.

I believe the stats code was designed with the second case in mind,
but we've abused it to handle the first case, and that's why we've
now got performance problems.

If we are willing to assume that the current-state information is of
fixed maximum size, we could store it in shared memory.  (This
suggestion already came up in the recent thread about ps_status,
and I think it's been mentioned before too --- but my point here is
that we have to separate this case from the event-counting case.)
The only real restriction we'd be making is that we can only show the
first N characters of the current command string, but we're already
accepting that limitation in the existing stats code.  (And we could
make N whatever we wanted, without worrying about UDP datagram limits.)
I'm envisioning either adding fields to the PGPROC array, or perhaps
better using a separate array with an entry for each backend ID.
Backends would write their status info into this array and any
interested backend could read it out again.  The stats collector
process needn't be involved at all AFAICS.  This eliminates any
process-dispatch overhead to report command start or command
termination.  Instead we'd have some locking overhead, but contention
ought to be low enough that that's not a serious problem.  I'm assuming
a separate lock for each array entry so that backends don't contend with
each other to update their entries; contention occurs only when someone
is actively reading the information.  We should probably use LWLocks not
spinlocks because the time taken to copy a long command string into the
shared area would be longer than we ought to hold a spinlock (but this
seems a bit debatable given the expected low contention ... any
thoughts?)
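
As a purely illustrative sketch of that scheme (names invented here, and a
POSIX process-shared mutex standing in for whatever per-entry lock type we
would actually use), each backend owns one slot in a shared array and the
lock is held only while writing or copying a slot:

    /* Illustrative sketch only; not actual PostgreSQL code. */
    #include <string.h>
    #include <sys/types.h>
    #include <time.h>
    #include <unistd.h>
    #include <pthread.h>

    #define MAX_BACKENDS     100
    #define COMMAND_STR_LEN  1024     /* the "first N characters" limit */

    /* One slot per backend ID, all slots living in shared memory.  The
     * mutex must be initialized with PTHREAD_PROCESS_SHARED at startup. */
    typedef struct BackendStatusSlot
    {
        pthread_mutex_t lock;          /* stand-in for a per-entry LWLock */
        pid_t           pid;           /* 0 means the slot is unused */
        time_t          command_start;
        char            command[COMMAND_STR_LEN];
    } BackendStatusSlot;

    extern BackendStatusSlot *statusArray;   /* points into shared memory */

    /* Writer side: a backend only ever updates its own slot, so the lock
     * is contended only when somebody is reading the entry. */
    void
    report_command_start(int my_backend_id, const char *query)
    {
        BackendStatusSlot *slot = &statusArray[my_backend_id];

        pthread_mutex_lock(&slot->lock);
        slot->pid = getpid();
        slot->command_start = time(NULL);
        strncpy(slot->command, query, COMMAND_STR_LEN - 1);
        slot->command[COMMAND_STR_LEN - 1] = '\0';
        pthread_mutex_unlock(&slot->lock);
    }

    /* Reader side: whoever serves pg_stat_activity copies a slot out
     * under the same lock; the stats collector is not involved at all. */
    void
    copy_backend_status(int backend_id, pid_t *pid, time_t *start,
                        char *command_out)
    {
        BackendStatusSlot *slot = &statusArray[backend_id];

        pthread_mutex_lock(&slot->lock);
        *pid = slot->pid;
        *start = slot->command_start;
        memcpy(command_out, slot->command, COMMAND_STR_LEN);
        pthread_mutex_unlock(&slot->lock);
    }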

The existing stats collection mechanism seems OK for event counts,
although I'd propose two changes: one, get rid of the separate
buffer process, and two, find a way to emit event reports in a
time-driven way rather than once per transaction commit.  I'm a bit
vague about how to do the latter at the moment.

Comments?
        regards, tom lane


Re: Rethinking stats communication mechanisms

From
PFC
Date:
> It strikes me that we are using a single communication mechanism to
> handle what are really two distinct kinds of data:
Interesting.
I recently read a paper on how to get rid of locks for this kind of  
pattern.

* For the Command String

- Problem: need to display the currently executing command in the ps
list.
- Will only be of use if the command is taking a long, long time. So, it
need not be realtime; no problem if the data comes with a little delay,
or not at all if the command executes quickly. People are only
interested in the currently executing command to answer questions like
"what query has my server grinding?"
- Point: the currently executing query is only interesting to display if
it's currently executing. If it's not, it's in the log (if enabled).

So, the proposal:

Each backend has a shared memory area in which to store:
- the currently executing command (like in your proposal)
- a timestamp
- a counter

On executing a command, the backend stores the command string, then
overwrites the counter with (counter + 1) and with the timestamp of
command start.

Periodically, like every N seconds, a separate process reads the
counter, then reads the data, then reads the counter again. If the
counter value changed, the process is repeated. If the counter value
did not change, the command string did not change either, so it's
valid and can be used.
Other thoughts:

If the backend process itself should update its process title, and this
operation is costly, it should only be done if the current query has
been running for more than T seconds. However, syscalls for getting the
current time are costly. A separate process can update a counter in
shared memory with the current time every N seconds, and the backend
can check it.

The main point is that if this value is written every few seconds but
read often by only one process, or written often but read seldom, there
will not be a lot of interprocessor cache thrashing on it.
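
A sketch of that last idea (names invented here, not taken from any actual
code): one maintenance process refreshes a coarse timestamp in shared
memory, and backends compare against it instead of making a time syscall
on every check. A slightly stale value is fine for a "longer than T
seconds?" test.

    /* Illustrative sketch only. */
    #include <time.h>
    #include <unistd.h>

    /* Lives in shared memory; written by one maintenance process only,
     * read by everyone. */
    typedef struct SharedClock
    {
        volatile time_t now;    /* coarse current time, a few seconds stale */
    } SharedClock;

    /* Maintenance process: refresh the shared time every N seconds. */
    void
    shared_clock_updater(SharedClock *clock, unsigned interval_secs)
    {
        for (;;)
        {
            clock->now = time(NULL);
            sleep(interval_secs);
        }
    }

    /* Backend: cheap check, no syscall needed on the hot path. */
    int
    query_running_longer_than(const SharedClock *clock,
                              time_t query_start, unsigned t_secs)
    {
        return clock->now - query_start >= (time_t) t_secs;
    }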


Re: Rethinking stats communication mechanisms

From
Tom Lane
Date:
PFC <lists@peufeu.com> writes:
>     So, the proposal :
>     On executing a command, Backend stores the command string, then  
> overwrites the counter with (counter + 1) and with the timestamp of  
> command start.
>     Periodically, like every N seconds, a separate process reads the counter,  
> then reads the data, then reads the counter again.

Well, it wouldn't be "periodic", it'd be "whenever someone reads
pg_stat_activity".  I was considering solutions like this, but I'm a
little disturbed by the possibility that the would-be reader might
loop indefinitely if the source backend is constantly changing its
entry.  Still, slow reads of pg_stat_activity might be a good tradeoff
for taking overhead out of the update operation.

BTW, I think the writer would actually need to bump the counter twice,
once before and once after it modifies its stats area.  Else there's
no way to detect that you've copied a partially-updated stats entry.

>     If the backend process itself should update its process title, and this  
> operation is costly, it should only be done if the current query has been  
> running for more than T seconds.

Managing that would cost more than just doing it, I think.
        regards, tom lane


Re: Rethinking stats communication mechanisms

From
Greg Stark
Date:
PFC <lists@peufeu.com> writes:

>     - Will only be of use if the command is taking a long, long time.
>     So, it need not be realtime ; no problem if the data comes with a
> little  delay, or not at all if the command executes quickly.

I would dispute this point. Picture a system running a very short query
very, very often. It may still be the main problem, and may even be taking
90+% of the CPU time. If you got an accurate snapshot of all the currently
executing queries you'd see it popping up suspiciously often.

-- 
greg



Re: Rethinking stats communication mechanisms

From
Douglas McNaught
Date:
Greg Stark <gsstark@mit.edu> writes:

> PFC <lists@peufeu.com> writes:
>
>>     - Will only be of use if the command is taking a long, long time.
>>     So, it need not be realtime ; no problem if the data comes with a
>> little  delay, or not at all if the command executes quickly.
>
> I would dispute this point. Picture a system running a very short query
> very, very often. It may still be the main problem, and may even be taking
> 90+% of the CPU time. If you got an accurate snapshot of all the currently
> executing queries you'd see it popping up suspiciously often.

Yeah, but if you turn on query logging in that case you'll see the
bajillions of short queries, so you don't need the accurate snapshot
to diagnose that.

-Doug


Re: Rethinking stats communication mechanisms

From
Greg Stark
Date:
Douglas McNaught <doug@mcnaught.org> writes:

> Yeah, but if you turn on query logging in that case you'll see the
> bajillions of short queries, so you don't need the accurate snapshot
> to diagnose that.

Query logging on a production OLTP machine? (a) That would be a huge
performance drain on the production system; (b) it would produce so much
logging that it would take a significant amount of disk and I/O resources
just to handle; and (c) you would need to implement special-purpose tools
just to make sense of these huge logs.


-- 
greg



Re: Rethinking stats communication mechanisms

From
Douglas McNaught
Date:
Greg Stark <gsstark@mit.edu> writes:

> Douglas McNaught <doug@mcnaught.org> writes:
>
>> Yeah, but if you turn on query logging in that case you'll see the
>> bajillions of short queries, so you don't need the accurate snapshot
>> to diagnose that.
>
> Query logging on a production OLTP machine? (a) That would be a huge
> performance drain on the production system; (b) it would produce so much
> logging that it would take a significant amount of disk and I/O resources
> just to handle; and (c) you would need to implement special-purpose tools
> just to make sense of these huge logs.

(a) and (b): of course you would only do it on a temporary basis for
problem diagnosis.  We do that with our production apps where I work
(when absolutely necessary).

(c): Perl.  :)

Ideally, you'd find the query storm problem in load testing before you
ever got to production.  I hope to someday visit that planet--it must
be nice.

-Doug


Re: Rethinking stats communication mechanisms

From
Greg Stark
Date:
Douglas McNaught <doug@mcnaught.org> writes:

> (a) and (b): of course you would only do it on a temporary basis for
> problem diagnosis.

Temporary or not it isn't really an option when you're dealing with high
volumes. You could imagine a setup where say, 1% of page requests randomly
turn on debugging to get a random sample of database traffic. There are always
solutions. 

But my point is that there's no reason to think that only long queries are
useful for snapshots. Short queries are equally capable of consuming resources
in aggregate. Hiding some subset of queries assuming they're uninteresting is
only going to produce deceptive results.

> Ideally, you'd find the query storm problem in load testing before you
> ever got to production.  I hope to someday visit that planet--it must
> be nice.

Indeed, when you get there send me a postcard :)

-- 
greg



Re: Rethinking stats communication mechanisms

From
"Magnus Hagander"
Date:
> The existing stats collection mechanism seems OK for event
> counts, although I'd propose two changes: one, get rid of the
> separate buffer process, and two, find a way to emit event
> reports in a time-driven way rather than once per transaction
> commit.  I'm a bit vague about how to do the latter at the moment.

Might it not be a win to also store "per backend global values" in the
shared memory segment? Things like "time of last command", "number of
transactions executed in this backend", "backend start time" and other
values that are fixed-size?

You can obviously not do it for things like per-table values, since the
size can't be predicted, but all per-backend counters that are fixed
size should be able to do this, I think. And if it's just a counter, it
should be reasonably safe to just do the increment operation without
locking, since there's only one writer for each process. That should
have a much lower overhead than UDP or whatever to the stats process,
no?
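
A sketch of what such per-backend fixed-size counters might look like
(names invented for illustration). With exactly one writer per slot, no
lock is needed for the increment; readers just have to tolerate values
that lag slightly, or that are momentarily torn on platforms where a
64-bit store isn't atomic:

    /* Illustrative sketch only. */
    #include <stdint.h>
    #include <time.h>

    /* One fixed-size block per backend, kept in shared memory.  Only the
     * owning backend ever writes its block; anyone may read it. */
    typedef struct BackendCounters
    {
        time_t   backend_start;
        time_t   last_command_start;
        uint64_t xact_count;           /* transactions run by this backend */
    } BackendCounters;

    extern BackendCounters *backendCounters;    /* array in shared memory */

    /* Writer side: plain increment, no lock, single writer per slot. */
    static inline void
    count_transaction(int my_backend_id)
    {
        backendCounters[my_backend_id].xact_count++;
    }

    /* Reader side (say, a stats view): totals may lag a little behind,
     * which is acceptable for event counts. */
    uint64_t
    total_transactions(int num_backends)
    {
        uint64_t total = 0;

        for (int i = 0; i < num_backends; i++)
            total += backendCounters[i].xact_count;
        return total;
    }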

It might be worthwhile to add a section for things like bgwriter (and
possibly the archiver?) to deliver statistics that we can add statistics
views for. (they can obviously not use a standard backend "struct" for
this since they'd have completely different values to report)

//Magnus


Re: Rethinking stats communication mechanisms

From
Tom Lane
Date:
"Magnus Hagander" <mha@sollentuna.net> writes:
> Might it not be a win to also store "per backend global values" in the
> shared memory segment? Things like "time of last command", "number of
> transactions executed in this backend", "backend start time" and other
> values that are fixed-size?

I'm including backend start time, command start time, etc under the
heading of "current status" which'll be in the shared memory.  However,
I don't believe in trying to count events (like transaction commits)
that way.  If we do then we risk losing events whenever a backend quits
and is replaced by another.

I haven't yet looked through the stats in detail, but this approach
basically presumes that we are only going to count events per-table and
per-database --- I am thinking that the background stats collector
process won't even keep track of individual backends anymore.  (So,
we'll fix the old problem of loss of backend-exit messages resulting
in bogus displays.)
        regards, tom lane


Re: Rethinking stats communication mechanisms

From
Hannu Krosing
Date:
On a fine day, Sunday 2006-06-18 at 15:09, Tom Lane wrote:
> "Magnus Hagander" <mha@sollentuna.net> writes:
> > Might it not be a win to also store "per backend global values" in the
> > shared memory segment? Things like "time of last command", "number of
> > transactions executed in this backend", "backend start time" and other
> > values that are fixed-size?

One thing that is doable in constant-size memory and would be enormously
useful for us is counting top-level function calls and storing their
total (and possibly also max) duration.

The reason being that our production databases are accessed by clients
using functions only (with some uninteresting exceptions of course),
that is, calls of the form "SELECT x,y,z FROM myfunc(i,j,k)".

So reserving N*1.5 slots (N being the number of functions defined at
database startup) would be ok. If more than N*0.5 additional functions
are defined in the database's lifetime, then the rest are simply ignored
(not counted).

Or maybe a better approach would be to have a conf variable
"number-of-functions-to-track" and a special boolean flag track_me in
pg_functions. In this way you don't accidentally run out of shared mem
by defining lots of new functions and then restarting the cluster.
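
A rough sketch of what such a fixed-size per-function slot table could
look like (all names invented; in a real multi-backend setting, claiming
a free slot would need a lock or an atomic operation):

    /* Illustrative sketch only. */
    #include <stdint.h>

    /* Sized once at startup, e.g. 1.5 * (number of functions defined
     * then), or from a number-of-functions-to-track setting. */
    #define FUNC_STAT_SLOTS 1024

    typedef struct FuncCallStats
    {
        uint32_t func_oid;          /* 0 = slot free */
        uint64_t calls;
        uint64_t total_usec;        /* total execution time */
        uint64_t max_usec;          /* longest single call */
    } FuncCallStats;

    extern FuncCallStats funcStats[FUNC_STAT_SLOTS];   /* in shared memory */

    /* Record one top-level function call; if the table has filled up,
     * functions defined later simply go uncounted. */
    void
    count_function_call(uint32_t func_oid, uint64_t elapsed_usec)
    {
        for (uint32_t i = 0; i < FUNC_STAT_SLOTS; i++)
        {
            FuncCallStats *slot = &funcStats[(func_oid + i) % FUNC_STAT_SLOTS];

            if (slot->func_oid == 0)
                slot->func_oid = func_oid;      /* claim a free slot */
            if (slot->func_oid == func_oid)
            {
                slot->calls++;
                slot->total_usec += elapsed_usec;
                if (elapsed_usec > slot->max_usec)
                    slot->max_usec = elapsed_usec;
                return;
            }
        }
        /* table full and this function not present: not counted */
    }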


-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: Rethinking stats communication mechanisms

From
Tom Lane
Date:
I wrote:
> PFC <lists@peufeu.com> writes:
>> So, the proposal :
>> On executing a command, Backend stores the command string, then  
>> overwrites the counter with (counter + 1) and with the timestamp of  
>> command start.
>> Periodically, like every N seconds, a separate process reads the counter,  
>> then reads the data, then reads the counter again.

> BTW, I think the writer would actually need to bump the counter twice,
> once before and once after it modifies its stats area.  Else there's
> no way to detect that you've copied a partially-updated stats entry.

Actually, neither of these ideas works: it's possible that the reader
copies the entry between the two increments of the counter.  Then, it
won't see any reason to re-read, but nonetheless it has copied an
inconsistent partially-modified entry.

Anyone know a variant of this that really works?
        regards, tom lane


Re: Rethinking stats communication mechanisms

From
"Bort, Paul"
Date:
>
> > BTW, I think the writer would actually need to bump the counter twice,
> > once before and once after it modifies its stats area.  Else there's
> > no way to detect that you've copied a partially-updated stats entry.
>
> Actually, neither of these ideas works: it's possible that the reader
> copies the entry between the two increments of the counter.  Then, it
> won't see any reason to re-read, but nonetheless it has copied an
> inconsistent partially-modified entry.
>
> Anyone know a variant of this that really works?
>

Here's a theory: If the counter is bumped to an odd number before
modification, and an even number after it's done, then the reader will
know it needs to re-read if the counter is an odd number.

This might be assuming too much about what the writer knows about the
current contents of the counter, but since it's per-backend, I think it
would work.

Regards,
Paul Bort


Re: Rethinking stats communication mechanisms

From
Tom Lane
Date:
"Bort, Paul" <pbort@tmwsystems.com> writes:
>> Anyone know a variant of this that really works?

> Here's a theory: If the counter is bumped to an odd number before
> modification, and an even number after it's done, then the reader will
> know it needs to re-read if the counter is an odd number. 

Great minds think alike ;-) ... I just committed exactly that protocol.
I believe it is correct, because AFAICS there are only four possible
risk cases:

* reader's read starts before and ends after writer's update: reader
will certainly note a change in update counter.

* reader's read starts before and ends within writer's update: reader
will note a change in update counter.

* reader's read starts within and ends after writer's update: reader
will note a change in update counter.

* reader's read starts within and ends within writer's update: reader
will see update counter as odd.

Am I missing anything?

Note in particular that this protocol does not assume atomic update
of the counter, so we don't need to worry about whether int is
sig_atomic_t.  If any of the bytes change, we have what we need.
We could use a counter narrower than int, but then there's some tiny
risk that the counter could wrap all the way around while the reader
is blocked.
        regards, tom lane
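
A minimal sketch of that odd/even changecount protocol, with invented
names (the committed code differs in detail): the counter is bumped once
before and once after the update, so it is odd exactly while a write is
in progress, and a reader retries until it gets a copy with no
overlapping write.

    /* Illustrative sketch only; not the committed PostgreSQL code. */
    #include <string.h>

    #define CMDLEN 1024

    typedef struct StatusEntry
    {
        volatile int changecount;   /* odd while an update is in progress */
        char         command[CMDLEN];
    } StatusEntry;

    /* Writer: the owning backend, and only it, updates this entry. */
    void
    write_entry(StatusEntry *e, const char *cmd)
    {
        e->changecount++;           /* now odd: update in progress */
        /* on a multiprocessor a write barrier would belong here */
        strncpy(e->command, cmd, CMDLEN - 1);
        e->command[CMDLEN - 1] = '\0';
        e->changecount++;           /* even again: update complete */
    }

    /* Reader: retry until a copy is taken with no overlapping write. */
    void
    read_entry(StatusEntry *e, char *out)
    {
        for (;;)
        {
            int before = e->changecount;

            memcpy(out, e->command, CMDLEN);

            if (before == e->changecount && (before & 1) == 0)
                break;              /* copy is consistent */
            /* a write overlapped the read (or was in progress); retry */
        }
    }

The cost profile matches the goal of the thread: the writer pays only two
increments, and all retry overhead lands on the reader.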


Re: Rethinking stats communication mechanisms

From
PFC
Date:
> Great minds think alike ;-) ... I just committed exactly that protocol.
> I believe it is correct, because AFAICS there are only four possible
> risk cases:
Congrats!
For general culture you might be interested in reading these:
http://en.wikipedia.org/wiki/Software_transactional_memory
http://libcmt.sourceforge.net/


Re: Rethinking stats communication mechanisms

From
"Magnus Hagander"
Date:
> > Might it not be a win to also store "per backend global values" in the
> > shared memory segment? Things like "time of last command", "number of
> > transactions executed in this backend", "backend start time" and other
> > values that are fixed-size?
>
> I'm including backend start time, command start time, etc under the
> heading of "current status" which'll be in the shared memory.  However,
> I don't believe in trying to count events (like transaction commits)
> that way.  If we do then we risk losing events whenever a backend quits
> and is replaced by another.

Well, in many cases that's not a problem. It might be interesting, for
example, to know that a backend has run nnn transactions before ending up
in the state where it is now (say, idle in transaction and idle for a long
time). The part about this being "transient data" that can go away when
the backend quits would still hold true.

What were your thoughts about storing bgwriter and archiver statistics
that way? Good or bad idea?


> I haven't yet looked through the stats in detail, but this
> approach basically presumes that we are only going to count
> events per-table and per-database --- I am thinking that the
> background stats collector process won't even keep track of
> individual backends anymore.  (So, we'll fix the old problem
> of loss of backend-exit messages resulting in bogus displays.)

Right.  As I see you have now implemented it ;-)

/Magnus


Re: Rethinking stats communication mechanisms

From
"Bort, Paul"
Date:
>
> * reader's read starts before and ends after writer's update: reader
> will certainly note a change in update counter.
>
> * reader's read starts before and ends within writer's update: reader
> will note a change in update counter.
>
> * reader's read starts within and ends after writer's update: reader
> will note a change in update counter.
>
> * reader's read starts within and ends within writer's update: reader
> will see update counter as odd.
>
> Am I missing anything?
>

The only remaining concern would be the possibility of the reader
thrashing because the writer is updating so often that the reader never
gets the same counter twice. IIRC, the reader was only sampling, not
trying to catch every entry, so that will help. But is it enough?

Regards,
Paul Bort


Re: Rethinking stats communication mechanisms

From
Kenneth Marshall
Date:
On Sun, Jun 18, 2006 at 05:26:16PM -0400, Tom Lane wrote:
> I wrote:
> > PFC <lists@peufeu.com> writes:
> >> So, the proposal :
> >> On executing a command, Backend stores the command string, then  
> >> overwrites the counter with (counter + 1) and with the timestamp of  
> >> command start.
> >> Periodically, like every N seconds, a separate process reads the counter,  
> >> then reads the data, then reads the counter again.
> 
> > BTW, I think the writer would actually need to bump the counter twice,
> > once before and once after it modifies its stats area.  Else there's
> > no way to detect that you've copied a partially-updated stats entry.
> 
> Actually, neither of these ideas works: it's possible that the reader
> copies the entry between the two increments of the counter.  Then, it
> won't see any reason to re-read, but nonetheless it has copied an
> inconsistent partially-modified entry.
> 
> Anyone know a variant of this that really works?
> 
>             regards, tom lane
> 

For a single writer, the algorithm is the latch routine that I
have described previously. If you call an odd value of the counter
the latch, then it would be:

Update -
1. Set counter to odd number.
2. Update content.
3. Increment counter which releases the latch.

Read -
1. Copy the value of the counter into a register.
2. Read the content.
3. Read the counter. If odd, go to 1.
4. If the counter is different than that in the register, go to 1.

For multiple writers, you will need an atomic action to set
the "latch".

Ken
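
For what that multi-writer case might look like: a sketch using C11
atomics (illustrative only, and obviously not what the single-writer
committed code needs), where "taking the latch" means atomically moving
the counter from an even value to the next odd one.

    /* Illustrative sketch only. */
    #include <stdatomic.h>

    /* Acquire the latch: succeed only when the counter is even and we
     * win the race to make it odd.  Busy-waits; a real implementation
     * would more likely just use a proper lock for writers. */
    void
    latch_acquire(atomic_uint *counter)
    {
        for (;;)
        {
            unsigned cur = atomic_load(counter);

            if ((cur & 1) == 0 &&
                atomic_compare_exchange_weak(counter, &cur, cur + 1))
                return;             /* counter is now odd: latch held */
            /* else someone else is writing, or we lost the race; retry */
        }
    }

    /* Release the latch: bump the counter to the next even value. */
    void
    latch_release(atomic_uint *counter)
    {
        atomic_fetch_add(counter, 1u);
    }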


Re: Rethinking stats communication mechanisms

From
Kenneth Marshall
Date:
On Sun, Jun 18, 2006 at 11:07:41PM -0400, Tom Lane wrote:
> "Bort, Paul" <pbort@tmwsystems.com> writes:
> >> Anyone know a variant of this that really works?
> 
> > Here's a theory: If the counter is bumped to an odd number before
> > modification, and an even number after it's done, then the reader will
> > know it needs to re-read if the counter is an odd number. 
> 
> Great minds think alike ;-) ... I just committed exactly that protocol.
> I believe it is correct, because AFAICS there are only four possible
> risk cases:
> 
> * reader's read starts before and ends after writer's update: reader
> will certainly note a change in update counter.
> 
> * reader's read starts before and ends within writer's update: reader
> will note a change in update counter.
> 
> * reader's read starts within and ends after writer's update: reader
> will note a change in update counter.
> 
> * reader's read starts within and ends within writer's update: reader
> will see update counter as odd.
> 
> Am I missing anything?
> 
> Note in particular that this protocol does not assume atomic update
> of the counter, so we don't need to worry about whether int is
> sig_atomic_t.  If any of the bytes change, we have what we need.
> We could use a counter narrower than int, but then there's some tiny
> risk that the counter could wrap all the way around while the reader
> is blocked.
> 
>             regards, tom lane
> 

The reader needs to read the counter/version at the beginning of
the read and then again after the read. If the value is the same,
it can assume that the read/data is okay. I am assuming that there
is only one possible writer, otherwise the writer will need to
lock the area from other writers. This can be done with an atomic
instruction such as compare-and-swap or test-and-set.

Ken Marshall