Thread: Performance monitoring (was: [PATCHES] Logging checkpoints and other slowdown causes)

Moving to -hackers.

On Fri, May 11, 2007 at 04:37:44PM +0100, Heikki Linnakangas wrote:
> >If you know when the checkpoint ended, and you know how long each of the
> >pieces took, you can reconstruct the other times easily.  The way you
> >describe this it is true--that the summary is redundant given the
> >detail--but if you put yourself in the shoes of a log file parser the
> >other way around is easier to work with.  Piecing together log entries
> >is a pain, splitting them is easy.
> >
> >If I had to only keep one line out of this, it would be the one with the
> >summary.  It would be nice to have it logged at INFO.
>
> Yeah, if we have the summary line we don't need the other lines and vice
> versa. I have sympathy for parsing log files, I've done that a lot in
> the past and I can see what you mean. Having the individual lines is
> nice when you're monitoring a running system; you don't get the summary
> line until the checkpoint is finished. I suppose we can have both the
> individual lines and the summary, the extra lines shouldn't hurt anyone,
> and you won't get them unless you turn on the new log_checkpoints
> parameter anyway.

Not to beat a dead horse, but do we really want to force folks to be
parsing logs for performance monitoring? Especially if that log parsing
is just going to result in data being inserted into a table anyway?

I know there's concern about performance of the stats system and maybe
that needs to be addressed, but pushing users to log parsing is a lot of
extra effort, non-standard, likely to be overlooked, and doesn't play
well with other tools. It also conflicts with all the existing
statistics framework.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Performance monitoring

From
"Joshua D. Drake"
Date:
> Not to beat a dead horse, but do we really want to force folks to be
> parsing logs for performance monitoring? Especially if that log parsing
> is just going to result in data being inserted into a table anyway?
> 
> I know there's concern about performance of the stats system and maybe
> that needs to be addressed, but pushing users to log parsing is a lot of
> extra effort, non-standard, likely to be overlooked, and doesn't play
> well with other tools. It also conflicts with all the existing
> statistics framework.


One thing that doesn't seemed to be being looked at it is the cost of 
logging. Logging is very expensive. I don't know if it is more expensive 
than the stats system, but you can cut your tps in half by having any 
level of verbose logging on.

Yes that can be offset by pushing the logging to another spindle, and 
being careful about what you are logging but still.

Either way, we are taking the hit, it is just a matter of where. IMO it 
would be better to have the information in the database where it makes 
sense, than pushing out to a log that:

A. Will likely be forgotten
B. Is only accessible if you have shell access to the machine (not as 
common as all of us would like to think)

Sincerely,

Joshua D. Drake


-- 
      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



On Sat, 12 May 2007, Jim C. Nasby wrote:

> Not to beat a dead horse, but do we really want to force folks to be
> parsing logs for performance monitoring?

All of the really interesting DBA level information about checkpoints is 
now sitting in pg_stat_bgwriter.  There really is no reason I'd expect 
this information to be part of normal performance monitoring setup; for me 
it has fit into troubleshooting and R&D.

When I'm talking about parsing logs in this context, it's for answering 
questions like "how does fsync time correlate with amount of data written 
at checkpoint time?"  Since I've found it educational to sort through 
that, I just didn't want the patch modified so that was harder to do than 
it had to be.

If I thought it worth the trouble to expose that level of information via 
the stats interface, I'd have submitted that instead of this log-based 
patch.  This circles back to the previous discussion of whether this 
particular information is strictly developer-level.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Performance monitoring

From
Greg Smith
Date:
On Sat, 12 May 2007, Joshua D. Drake wrote:

> One thing that doesn't seemed to be being looked at it is the cost of 
> logging.

If any of this executed at something like the query level, sure, that 
would be real important.  The majority of the logging I suggested here is 
of things that happen at checkpoint time.  The presumption is that the 
overhead of that is considerably greater than writing a log line or even 
five.

The least intensive event I would like to be loggable is when a new WAL 
segment is created and cleared, which is again a pretty small bit of log 
compared to the 16MB write.  I wouldn't mind seeing that exposed under 
pg_stats instead, just had more interesting things to statify first.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Performance monitoring

From
Neil Conway
Date:
On Sat, 2007-12-05 at 14:26 -0700, Joshua D. Drake wrote:
> Either way, we are taking the hit, it is just a matter of where. IMO it 
> would be better to have the information in the database where it makes 
> sense, than pushing out to a log

If performance monitoring information is provided as a database object,
what would the right interface be? IMHO the problem with cleanly
presenting monitoring information within a normal database system is
that this sort of data is fundamentally dynamic and continuous: to
determine how the performance of the system changes over time, you need
to repeatedly rescan the table/view/SRF and recompute your analysis
essentially from scratch. Trying to get even simple information like
"queries per second" from pg_stat_activity is an example of how this can
be painful.

<plug>
BTW, if the system included the concept of a continuous data *stream* as
a kind of database object, this problem would be much more tractable :)
In fact, there is some code in a version of TelegraphCQ that exposes
various information about the runtime state of the system as a set of
system-defined data streams -- like any other stream, users could then
use those streams in arbitrary queries.
</plug>

-Neil




Re: Performance monitoring

From
Heikki Linnakangas
Date:
Jim C. Nasby wrote:
> Moving to -hackers.
> 
> On Fri, May 11, 2007 at 04:37:44PM +0100, Heikki Linnakangas wrote:
>>> If you know when the checkpoint ended, and you know how long each of the 
>>> pieces took, you can reconstruct the other times easily.  The way you 
>>> describe this it is true--that the summary is redundant given the 
>>> detail--but if you put yourself in the shoes of a log file parser the 
>>> other way around is easier to work with.  Piecing together log entries 
>>> is a pain, splitting them is easy.
>>>
>>> If I had to only keep one line out of this, it would be the one with the 
>>> summary.  It would be nice to have it logged at INFO.
>> Yeah, if we have the summary line we don't need the other lines and vice 
>> versa. I have sympathy for parsing log files, I've done that a lot in 
>> the past and I can see what you mean. Having the individual lines is 
>> nice when you're monitoring a running system; you don't get the summary 
>> line until the checkpoint is finished. I suppose we can have both the 
>> individual lines and the summary, the extra lines shouldn't hurt anyone, 
>> and you won't get them unless you turn on the new log_checkpoints 
>> parameter anyway.
> 
> Not to beat a dead horse, but do we really want to force folks to be
> parsing logs for performance monitoring? Especially if that log parsing
> is just going to result in data being inserted into a table anyway?
> 
> I know there's concern about performance of the stats system and maybe
> that needs to be addressed, but pushing users to log parsing is a lot of
> extra effort, non-standard, likely to be overlooked, and doesn't play
> well with other tools. It also conflicts with all the existing
> statistics framework.

There is two counters for checkpoints in pgstats, the number of timed 
(triggered by checkpoint_timeout) and requested (triggered by 
checkpoint_segments) checkpoints.

Maybe we should improve the stats system so that we can collect events 
with timestamps and durations, but in my experience log files actually 
are the most reliable and universal way to collect real-time performance 
information. Any serious tool has a generic log parser. The other 
alternative is SNMP. I welcome the efforts on pgsnmpd..

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Performance monitoring

From
Magnus Hagander
Date:
Heikki Linnakangas wrote:
>>> Yeah, if we have the summary line we don't need the other lines and
>>> vice versa. I have sympathy for parsing log files, I've done that a
>>> lot in the past and I can see what you mean. Having the individual
>>> lines is nice when you're monitoring a running system; you don't get
>>> the summary line until the checkpoint is finished. I suppose we can
>>> have both the individual lines and the summary, the extra lines
>>> shouldn't hurt anyone, and you won't get them unless you turn on the
>>> new log_checkpoints parameter anyway.
>>
>> Not to beat a dead horse, but do we really want to force folks to be
>> parsing logs for performance monitoring? Especially if that log parsing
>> is just going to result in data being inserted into a table anyway?
>>
>> I know there's concern about performance of the stats system and maybe
>> that needs to be addressed, but pushing users to log parsing is a lot of
>> extra effort, non-standard, likely to be overlooked, and doesn't play
>> well with other tools. It also conflicts with all the existing
>> statistics framework.
> 
> There is two counters for checkpoints in pgstats, the number of timed
> (triggered by checkpoint_timeout) and requested (triggered by
> checkpoint_segments) checkpoints.
> 
> Maybe we should improve the stats system so that we can collect events
> with timestamps and durations, but in my experience log files actually
> are the most reliable and universal way to collect real-time performance
> information. Any serious tool has a generic log parser. The other
> alternative is SNMP. I welcome the efforts on pgsnmpd..

pgsnmpd can't provide any information that's not in the backend. Unless
we'd turn it into a log parser, which is not really something I think is
a good idea.

Log files are great for one kind of thing, "live statistics" through
SNMP or the statistics collector for another kind. It only gets wrong
when you put them in the wrong place. Things you poll regularly makes a
lot more sense in some kind of live view than in a log file.

//Magnus


Re: Performance monitoring

From
"Jim C. Nasby"
Date:
On Sun, May 13, 2007 at 07:54:20AM +0100, Heikki Linnakangas wrote:
> Maybe we should improve the stats system so that we can collect events 
> with timestamps and durations, but in my experience log files actually 
> are the most reliable and universal way to collect real-time performance 
> information. Any serious tool has a generic log parser. The other 
> alternative is SNMP. I welcome the efforts on pgsnmpd..

Having timing information in the stats system would be useful, but I'm
not sure how it could actually be done. But at least if the information
is in the stats system it's easy to programatically collect and process.
SNMP is just one example of that (fwiw I agree with Magnus that it
probably doesn't make sense to turn pgsnmpd into a log parser...)
-- 
Jim Nasby                                      decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Performance monitoring

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Jim C. Nasby wrote:

> 
> There is two counters for checkpoints in pgstats, the number of timed 
> (triggered by checkpoint_timeout) and requested (triggered by 
> checkpoint_segments) checkpoints.
> 
> Maybe we should improve the stats system so that we can collect events 
> with timestamps and durations, but in my experience log files actually 
> are the most reliable and universal way to collect real-time performance 
> information. Any serious tool has a generic log parser. The other 
> alternative is SNMP. I welcome the efforts on pgsnmpd..

Whats about add some DTrace probes?
    Zdenek