Re: Log levels for checkpoint/bgwriter monitoring - Mailing list pgsql-hackers

From Greg Smith
Subject Re: Log levels for checkpoint/bgwriter monitoring
Msg-id Pine.GSO.4.64.0703280849320.19140@westnet.com
In response to Re: Log levels for checkpoint/bgwriter monitoring  (Magnus Hagander <magnus@hagander.net>)
List pgsql-hackers
On Tue, 27 Mar 2007, Magnus Hagander wrote:

> Would not at least some of these numbers be better presented through the
> stats collector, so they can be easily monitored?
> That goes along the line of my way way way away from finished attempt
> earlier, perhaps a combination of these two patches?

When I saw your patch recently, I thought to myself "hmmm, the data 
collected here sure looks familiar"--you even made some of the exact same 
code changes I did.  I've been bogged down recently chasing a performance 
issue that, as it turns out, was mainly caused by the "high CPU usage for 
stats collector" bug.  That caused the background writer to slow to a 
crawl under heavy load, which is why I was having all these checkpoint and 
writer issues that got me monitoring that code in the first place.

With that seemingly resolved, I have a slightly new plan.  Next I want to 
take the data I've been collecting in my patch, bundle the most important 
parts of it into messages sent to the stats collector, the way it was 
suggested you rewrite your patch, and then submit the result.  I have the 
logging side worked out and a good idea of what data should be collected, 
but as this would be my first time adding stats, I'd certainly love some 
help with that.

Once that monitoring infrastructure is in place, I plan to merge 
Itagaki's "Load distributed checkpoint" patch (it touches a lot of the 
same code) and test it under heavy load.  I think that gives a much better 
context for evaluating the patch: rather than measuring just its gross 
results, you can say something like "with the patch in place the average 
fsync time on my system dropped from 3 seconds to 1.2 seconds when writing 
out more than 100MB at checkpoint time".  That's the direct cause of the 
biggest problem in that area of code, so why not stare right at it rather 
than measuring it indirectly?
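
As an illustration only (this is not code from either patch), a figure 
like the average fsync time quoted above could be gathered by timing the 
fsync call directly.  Below is a minimal Python sketch; the helper names 
timed_fsync and average_fsync_seconds are hypothetical, and a temporary 
file stands in for a real data file:

```python
import os
import tempfile
import time


def timed_fsync(fd):
    """Return the wall-clock seconds spent inside os.fsync(fd)."""
    start = time.perf_counter()
    os.fsync(fd)
    return time.perf_counter() - start


def average_fsync_seconds(chunk_bytes, rounds):
    """Write chunk_bytes, fsync, and repeat; return the mean fsync time."""
    samples = []
    # Assumption: a temporary file is a stand-in for the file being flushed.
    with tempfile.TemporaryFile() as f:
        payload = b"\0" * chunk_bytes
        for _ in range(rounds):
            f.write(payload)
            f.flush()  # hand Python's buffer to the OS before timing fsync
            samples.append(timed_fsync(f.fileno()))
    return sum(samples) / len(samples)


if __name__ == "__main__":
    # 1 MB per round keeps the demo quick; larger writes need a bigger value.
    avg = average_fsync_seconds(1024 * 1024, 5)
    print("average fsync time: %.6f s" % avg)
```

Timing only the fsync call, and not the write that precedes it, keeps the 
number focused on the flush itself, which is the quantity being compared 
with and without the patch.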

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

