On Tue, 27 Mar 2007, Magnus Hagander wrote:
> Would not at least some of these numbers be better presented through the
> stats collector, so they can be easily monitored?
> That goes along the line of my way way way away from finished attempt
> earlier, perhaps a combination of these two patches?
When I saw your patch recently, I thought to myself "hmmm, the data
collected here sure looks familiar"--you even made some of the exact same
code changes I did. I've been bogged down recently chasing a performance
issue that, as it turns out, was mainly caused by the "high CPU usage for
stats collector" bug. That caused the background writer to slow to a
crawl under heavy load, which is why I was having all these checkpoint and
writer issues that got me monitoring that code in the first place.
With that seemingly resolved, slightly new plan now. Next I want to take
the data I've been collecting in my patch, bundle the most important parts
of that into messages sent to the stats collector the way it was suggested
you rewrite your patch, then submit the result. I've got log files down and
have a really good idea what data should be collected, but as this would be
my first time adding stats I'd certainly love some help with that.
Once that monitoring infrastructure is in place, I then planned to merge
Itagaki's "Load distributed checkpoint" patch (it touches a lot of the
same code) and test that out under heavy load. I think you get a much
better context for evaluating that patch if, rather than measuring just its
gross results, you can say something like "with the patch in place the
average fsync time on my system dropped from 3 seconds to 1.2 seconds when
writing out more than 100MB at checkpoint time". That's the direct cause
of the biggest problem in that area of code, so why not stare right at it
rather than measuring it indirectly?
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD