Thread: Logarithmic change (decrease) in performance
Something interesting is going on. I wish I could show you the graphs,
but I'm sure this will not be a surprise to the seasoned veterans.

A particular application server I have has been running for over a
year now. I've been logging CPU load since mid-April. It took 8 months
or more to fall from excellent performance to "acceptable." Then, over
the course of about 5 weeks, it fell from "acceptable" to "so-so."
Then, in the last four weeks, it's gone from "so-so" to alarming.

I've been working on this performance drop since Friday, but it wasn't
until I replied to Arnau's post earlier today that I remembered I'd
been logging the server load. I grabbed the data and charted it in
Excel, and to my surprise, the graph of the server's load average looks
kind of like the graph of y = x^2.

I've got to make a recommendation for a solution to the PHB, and my
analysis is showing that as the dataset becomes larger, the amount of
time the disk spends seeking is increasing. This causes processes to
take longer to finish, which causes more processes to pile up, which
causes processes to take longer to finish, which causes more processes
to pile up, etc. It is this growing dataset that seems to be the source
of the sharp decrease in performance.

I knew this day would come, but I'm actually quite surprised that when
it came, there was little time between the warning and the grand
finale. I guess this message is being sent to the list to serve as a
warning to other data warehouse admins that when you reach your
capacity, the downward spiral happens rather quickly.

Crud... Outlook just froze while I was composing the PHB memo. I'd
been working on that for an hour. What a bad day.

--
Matthew Nuzum
www.bearfruit.org
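[Editorial aside: the feedback loop described above (bigger dataset -> more seeking -> slower completions -> deeper backlog -> even slower completions) can be sketched as a toy simulation. Everything below is invented for illustration — the function name, the constants, the units — only the shape of the curve is the point: flat for a long time, then climbing away superlinearly, much like the y = x^2 chart described.]

```python
# Toy model of the feedback loop: as the dataset grows, per-request
# service cost grows (seek-bound), so requests pile up, and the backlog
# itself slows every remaining request down further. All numbers are
# made up; only the shape of the resulting curve matters.

def simulate(months, arrivals_per_tick=10, capacity=100.0):
    """Return the request backlog (a stand-in for load average) per month."""
    backlog = 0.0
    history = []
    for month in range(1, months + 1):
        dataset = month  # dataset grows linearly with time
        # Per-request cost grows with dataset size, amplified by the
        # existing backlog competing for the same disk arms.
        cost = dataset * (1.0 + backlog / capacity)
        completed = capacity / cost          # requests finished this tick
        backlog = max(0.0, backlog + arrivals_per_tick - completed)
        history.append(backlog)
    return history

loads = simulate(24)
# For months 1-10 completions keep up and the backlog stays at zero;
# once capacity is crossed, the backlog compounds month over month.
```

The striking part, matching the account above, is how little warning the model gives: the curve is indistinguishable from "fine" right up until the tipping point.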
>From: Matthew Nuzum <mattnuzum@gmail.com>
>Sent: Sep 28, 2005 4:02 PM
>Subject: [PERFORM] Logarithmic change (decrease) in performance

Small nit-pick: a "logarithmic decrease" in performance would be a
relatively good thing, being better than either a linear or an
exponential decrease in performance. What you are describing is the
worst kind: an _exponential_ decrease in performance.

>Something interesting is going on. I wish I could show you the graphs,
>but I'm sure this will not be a surprise to the seasoned veterans.
>
>A particular application server I have has been running for over a
>year now. I've been logging CPU load since mid-April.
>
>It took 8 months or more to fall from excellent performance to
>"acceptable." Then, over the course of about 5 weeks it fell from
>"acceptable" to "so-so." Then, in the last four weeks it's gone from
>"so-so" to alarming.
>
>I've been working on this performance drop since Friday but it wasn't
>until I replied to Arnau's post earlier today that I remembered I'd
>been logging the server load. I grabbed the data and charted it in
>Excel and to my surprise, the graph of the server's load average looks
>kind of like the graph of y = x^2.
>
>I've got to make a recommendation for a solution to the PHB and my
>analysis is showing that as the dataset becomes larger, the amount of
>time the disk spends seeking is increasing. This causes processes to
>take longer to finish, which causes more processes to pile up, which
>causes processes to take longer to finish, which causes more processes
>to pile up, etc. It is this growing dataset that seems to be the source
>of the sharp decrease in performance.
>
>I knew this day would come, but I'm actually quite surprised that when
>it came, there was little time between the warning and the grand
>finale. I guess this message is being sent to the list to serve as a
>warning to other data warehouse admins that when you reach your
>capacity, the downward spiral happens rather quickly.
Yep, definitely been where you are. Bottom line: you have to reduce
the seeking behavior of the system to within an acceptable window and
then keep it there.

1= keep more of the data set in RAM
2= increase the size of your HD IO buffers
3= make your RAID sets wider (more parallel vs sequential IO)
4= reduce the atomic latency of your RAID sets
   (time for Fibre Channel 15Krpm HDs vs 7.2Krpm SATA ones?)
5= make sure your data is as unfragmented as possible
6= change your DB schema to minimize the problem
   a= overall good schema design
   b= partitioning the data so that the system only has to manipulate a
      reasonable chunk of it at a time

In many cases, there are a number of ways to accomplish the above.
Unfortunately, most of them require CapEx.

Also, ITRW such systems tend to have this as a chronic problem. This
is not a "fix it once and it goes away forever" situation; it is part
of the regular maintenance and upgrade plan(s).

Good Luck,
Ron
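[Editorial aside: point 6b can be sketched in miniature. The snippet below uses SQLite from the Python standard library as a stand-in for the real database, with an invented `events` table split by month, so a one-month report never touches the rest of the dataset. In PostgreSQL the same routing idea is typically expressed with table inheritance and constraint exclusion; this is only an illustration of the principle.]

```python
# Minimal month-based partitioning sketch: rows are routed into
# per-month tables so queries scan only the partition they need.
# Table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")

def partition_for(day):
    """Map a date string like '2005-09-28' to a per-month table name."""
    return "events_" + day[:7].replace("-", "_")   # e.g. events_2005_09

def insert_event(conn, day, payload):
    table = partition_for(day)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (day TEXT, payload TEXT)")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (day, payload))

insert_event(conn, "2005-08-15", "old row")
insert_event(conn, "2005-09-28", "new row")

# A September report reads only the September partition; the August
# data (and its indexes) never enter the picture.
rows = conn.execute("SELECT payload FROM events_2005_09").fetchall()
```

The win for a seek-bound system is that the working set per query shrinks to one partition, which is exactly the "reasonable chunk at a time" idea in 6b.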
On Wed, Sep 28, 2005 at 06:03:03PM -0400, Ron Peacetree wrote:
> 1= keep more of the data set in RAM
> 2= increase the size of your HD IO buffers
> 3= make your RAID sets wider (more parallel vs sequential IO)
> 4= reduce the atomic latency of your RAID sets
>    (time for Fibre Channel 15Krpm HDs vs 7.2Krpm SATA ones?)
> 5= make sure your data is as unfragmented as possible
> 6= change your DB schema to minimize the problem
>    a= overall good schema design
>    b= partitioning the data so that the system only has to manipulate a
>       reasonable chunk of it at a time

Note that 6 can easily swamp the rest of these tweaks. A poor schema
design will absolutely kill any system. Also of great importance is
how you're using the database. I.e.: are you doing any row-by-row
operations?

> In many cases, there are a number of ways to accomplish the above.
> Unfortunately, most of them require CapEx.
>
> Also, ITRW such systems tend to have this as a chronic problem. This
> is not a "fix it once and it goes away forever" situation; it is part
> of the regular maintenance and upgrade plan(s).

And why DBAs typically make more money than other IT folks. :)

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software        http://pervasive.com       work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf            cell: 512-569-9461
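[Editorial aside: the row-by-row warning can be made concrete. A sketch using SQLite from the Python standard library — the point applies to any SQL database, and the table and column names are invented. Both approaches below double every balance, but the set-based statement does its work in one pass inside the database instead of one client round trip per row.]

```python
# Row-by-row vs set-based processing of the same update.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(i, 100) for i in range(5)])

# Row-by-row: fetch every id, then issue one UPDATE per row.
# At warehouse scale this is a seek (and round trip) per row.
for (acct_id,) in conn.execute("SELECT id FROM accounts").fetchall():
    conn.execute("UPDATE accounts SET balance = balance * 2 WHERE id = ?",
                 (acct_id,))

# Set-based: one statement the engine can plan and execute in bulk.
conn.execute("UPDATE accounts SET balance = balance * 2")

balances = [b for (b,) in
            conn.execute("SELECT balance FROM accounts ORDER BY id")]
```

On a system already sliding down the seek-bound spiral described earlier in the thread, converting row-by-row loops to set-based statements is one of the few remedies that costs no CapEx.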