Thread: Write workload is causing severe slowdown in Production
Hi,

We're running a web-based application powered by PostgreSQL. Recently, we've developed a new, separate Java-based standalone (daemon) multi-threaded program that performs heavy read and write operations on 2 huge tables. One table has 5.4 million records and the other has 1.3 million records. Moreover, more than one read and/or write operation may be executing concurrently.

The issue we're currently facing in our Production server is that whenever this newly developed Java program is started, the entire web application immediately becomes very slow to respond. At the same time, I can see from the output of "iostat -tx" that "%util" is crossing more than 80%. What I infer from this, based on my knowledge, is that the write operations are creating heavy IO traffic. Since it was slowing down the entire web application, we've temporarily stopped running this standalone application.

Meanwhile, I've also read that "checkpoint spikes" can be a reason for slowdowns in write-heavy databases. I'm also reading that starting in PostgreSQL 8.3, we can get verbose logging of the checkpoint process by turning on "log_checkpoints".

My question is: how do I determine whether checkpoints are the root cause of this slowdown in my case? We're running PostgreSQL v8.2.22 on CentOS 5.2 with 35 GB RAM. "log_checkpoints" is not available in PostgreSQL v8.2.22.

We want to optimize our Production database to handle both reads and writes; any suggestions/advice/guidelines on this are highly appreciated.

Some important "postgresql.conf" parameters are:

# - Memory -
shared_buffers = 1536MB

# - Planner Cost Constants -
effective_cache_size = 4GB

# - Checkpoints -
checkpoint_segments = 32
checkpoint_timeout = 5min
checkpoint_warning = 270s

# - Background writer -
bgwriter_delay = 200ms
bgwriter_lru_percent = 1.0
bgwriter_lru_maxpages = 5
bgwriter_all_percent = 0.333
bgwriter_all_maxpages = 5

Regards,
Gnanam
22.03.12 09:27, Gnanakumar wrote:
> My question is, how do I determine whether "checkpoint" occurrences are the
> root cause of this slowdown in my case?

Check for messages like the following in your log:

LOG: checkpoints are occurring too frequently (ZZZ seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".

Best regards,
Vitalii Tymchyshyn
On 22 Březen 2012, 10:42, Vitalii Tymchyshyn wrote:
> Check for next messages in your log:
> LOG: checkpoints are occurring too frequently (ZZZ seconds apart)
> HINT: Consider increasing the configuration parameter
> "checkpoint_segments".

22.03.12 09:27, Gnanakumar wrote:
>> The issue that we're facing currently in our Production server is, whenever
>> this "newly" developed Java program is started/run, then immediately the
>> entire web application becomes very slow in response. At this time, I could
>> also see from the output of "iostat -tx" that "%util" is even crossing
>> more than 80%.

I'd say you should investigate what the application actually does. The chances are it's poorly written, issuing a lot of queries and causing a lot of IO. And 80% utilization does not mean the operations need to be writes - it's about IO operations, i.e. both reads and writes.

>> My question is, how do I determine whether "checkpoint" occurrences are
>> the root cause of this slowdown in my case? We're running PostgreSQL
>> v8.2.22 on CentOS5.2 having 35 GB RAM. "log_checkpoints" is not
>> available in PostgreSQL v8.2.22.

There's a checkpoint_warning option. Set it to 3600 and you should get messages in the log. Correlate those to the issues (do they happen at the same time?). Sadly, 8.2 doesn't have any of the nice statistics views :-(

Check this: http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

It talks about improvements in 8.3 but it mentions older versions too.

If you can, install iotop and watch the processes that cause the I/O. IIRC, the title of the process should say 'CHECKPOINT' etc. But if the issues disappear once the application is stopped, it's unlikely the checkpoints are the issue.

What we need is more details about your setup, especially
- checkpoint_segments
- checkpoint_timeout
- shared_buffers

also it'd be nice to have samples from the vmstat/iostat and messages from the log.

kind regards
Tomas
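Once checkpoint_warning is raised high enough to log every checkpoint, something along these lines (a hypothetical helper, not part of PostgreSQL) can pull the warning timestamps out of the server log so they can be lined up against the times the web app was slow. It assumes log_line_prefix puts a timestamp at the start of each line:

```python
import re
from datetime import datetime

# Matches a leading "YYYY-MM-DD HH:MM:SS" timestamp followed by the
# checkpoint warning text that PostgreSQL logs.
WARNING = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*"
    r"checkpoints are occurring too frequently \((\d+) seconds apart\)")

def checkpoint_warnings(log_lines):
    """Return (timestamp, seconds_apart) for each checkpoint warning."""
    hits = []
    for line in log_lines:
        m = WARNING.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            hits.append((ts, int(m.group(2))))
    return hits

sample = [
    '2012-03-22 09:31:02 IST LOG:  checkpoints are occurring too frequently (25 seconds apart)',
    '2012-03-22 09:31:02 IST HINT:  Consider increasing the configuration parameter "checkpoint_segments".',
]
print(checkpoint_warnings(sample))
```

If the timestamps cluster around the reported slowdowns, checkpoints are a plausible culprit; if the warnings never appear while the app is slow, they probably are not.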
> There's a checkpoint_warning option. Set it to 3600 and you should get
> messages in the log. Correlate those to the issues (do they happen at the
> same time?).

After setting "checkpoint_warning" to 3600, can you explain how I should correlate it with the messages?

> If you can, install iotop and watch the processes that cause the I/O.

I tried installing "iotop", but it failed to run because it requires Linux >= 2.6.20. Our CentOS5.2 kernel is 2.6.18-8.

> What we need is more details about your setup, especially
> - checkpoint_segments
> - checkpoint_timeout
> - shared_buffers

# - Memory -
shared_buffers = 1536MB

# - Planner Cost Constants -
effective_cache_size = 4GB

# - Checkpoints -
checkpoint_segments = 32
checkpoint_timeout = 5min
checkpoint_warning = 270s

# - Background writer -
bgwriter_delay = 200ms
bgwriter_lru_percent = 1.0
bgwriter_lru_maxpages = 5
bgwriter_all_percent = 0.333
bgwriter_all_maxpages = 5

> also it'd be nice to have samples from the vmstat/iostat and messages from
> the log.

Unfortunately, I don't have exact logs from when the problem actually happened.
On 22 Březen 2012, 11:32, Gnanakumar wrote:
> After setting "checkpoint_warning" to 3600, can you explain on how do I
> correlate with the messages?

Well, you do know when the issues happened, so you can check the logs and see if there are messages at that time. Try to install the application and watch the logs / IO performance.

> # - Memory -
> shared_buffers = 1536MB
>
> # - Planner Cost Constants -
> effective_cache_size = 4GB

So, what else is running on the system? Because if there's 35GB RAM and the shared buffers are 1.5GB, then there's about 33GB for the page cache. Something like 16GB would be a conservative setting for effective_cache_size.

I'm not saying this will fix the issues, but maybe it shows that something else is running on the box and maybe that's the culprit, not PostgreSQL?

> Unfortunately, I don't have "exact" logs when the problem actually
> happened

Then install the app again and collect as much info as possible. Otherwise it's all just wild guesses.

Tomas
> So, what else is running on the system? Because if there's 35GB RAM and
> the shared buffers are 1.5GB, then there's about 33GB for page cache.
> Something like 16GB would be a conservative setting.

Yes, you guessed it right. Both the Web server and the DB server are running on the same machine.

> I'm not saying this will fix the issues, but maybe it shows that something
> else is running on the box and maybe that's the culprit, not PostgreSQL?

Based on my observations and analysis, I'm sure that database write operations *are* causing the slowdown, and not other applications running on the same server.

> Then install the app again and collect as much info as possible. Otherwise
> it's all just wild guesses.

OK.
> There's a checkpoint_warning option. Set it to 3600 and you should get
> messages in the log.

I have a basic question about the "checkpoint_warning" setting. The 8.2 docs (http://www.postgresql.org/docs/8.2/interactive/runtime-config-wal.html) say:

"Write a message to the server log if checkpoints caused by the filling of checkpoint segment files happen closer together than this many seconds (which suggests that checkpoint_segments ought to be raised). The default is 30 seconds (30s)."

How does increasing the default 30s to 3600s (which is 1 hour) cause messages to be printed to the log? Even after reading the description in the docs above, I'm not able to understand this point clearly. Can you help me understand it?
On 22 Březen 2012, 13:38, Gnanakumar wrote:
> How does increasing the default 30s to 3600s (which is 1 hour or 60
> minutes) print messages to the log? Even after reading the description
> from above doc, am not able to get this point clearly. Can you help me in
> understanding this?

A checkpoint may be triggered for two main reasons:

(1) all available WAL segments are filled (you have 32 of them, i.e. 512MB of WAL data)

(2) the checkpoint_timeout runs out (by default 5 minutes, IIRC)

checkpoint_warning emits a warning message whenever a checkpoint happens less than the specified number of seconds after the last one, so setting it high enough should log all checkpoints.

The fact that this does not happen (no warning messages) suggests the problem is not caused by checkpoints. It may be caused by the database, but it seems unlikely it's the checkpoints.

Tomas
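The arithmetic behind trigger (1) is easy to sanity-check: with the default 16MB WAL segment size, the WAL volume between xlog-driven checkpoints is roughly checkpoint_segments × 16MB. (The 100 MB/min write rate below is purely an assumed figure for illustration, not a measurement from this system.)

```python
# Values from the postgresql.conf posted earlier in the thread;
# 16MB is the default WAL segment size in 8.2.
checkpoint_segments = 32
wal_segment_mb = 16

wal_between_checkpoints_mb = checkpoint_segments * wal_segment_mb
print(wal_between_checkpoints_mb)  # 512

# If the batch job generated, say, 100 MB of WAL per minute (assumed),
# an xlog-driven checkpoint would fire roughly every:
assumed_wal_mb_per_min = 100
minutes_between_checkpoints = wal_between_checkpoints_mb / assumed_wal_mb_per_min
print(round(minutes_between_checkpoints, 1))  # 5.1
```

So with checkpoint_timeout = 5min as well, either trigger would produce a checkpoint roughly every five minutes at that write rate, and checkpoint_warning = 3600 would flag every one of them in the log.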
On 22 Březen 2012, 13:10, Gnanakumar wrote:
> Based on my observations and analysis, I'm sure that database write
> operation "is" causing the slowdown, but not because of other applications
> running in the same server.

You haven't provided any clear evidence for such a statement so far. Let's wait for the vmstat/iostat logs etc.

Tomas
"Gnanakumar" <gnanam@zoniac.com> wrote:
> We're running a web-based application powered by PostgreSQL.
> Recently, we've developed a "new" separate Java-based standalone
> (daemon process) threaded program that performs both read and
> write operations heavily on 2 "huge" tables. One table has got
> 5.4 million records and other has 1.3 million records. Moreover,
> more than one read and/or write operations may be executing
> concurrently.

We're running a web application using PostgreSQL and Java which has 80 tables with over 1 million records each, the largest of which has 212 million rows. It is updated by replication from 3000 directly attached users at 72 sites, using a multi-threaded Java application. We have one connection pool for the read-only web application, which allows about 30 concurrent requests, and a connection pool for the replication which allows 6.

If you want a peek at our performance, you can access the site here: http://wcca.wicourts.gov/ -- if you view a case and click on the "Court Record Events" button, you'll be viewing records in the table with 212 million rows.

My point is that you're not asking PostgreSQL to do anything it *can't* handle well.

> The issue that we're facing currently in our Production server is,
> whenever this "newly" developed Java program is started/run, then
> immediately the entire web application becomes very slow in
> response. At this time, I could also see from the output of
> "iostat -tx" that "%util" is even crossing more than 80%.

How are you handling concurrency? (Are you using FOR SHARE on your SELECT statements? Are you explicitly acquiring table locks before modifying data? Etc.) You might be introducing blocking somehow.
When things are slow, try running some of the queries shown on this page to get more clues:

http://wiki.postgresql.org/wiki/Lock_Monitoring

In particular, I recommend that you *never* leave transactions open or hold locks while waiting for user response or input. Users *will* answer phone calls or go to lunch with things pending, potentially blocking other users for extended periods.

> Meantime, I also read about "checkpoint spikes" could be a reason
> for slow down in "write workload" database.

When you hit that issue, there is not a continual slowdown -- queries which normally run very fast (a small fraction of a second) may periodically all take tens of seconds. Is that the pattern you're seeing?

> We're running PostgreSQL v8.2.22 on CentOS5.2 having 35 GB RAM.
> "log_checkpoints" is not available in PostgreSQL v8.2.22.
>
> We want to optimize our Production database to handle both reads
> and writes, any suggestions/advice/guidelines on this are highly
> appreciated.

Support for 8.2 was dropped last year, five years after it was released. PostgreSQL has had a new major release every year since 8.2 came out, many of which have provided dramatic performance improvements. If you want good performance, my first suggestion would be to upgrade your version of PostgreSQL to at least 9.0, and preferably 9.1. Because of stricter typing in 8.3 and later, upgrading from 8.2 takes a bit more work than most PostgreSQL major releases. Be sure to test well.

> # - Background writer -
> bgwriter_delay = 200ms
> bgwriter_lru_percent = 1.0
> bgwriter_lru_maxpages = 5
> bgwriter_all_percent = 0.333
> bgwriter_all_maxpages = 5

These settings result in a very slow dribble of dirty buffers out to the OS cache. *If* you're hitting the "checkpoint spikes" issue (see above), you might want to boost the aggressiveness of the background writer. I couldn't recommend settings without knowing a lot more about your storage system and its capabilities.
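For reference, a more aggressive 8.2-era background writer configuration might look something like the following. The values are illustrative assumptions only, not recommendations for this system -- as noted above, sensible numbers depend on the storage hardware:

```
# postgresql.conf (8.2 syntax) -- illustrative values only
bgwriter_delay = 200ms        # how often the bgwriter wakes up
bgwriter_lru_percent = 5.0    # scan more of the recently-used buffers per round
bgwriter_lru_maxpages = 100   # allow far more than 5 page writes per round
bgwriter_all_percent = 1.0    # sweep the whole buffer pool faster between checkpoints
bgwriter_all_maxpages = 200
```

The idea is to spread dirty-buffer writes out over time so that less work piles up for each checkpoint, at the cost of some extra steady-state I/O.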
In supported releases of PostgreSQL, the checkpoint system and background writer are much improved, so again -- upgrading would be the most effective way to solve the problem.

Besides the outdated PostgreSQL release and possible blocking, I would be concerned if you are using any sort of ORM for the update application. You want to watch that very closely, because the default behavior of many of them does not scale well. There's usually a way to get better performance through configuration and/or by bypassing automatic query generation for complex data requests.

-Kevin
On Thu, Mar 22, 2012 at 10:13 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> In particular, I recommend that you *never* leave transactions open
> or hold locks while waiting for user response or input. They *will*
> answer phone calls or go to lunch with things pending, potentially
> blocking other users for extended periods.

This -- transactions aren't meant to intentionally block users, but to allow you to ensure data remains in a valid state in the face of concurrent activity. In-transaction locks should be held as briefly as possible while still giving those guarantees, and should hopefully never be user-facing.

If you want user-controlled locks, they should be cooperative and non-transactional -- advisory locks:

http://www.postgresql.org/docs/current/static/explicit-locking.html#ADVISORY-LOCKS

merlin
On 22/03/12 20:27, Gnanakumar wrote:
> The issue that we're facing currently in our Production server is, whenever
> this "newly" developed Java program is started/run, then immediately the
> entire web application becomes very slow in response.

I'd recommend taking a hard look at what the Java app is doing. You might be able to get useful data by adding:

log_min_duration_statement = 10000   # log queries taking longer than 10 sec

to your postgresql.conf.

I'd *guess* that locking is not the issue - as that is unlikely to cause high IO load (although checking for lots of locks with granted = f in pg_locks might be worthwhile, just to rule it out).

You could also try running the Java app in one of your development environments to see if you can provoke the high-load behaviour in a more easily studied environment.

regards
Mark
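The pg_locks check Mark mentions can be done with a query along these lines (a sketch only; it joins against pg_stat_activity using the `procpid`/`current_query` column names as they existed in 8.2-era releases):

```sql
-- List lock requests that are still waiting (granted = false),
-- together with the blocked backend's current query.
SELECT l.locktype, l.relation::regclass, l.mode, l.pid,
       a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE NOT l.granted;
```

If this returns rows while the web app is slow, blocking is in play; if it's consistently empty, the slowdown is more likely raw I/O saturation.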
First off, thank you *so much* for that detailed explanation comparing with a real production application's performance, which was really enlightening for me.

> How are you handling concurrency? (Are you using FOR SHARE on your
> SELECT statements? Are you explicitly acquiring table locks before
> modifying data? Etc.) You might be introducing blocking somehow.

No, actually I'm not explicitly locking any tables -- they're all *simple* SELECT, UPDATE and INSERT statements only.

> In particular, I recommend that you *never* leave transactions open
> or hold locks while waiting for user response or input.

Again, we're not leaving any transactions open while waiting for user responses, etc.

> When you hit that issue, there is not a continual slowdown --
> queries which normally run very fast (a small fraction of a second)
> may periodically all take tens of seconds. Is that the pattern
> you're seeing?

Yes, you're correct. Queries that normally run fast become slow at the time of this slowdown.

> Besides the outdated PostgreSQL release and possible blocking, I
> would be concerned if you are using any sort of ORM for the update
> application. You want to watch that very closely because the
> default behavior of many of them does not scale well. There's
> usually a way to get better performance through configuration and/or
> bypassing automatic query generation for complex data requests.

I'm not able to understand the statements above ("...any sort of ORM for the update application...") clearly. Can you help me understand this?
On 23 Březen 2012, 11:10, Gnanakumar wrote:
> No, actually am not explicitly locking any tables -- all are *simple*
> select, update, insert statements only.

Are those wrapped in a transaction or not? Each transaction forces an fsync when committing, and if each of those INSERT/UPDATE statements stands on its own it may cause a lot of I/O.

> Am not able to understand above statements (...any sort of ORM for the
> update application ...) clearly. Can you help me in understanding this?

There are tools that claim to remove the object vs. relational discrepancy when accessing the database. They often generate queries on the fly, and some of the queries are pretty awful (it depends on how well the ORM model is defined). There are various reasons why this may suck - loading too much data, using lazy fetch everywhere, etc.

Are you using something like Hibernate, JPA, ... to handle persistence?

Tomas
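The per-commit fsync cost Tomas describes is generic to transactional databases. As an illustration only -- using Python's built-in sqlite3 as a stand-in, since the thread's actual code is a Java application talking to PostgreSQL -- the difference between committing every statement and batching many statements into one transaction looks like this:

```python
import sqlite3

# Autocommit mode: we manage transactions explicitly with BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t(v INTEGER)")

rows = list(range(5000))

def insert_per_statement(conn, rows):
    # Worst case: every INSERT is its own transaction, i.e. one commit
    # (and, on a real server, one WAL fsync) per row.
    for r in rows:
        conn.execute("INSERT INTO t(v) VALUES (?)", (r,))

def insert_batched(conn, rows, batch=1000):
    # Group many INSERTs into one transaction: one commit per batch
    # instead of one per row.
    for i in range(0, len(rows), batch):
        conn.execute("BEGIN")
        conn.executemany("INSERT INTO t(v) VALUES (?)",
                         [(r,) for r in rows[i:i + batch]])
        conn.execute("COMMIT")

insert_per_statement(conn, rows[:500])
insert_batched(conn, rows[500:])
count = conn.execute("SELECT count(*) FROM t").fetchone()[0]
print(count)  # 5000
```

In JDBC terms the batched variant corresponds to `setAutoCommit(false)` plus a periodic `commit()`; the batch size trades off I/O efficiency against how much work is lost on a rollback.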
> Are those wrapped in a transaction or not? Each transaction forces an fsync
> when committing, and if each of those INSERT/UPDATE statements stands on
> its own it may cause a lot of I/O.

Yes, it's wrapped inside a transaction. Maybe this could be a reason for the slowdown, as you've highlighted here. At least we've got some guidance on troubleshooting this aspect as well.

> There are tools that claim to remove the object vs. relational discrepancy
> when accessing the database. ...

Thanks for the clarification.

> Are you using something like Hibernate, JPA, ... to handle persistence?

No, we're not using any persistence frameworks/libraries as such.
"Gnanakumar" <gnanam@zoniac.com> wrote:
>> When you hit that issue, there is not a continual slowdown --
>> queries which normally run very fast (a small fraction of a
>> second) may periodically all take tens of seconds. Is that the
>> pattern you're seeing?
>
> Yes, you're correct. Queries those normally run fast are becoming
> slow at the time of this slowdown.

But the question is -- while the update application is running, is performance *usually* good with *brief periods* of high latency, or does it just get bad and stay bad? The *pattern* is the clue as to whether it is likely to be write saturation.

Here's something I would recommend as a diagnostic step: run `vmstat 1` (or equivalent, based on your OS) to capture about a minute's worth of activity while things are running well, and also while things are slow. Pick a few lines that are "typical" of each and paste them into a post here. (If there is a lot of variation over the sample, it may be best to attach them to your post in their entirety. Don't paste more than a few lines of vmstat output inline, as the wrapping would make it hard to read.)

Also, you should try running the queries from this page when things are slow:

http://wiki.postgresql.org/wiki/Lock_Monitoring

If there is any blocking, that might be interesting.

-Kevin
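To make the "typical lines" comparison easier, a throwaway script (purely a hypothetical helper, not part of the thread) can average the CPU columns of a captured `vmstat 1` log, so a "good" sample and a "slow" sample boil down to a few numbers -- a high `wa` (I/O wait) average points at disk saturation:

```python
def summarize_vmstat(text):
    """Average the us/sy/id/wa CPU columns of captured `vmstat 1` output."""
    lines = [l.split() for l in text.strip().splitlines()]
    # The column-name row is the one containing the "wa" token.
    header = next(l for l in lines if "wa" in l)
    # Data rows have the same number of fields and start with a number.
    data = [l for l in lines
            if len(l) == len(header) and l[0].isdigit()]
    return {col: sum(int(row[header.index(col)]) for row in data) / len(data)
            for col in ("us", "sy", "id", "wa")}

sample = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 123456  78901 234567    0    0    10    20  100  200  5  2 90  3  0
 0  4      0 123000  78901 234567    0    0  5000  9000  900 1500 10  5 25 60  0
"""
print(summarize_vmstat(sample))  # wa averages 31.5 for this sample
```

The exact column layout varies slightly between vmstat versions, so the header-driven indexing above is safer than hard-coded positions.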