Thread: I/O increase after upgrading to 8.3.5
After upgrading from 8.2 to 8.3.5, the write load on our database server has increased dramatically and inexplicably -- as has the CPU usage. Here's a Munin graph of iostat showing the sudden increase in blocks written/sec:

http://purefiction.net/paste/Totals-iostatwrite1-week.png

We expected the upgrade to give us better, not worse, performance. How do I diagnose this problem?

The application code has not changed, and neither have the usage patterns. The number of tuple inserts/updates/deletes per second has not deviated from normal. However, CPU usage has increased by around 30%, and the load average has increased similarly. The other performance metrics I monitor, such as transaction load, HTTP traffic, etc., show everything else as either normal or slightly decreased, as you would expect when the database server slows down.

The upgrade was done with dump/restore using "pg_dump -Fc". The old database lived on a SAN volume, whereas the new database lives on a local disk volume. The OS is Linux 2.6.24 on Intel (x86_64).

Alexander.
>>> Alexander Staubo <alex@bengler.no> wrote:
> After upgrading from 8.2 to 8.3.5, the write load on our database
> server has increased dramatically and inexplicably -- as has the CPU
> usage.

Did you do a VACUUM ANALYZE of the database after loading it? Without the database VACUUM, the first read of any page causes it to be rewritten to set hint bits. Without an ANALYZE, it might be picking very inefficient plans.

I actually run a VACUUM FREEZE ANALYZE after loading a database (as with the upgrade), to prevent massive rewrites of everything in the database at some later date due to freeze activity.

It could be something else, but I've seen similar behavior for the above reasons.

-Kevin
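Kevin's post-restore step corresponds to something like the following, run once in each restored database (a minimal sketch; the option combination is valid at least as far back as 8.3):

```sql
-- Scan every table once: set hint bits and freeze tuple xmins now,
-- instead of paying for page rewrites during later reads, and collect
-- planner statistics in the same pass.
VACUUM FREEZE ANALYZE;
```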
On Fri, Feb 13, 2009 at 3:46 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
>>>> Alexander Staubo <alex@bengler.no> wrote:
>> After upgrading from 8.2 to 8.3.5, the write load on our database
>> server has increased dramatically and inexplicably -- as has the CPU
>> usage.
>
> Did you do a VACUUM ANALYZE of the database after loading it? Without
> the database VACUUM, the first read of any page causes it to be
> rewritten to set hint bits. Without an ANALYZE, it might be picking
> very inefficient plans. I actually run a VACUUM FREEZE ANALYZE after
> loading a database (as with the upgrade), to prevent massive rewrites
> of everything in the database at some later date due to freeze
> activity.

Thanks, the lack of statistics should explain why things are a bit slow. I ran a "vacuum freeze analyze" now, but the I/O level is still quite high.

I have verified using pidstat that the I/O is all caused by PostgreSQL. Here's some sample output from iostat, interval 1 second (the 4th column is KB written):

sda    1173.00      68.00    149672.00       68     149672
sda      14.00       0.00      1712.00        0       1712
sda       2.00       0.00       336.00        0        336
sda     679.00     344.00    115200.00      344     115200
sda     238.00       0.00     61764.00        0      61764
sda     436.00       0.00     95004.00        0      95004
sda      14.00       0.00      1032.00        0       1032
sda    1882.00      72.00     82380.00       72      82380
sda     173.00       8.00      7936.00        8       7936

What I find odd is that PostgreSQL is only clocking about 8 tuple modifications per second on average (on our user tables). There are around 800 transactions per second, but most of these are only reads. How can 8 tuples/sec result in 115 MB of writes per second? The database is not large enough, and the tuples not large enough, to explain those huge amounts of data. At 115 MB/s you could rewrite the entire database in 1.5 minutes.

Is there any way to determine the *sizes* of the tuple mods, not just the frequency, that PostgreSQL performs? What kinds of tools are available to track down the causes of these writes?

Alexander.
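The back-of-envelope reasoning above can be made explicit with a quick calculation (the 8 tuples/sec and ~115 MB/s figures are taken from the message; the conclusion is only that tuple traffic alone cannot explain the bandwidth):

```python
# Sanity check: how large would each tuple modification have to be
# to account for the observed write bandwidth by itself?
tuple_mods_per_sec = 8                    # from pg_stat_user_tables deltas
write_bytes_per_sec = 115 * 1024 * 1024   # ~115 MB/s observed via iostat

implied_bytes_per_mod = write_bytes_per_sec / tuple_mods_per_sec
print(f"{implied_bytes_per_mod / (1024 * 1024):.1f} MiB per tuple modification")
# ~14.4 MiB per modification -- far larger than any plausible row, so the
# writes must come from elsewhere (hint bits, checkpoints, temp files).
```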
On Fri, Feb 13, 2009 at 12:53 PM, Alexander Staubo <alex@bengler.no> wrote:
> The upgrade was done with dump/restore using "pg_dump -Fc". The old
> database lived on a SAN volume, whereas the new database lives on a
> local disk volume.

I need to correct myself: the Munin graphs were never set to track the SAN volume where the old database lived. So when the graph goes from "near-zero" to "lots", it's actually correct.

When I compare against the correct graph, however, it's apparent that I/O writes have, on average, doubled. The new volume uses the same file system and block size as the old one.

Alexander.
>>> Alexander Staubo <alex@bengler.no> wrote:
> When I compare against the correct graph, however, it's apparent that
> I/O writes have, on average, doubled.

Could you show the non-commented lines from the old and new postgresql.conf files, please?

-Kevin
On Fri, Feb 13, 2009 at 5:17 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Could you show the non-commented lines from the old and new
> postgresql.conf files, please?

Attached. The differences are not performance-related, as far as I can see, aside from the addition of "synchronous_commit = off".

Alexander.
Attachment
>>> Alexander Staubo <alex@bengler.no> wrote:
> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>> Could you show the non-commented lines from the old and new
>> postgresql.conf files, please?
>
> Attached. The differences are not performance-related, as far as I
> can see, aside from the addition of "synchronous_commit = off".

You should definitely set effective_cache_size.

If you still see the problem after that, I suggest testing different settings for:

bgwriter_lru_maxpages
bgwriter_lru_multiplier
checkpoint_segments
checkpoint_timeout
checkpoint_completion_target

Both the checkpoint process and the background writer changed quite a bit between releases, and these are the main knobs for tuning the new version.

It's possible that under 8.2 your hint bit writes were being combined with other writes due to caching, and differences in the timings now have the hint bit writes happening too long after the initial write to get that benefit. If that's the case, the counterintuitive step of making PostgreSQL more aggressive about writing to the OS cache might reduce your disk write I/O. Making it less aggressive might allow them to combine in the PostgreSQL buffers before making it to the OS cache.

-Kevin
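The knob list above maps onto a postgresql.conf fragment along these lines (the values shown are illustrative starting points around the 8.3 defaults, not recommendations from this thread):

```
# Background writer: dirty shared buffers written per round, and how far
# ahead of predicted buffer demand to clean.
bgwriter_lru_maxpages = 100
bgwriter_lru_multiplier = 2.0

# Checkpoints: allow more WAL between checkpoints, and spread the
# resulting buffer writes over most of the checkpoint interval.
checkpoint_segments = 16
checkpoint_timeout = 5min
checkpoint_completion_target = 0.7
```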
Alexander Staubo wrote:
> On Fri, Feb 13, 2009 at 12:53 PM, Alexander Staubo <alex@bengler.no> wrote:
>> The upgrade was done with dump/restore using "pg_dump -Fc". The old
>> database lived on a SAN volume, whereas the new database lives on a
>> local disk volume.
>
> I need to correct myself: The Munin graphs were never set to track the
> SAN volume where the old database lived. So when the graph goes from
> "near-zero" to "lots", it's actually correct.
>
> When I compare against the correct graph, however, it's apparent that
> I/O writes have, on average, doubled.

Is there any chance you had pg_xlog stored separately on your old database, and I/O for it wasn't being recorded?

--
Craig Ringer
On Fri, Feb 13, 2009 at 6:35 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> You should definitely set effective_cache_size.

Wow -- I set this to 10GB (5GB for shared buffers + another 5GB for cache), and today's average write frequency went from 20MB/sec to just 1MB/sec. The documentation suggests that effective_cache_size is only used for query planning in conjunction with indexes. So how come it affects write access?

> If you still see the problem after that, I suggest testing different
> settings for:
>
> bgwriter_lru_maxpages
> bgwriter_lru_multiplier
> checkpoint_segments
> checkpoint_timeout
> checkpoint_completion_target
>
> Both the checkpoint process and the background writer changed quite a
> bit, and these are the main knobs for tuning the new version.

We are hoping to set up a duplicate instance and play back the SQL log against it so we can experiment with different settings. Until we have such a setup, I'm not sure what to do with the knobs other than frob them wildly. :-)

Are there any statistics, either in PostgreSQL proper or in the OS, that I can use as metrics to guide the tuning? For example, is there anything in pg_stat_bgwriter that can help me tune the bgwriter_lru_* settings?

Do transactions that only contain query statements end up writing entries to the WAL when they commit? If so, can we avoid the writes by wrapping our queries in "read only" transactions, and would it be worth the effort? Our web application is handling 30 requests per second at peak time; each request performs dozens of queries in autocommit mode, i.e. one transaction per query. Only a minority of those requests actually end up modifying the database. PostgreSQL is committing and fsyncing 600-800 transactions per second, so that's probably a good chunk of disk/CPU usage wasted, right?

Alexander.
On Sat, Feb 14, 2009 at 9:49 AM, Craig Ringer <craig@postnewspapers.com.au> wrote: > Is there any chance you had pg_xlog stored separately on your old database, > and I/O for it wasn't being recorded? No, the database files have always been on a single volume. Alexander.
Alexander Staubo <alex@bengler.no> writes:
> <Kevin.Grittner@wicourts.gov> wrote:
>> You should definitely set effective_cache_size.

> Wow -- I set this to 10GB (5GB for shared buffers + another 5GB for
> cache), and today's average write frequency went from 20MB/sec to just
> 1MB/sec. The documentation suggests that effective_cache_size is only
> used for query planning in conjunction with indexes. So how come it
> affects write access?

It *is* only used for query planning. A plausible theory is that you caused some join queries to change from a hash or merge join involving a temporary hash or sort file to an index nestloop that doesn't use any temporary storage. If the required hash or sort storage exceeded work_mem, which you have set to just 10MB, that would have created some write traffic.

Did you happen to notice whether your queries got faster or slower when you did this? Watching only aggregate write traffic is a pretty limited view of what is happening in your database.

regards, tom lane
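This theory can be checked query by query with EXPLAIN ANALYZE, since 8.3 reports whether a sort spilled to disk (the table and column names here are made up for illustration):

```sql
SET work_mem = '10MB';

EXPLAIN ANALYZE
SELECT * FROM events ORDER BY created_at;
-- A line such as "Sort Method: external merge  Disk: 52992kB" in the
-- output means the sort exceeded work_mem and wrote temporary files;
-- "Sort Method: quicksort  Memory: ..." means it stayed in memory.
```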
On Sat, Feb 14, 2009 at 8:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Staubo <alex@bengler.no> writes:
>> Wow -- I set this to 10GB (5GB for shared buffers + another 5GB for
>> cache), and today's average write frequency went from 20MB/sec to just
>> 1MB/sec. The documentation suggests that effective_cache_size is only
>> used for query planning in conjunction with indexes. So how come it
>> affects write access?
>
> It *is* only used for query planning. A plausible theory is that you
> caused some join queries to change from hash or merge joining involving
> a temporary hash or sort file to an index nestloop that doesn't use any
> temporary storage. If the required hash or sort storage exceeded
> work_mem, which you have set to just 10MB, that would have created some
> write traffic.

Interesting. Is there any statistic available that can tell me whether work_mem is being exceeded? The tools for monitoring exactly what PostgreSQL is doing -- especially on a production box -- are fairly limited, especially since Linux does not yet have anything close to DTrace in functionality.

> Did you happen to notice whether your queries got faster or slower when
> you did this? Watching only aggregate write traffic is a pretty limited
> view of what is happening in your database.

Unfortunately we don't log SQL queries at the moment. We do log application response times, and they look roughly the same as before the change. I could revert the effective_cache_size setting, turn on SQL logging for a while, then reapply the change and compare.

Alexander.
On Sat, 14 Feb 2009, Alexander Staubo wrote:
> Are there any statistics, either in PostgreSQL proper or in the OS, that
> I can use as metrics to guide the tuning? For example, is there anything
> in pg_stat_bgwriter that can help me tune the bgwriter_lru_* settings?

http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm goes over this topic, with "Appendix B: pg_stat_bgwriter sample analysis" covering a look at what to do based on a pg_stat_bgwriter snapshot.

> Do transactions that only contain query statements end up writing
> entries to the WAL when they commit?

You need a transaction's XID before you can write to the WAL, and quoting from the transaction management docs: "Transactions and subtransactions are assigned permanent XIDs only when/if they first do something that requires one --- typically, insert/update/delete a tuple"

The main thing that will cause writes even when reading is hint bit updates, which have been mentioned here already. http://wiki.postgresql.org/wiki/Hint_Bits has an intro to that topic.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
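The snapshot that the appendix analyzes is a plain query against the 8.3 statistics view; taking two snapshots a fixed interval apart shows by which path dirty buffers are leaving shared memory:

```sql
-- Writes attributed to checkpoints, the background writer's LRU scan,
-- and ordinary backends, plus total buffer allocations, since the last
-- statistics reset.
SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, maxwritten_clean,
       buffers_backend, buffers_alloc
FROM pg_stat_bgwriter;
```

A rising buffers_backend relative to buffers_clean suggests the background writer is not keeping ahead of demand, which is what the bgwriter_lru_* settings adjust.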
On Sat, 14 Feb 2009, Alexander Staubo wrote:
> Is there any statistic available that can tell me whether work_mem is
> being exceeded?

As of 8.3, log_temp_files puts information about them into your logs; see http://www.postgresql.org/docs/current/static/runtime-config-logging.html

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
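In postgresql.conf that is a one-line setting; the value is a size threshold in kilobytes (the value shown logs every temporary file, which can be verbose on a busy server):

```
# -1 disables logging, 0 logs all temporary files, a positive value logs
# only files at least that many kilobytes in size.
log_temp_files = 0
```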
On Sun, Feb 15, 2009 at 6:35 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm goes over
> this topic, with "Appendix B: pg_stat_bgwriter sample analysis" covering a
> look at what to do based on a pg_stat_bgwriter snapshot.

Wonderful, thank you.

Alexander.