Thread: significant vacuum issues - looking for suggestions
Hi List;

I've just started working with a client that has been running Postgres (with
no DBA) for a few years. They're running version 8.1.4 on 4-way Dell boxes
with 4Gig of memory on each box, attached to RAID-10 disk arrays.

Some of their key config settings are here:
  shared_buffers = 20480
  work_mem = 16384
  maintenance_work_mem = 32758
  wal_buffers = 24
  checkpoint_segments = 32
  checkpoint_timeout = 300
  checkpoint_warning = 30
  effective_cache_size = 524288
  autovacuum = on
  autovacuum_naptime = 60
  autovacuum_vacuum_threshold = 500
  autovacuum_analyze_threshold = 250
  autovacuum_vacuum_scale_factor = 0.2
  autovacuum_analyze_scale_factor = 0.1
  autovacuum_vacuum_cost_delay = -1
  autovacuum_vacuum_cost_limit = -1

Currently I've found that they have approx 17 tables that show a significant
amount of bloat, the worst one showing over 5 million pages worth of dead
space. One of the problems is that their system is so busy with activity
during the day and massive data load processes at night that they often kill
the pid of vacuum processes kicked off by autovacuum, because the overall
load impact prevents users from logging into the app: the login process
includes at least one db query, which then seems to hang because there are
anywhere from 100 - 300 queries ahead of it at any given time. Normally a
user gets logged in with an avg wait of 5 - 10 seconds, but when a
long-running vacuum is going (sometimes alongside a long-running update
process that's trying to sort/update > 40 million rows) the system gets to a
state where the login queries never get executed until the vacuum process is
killed.

As a result of this I believe that the biggest table (the one with > 5
million pages worth of dead space) has never been vacuumed to completion. I
suspect this is the case for a few of the other top dead-space tables as
well, but I can't be sure.

My first priority was to get this vacuum scenario cleaned up. First off I
added the biggest table into pg_autovacuum and set the enabled column to
false ('f'). Then I set vacuum_cost_delay to 10 and in the same session ran
"vacuum analyze verbose big_table". This ran for 7.5 hours before we had to
kill it due to system load - and to make matters worse, the high system load
was forcing many of the nightly batch queries that load, update, etc. the
data to stack up to the point where the system was at less than 2% idle
(CPU) for the next 4 hours and barely responding to the command line.

To make matters worse, I found out this morning that the db has used 85% of
its transaction IDs - again, because a vacuum of the entire db has never
been completed.

As far as I can tell, the overall db size is currently 199G, of which approx
104G seems to be valid data.

Here are my thoughts on how to proceed:
=====================================
1) Fix the big table ASAP (probably over the weekend), since it's not only
   the biggest table but the most active, like this:

   a) run a pg_dump of this table

   b) restore this dump into a new table (i.e. new_big_table)

   c) lock the original big_table, then sync any changes, inserts, and
      deletes made since the dump from big_table into new_big_table

   d) drop big_table

   e) rename new_big_table to big_table

* I may run through this for a few of the other large, highly active tables
that have minimal page density as well.
=====================================

The development folks that have been here a while tell me that it seems like
when they have a query (not limited to vacuum processes) that has been
running for a long time (i.e. > 5 or 6 hours), the query sort of "goes
crazy" and the entire system gets pegged until they kill that process. I've
not heard of this, but I suspect upgrading to 8.2.4 is probably a good plan
at this point as well, so for step 2 I'll do this:

=====================================
2) (Obviously I'll do this in dev first, then in QA and finally in prod.)

   a) install version 8.2.4 from source, leaving 8.1.4 in place

   b) create the new 8.2.4 cluster on a new port

   c) set up WAL archiving on the 8.1.4 cluster

   d) do a full dump of the 8.1.4 cluster and restore it to the new 8.2.4
      cluster

   e) stop the 8.2.4 cluster and bring it up in recovery mode, pointing it
      to the directory where we're archiving the 8.1.4 cluster's WAL
      segments

   f) once caught up, bring both clusters down

   g) copy any final files from the 8.1.4 cluster's pg_xlog directory into
      the new 8.2.4 pg_xlog dir (is this ok, since I'm moving 8.1.4-version
      tx logs into an 8.2.4 xlog dir?)

   h) change the port on the 8.2.4 cluster to what the original 8.1.4
      cluster port was

   i) bring up the new 8.2.4 system, and actively manage the vacuum needs
      moving forward via a combination of autovacuum, cron processes for
      specified table vacuums (daily, hourly, 15min, 5min, etc.), and
      as-needed interactive session vacuums
=====================================

The source-based install will allow me to set up a robust upgrade CM process
capable of supporting multiple concurrent versions on a server if needed,
the ability to quickly revert to a previous version, etc. However, this is a
discussion for another day - I only mention it in case the question "why not
just use RPMs?" arises...

So here are my questions:

1) Does this sound like a good plan?

2) Are there other steps I should be taking, or other issues I should be
   concerned about short-term?

3) Does anyone have any additional advice for managing either this initial
   mess, or the system(s) long term?

Thanks in advance...

/Kevin
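The 85% transaction-ID figure above can be double-checked from the catalogs.
A minimal sketch, assuming the database is named proddb (the hard wraparound
limit is roughly 2 billion XIDs, so 85% used corresponds to an age of about
1.7 billion):

    # per-database transaction ID age on the 8.1 cluster
    psql -d proddb -c "SELECT datname, age(datfrozenxid) AS xid_age
                         FROM pg_database
                        ORDER BY age(datfrozenxid) DESC;"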
In response to Kevin Kempter <kevin@kevinkempterllc.com>:

> Hi List;
>
> I've just started working with a client that has been running Postgres
> (with no DBA) for a few years. They're running version 8.1.4 on 4-way dell
> boxes with 4Gig of memory on each box attached to RAID-10 disk arrays.
>
> Some of their key config settings are here:
> shared_buffers = 20480
> work_mem = 16384
> maintenance_work_mem = 32758

Before you do any of those other things, bump shared_buffers to about 120000
and maintenance_work_mem to 250000 or so -- unless this box has other
applications on it using significant amounts of those 4G of RAM. You may
find that these changes alone are enough to get vacuum to complete. You'll
need to restart the server for the shared_buffers setting to take effect.

Can you do a pg_relation_size() on the tables in question?

> [remainder of original message snipped]
--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/
wmoran@collaborativefusion.com
Phone: 412-422-3463x4023
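For reference, a rough sketch of what that suggestion amounts to on a 4 GB
box (assuming nothing else on it needs much memory; "proddb" is a
placeholder database name):

    # postgresql.conf (shared_buffers requires a restart):
    #   shared_buffers       = 120000   # 8 kB buffers, roughly 940 MB
    #   maintenance_work_mem = 250000   # in kB, roughly 244 MB

    # relation sizes, largest first, to answer the pg_relation_size() question
    psql -d proddb -c "SELECT relname, pg_size_pretty(pg_relation_size(oid))
                         FROM pg_class
                        WHERE relkind = 'r'
                        ORDER BY pg_relation_size(oid) DESC
                        LIMIT 20;"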
>>> On Fri, Aug 24, 2007 at 2:57 PM, in message
<200708241357.23629.kevin@kevinkempterllc.com>, Kevin Kempter
<kevin@kevinkempterllc.com> wrote:

> c) set up WAL archiving on the 8.1.4 cluster
>
> d) do a full dump of the 8.1.4 cluster and restore it to the new 8.2.4
> cluster
>
> e) stop the 8.2.4 cluster and bring it up in recovery mode, pointing it to
> the directory where we're archiving the 8.1.4 cluster's WAL segments.

You can't use these techniques for a major version upgrade.
Use pg_dump piped to psql. That will also eliminate all bloat.

-Kevin
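A minimal sketch of the pg_dump-piped-to-psql upgrade (the port numbers and
database name are assumptions; the usual advice is to run the newer 8.2.4
client binaries against the old server):

    # old 8.1.4 cluster on port 5432, new 8.2.4 cluster on port 5433
    pg_dumpall -p 5432 -g | psql -p 5433 -d postgres     # roles and globals
    createdb -p 5433 proddb
    pg_dump -p 5432 proddb | psql -p 5433 -d proddb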
In response to "Kevin Grittner" <Kevin.Grittner@wicourts.gov>:

> >>> On Fri, Aug 24, 2007 at 2:57 PM, in message
> <200708241357.23629.kevin@kevinkempterllc.com>, Kevin Kempter
> <kevin@kevinkempterllc.com> wrote:
> > c) set up WAL archiving on the 8.1.4 cluster
> >
> > d) do a full dump of the 8.1.4 cluster and restore it to the new 8.2.4
> > cluster
> >
> > e) stop the 8.2.4 cluster and bring it up in recovery mode, pointing it
> > to the directory where we're archiving the 8.1.4 cluster's WAL segments.
>
> You can't use these techniques for a major version upgrade.
> Use pg_dump piped to psql. That will also eliminate all bloat.

If you can't afford any downtime, you may be able to use Slony to do your
upgrade. However, Slony adds overhead, and if this system is tapped out
already, it may not tolerate the additional overhead.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/
wmoran@collaborativefusion.com
Phone: 412-422-3463x4023
Kevin Kempter <kevin@kevinkempterllc.com> writes:
> The development folks that have been here awhile tell me that it seems like
> when they have a query (not limited to vacuum processes) that has been
> running for a long time (i.e. > 5 or 6 hours) that the query sort of "goes
> crazy" and the entire system gets pegged until they kill that
> process. - I've not heard of this

Me either, but I wonder whether their queries are tickling some memory leak.
I could imagine that what they are seeing is the backend process growing
slowly until it starts to swap, and then continuing to grow and needing more
and more swap activity. Once you get over the knee of that curve, things get
real bad real fast. It might not be a bad idea to run the postmaster under a
(carefully chosen) ulimit setting to cut such things off before the system
starts swapping. Other things to look at:

* What exactly gets "pegged" --- is it CPU or I/O bound? Watching "vmstat 1"
is usually a good diagnostic since you can see CPU, swap, and regular disk
I/O activity at once.

* Is there really not any pattern to the queries that cause the problem?
I don't think 8.1.4 has any widespread leakage problem, but they might be
tickling something isolated, in which case 8.2 is not necessarily gonna fix
it. If you can produce a test case showing this behavior it'd be time to
call in pgsql-hackers.

Your other points seem pretty well covered by other replies.

			regards, tom lane
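A sketch of both suggestions, assuming the postmaster is started from a
shell script; the data directory path and the limit value are placeholders
and would need tuning for this workload:

    # in the postmaster start script: cap per-backend address space so a
    # runaway query fails instead of pushing the box into heavy swapping
    ulimit -v 1048576                     # kB, i.e. roughly 1 GB per process
    pg_ctl -D /var/lib/pgsql/data start

    # while a problem query runs, watch CPU, swap, and disk I/O together
    vmstat 1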
On Friday 24 August 2007 15:39:22 Tom Lane wrote:
> Me either, but I wonder whether their queries are tickling some memory
> leak.
> [snip]
> Your other points seem pretty well covered by other replies.

Thanks everyone for the help. I'll first up the memory settings like Bill
suggested and then see where I'm at. Moving forward I'll see if I have a
test case that I can re-create, plus I may try constraining the postmaster
via a ulimit setting, again based on what I see once the cluster is allowed
to use the memory it should have been given up front.

/Kevin
On Fri, Aug 24, 2007 at 04:41:44PM -0400, Bill Moran wrote:
> In response to Kevin Kempter <kevin@kevinkempterllc.com>:
> > Some of their key config settings are here:
> > shared_buffers = 20480
> > work_mem = 16384
> > maintenance_work_mem = 32758
>
> Before you do any of those other things, bump shared_buffers to about
> 120000 and maintenance_work_mem to 250000 or so -- unless this box has
> other applications on it using significant amounts of those 4G of RAM.
> You may find that these changes alone are enough to get vacuum to
> complete. You'll need to restart the server for the shared_buffers
> setting to take effect.

For the really bloated table, you might need to go even higher than 250000
for maintenance_work_mem. IIRC vacuum needs 6 bytes per dead tuple, so that
means 43M rows... with 5M dead pages, that means less than 10 rows per page,
which is unlikely. Keep in mind that if you do a vacuum verbose, you'll be
able to see if vacuum runs out of maintenance_work_mem, because you'll see
multiple passes through all the indexes.

You could also potentially use this to your benefit. Set
maintenance_work_mem low enough so that vacuum will have to start its
cleaning pass after only an hour or so... depending on how big/bloated the
indexes are on the table, it might take another 2-3 hours to clean
everything. I believe that as soon as you see it start on the indexes a
second time you can kill it... you'll have wasted some work, but more
importantly you'll have actually vacuumed part of the table.

But all of that's a moot point if they're running the default free space map
settings, which are way, way, way too conservative in 8.1. If you've got one
table with 5M dead pages, you probably want to set max_fsm_pages to at least
50000000 as a rough guess, at least until this is under control. Keep in
mind that does equate to 286M of memory, though.

As for your pg_dump idea... why not just do a CREATE TABLE AS SELECT * FROM
bloated_table? That would likely be much faster than messing around with
pg_dump.

What kind of disk hardware is this running on? A good raid 10 array with
write caching should be able to handle a 200G database fairly well; at least
better than it is from what I'm hearing.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
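A sketch of those checks, with placeholder database and table names (the
6-bytes-per-dead-tuple figure is the rule of thumb from this thread):

    # 250000 kB of maintenance_work_mem / 6 bytes per dead tuple is roughly
    # 43M dead tuples per index pass; if each index shows up more than once
    # in the verbose output, vacuum ran out of maintenance_work_mem
    psql -d proddb -c "VACUUM VERBOSE big_table;"

    # the suggested one-table de-bloat alternative to pg_dump
    psql -d proddb -c "CREATE TABLE new_big_table AS SELECT * FROM big_table;"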
>>> Decibel! <decibel@decibel.org> 08/27/07 4:00 PM >>>
> > > They're running version 8.1.4
>
> As for your pg_dump idea... why not just do a CREATE TABLE AS SELECT *
> FROM bloated_table? That would likely be much faster than messing around
> with pg_dump.

He wanted to upgrade to 8.2.4. CREATE TABLE AS won't get him there.

> > > They're running version 8.1.4 on 4-way dell boxes
> > > with 4Gig of memory on each box attached to RAID-10 disk arrays.
>
> What kind of disk hardware is this running on? A good raid 10 array with
> write caching should be able to handle a 200G database fairly well

What other details were you looking for?

-Kevin
On Monday 27 August 2007 15:56:33 Kevin Grittner wrote:
> He wanted to upgrade to 8.2.4. CREATE TABLE AS won't get him there.
> [snip]

I decided to fix this table first - did so by creating a new table, running
an INSERT INTO ... SELECT FROM the original, renaming the original table to
old_XXX, and then renaming the new table to the original table's name. I'll
drop the original table once I'm sure there are no data issues.

I'm planning to set up a new filesystem layout for the box(es) and try to do
a pg_dump | psql for the upgrade.
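A sketch of that swap, with hypothetical object names. Column defaults,
indexes, constraints, and grants from the original table are not carried
over by the copy and have to be recreated on the new table separately, and
the exclusive lock blocks all access to the table while the copy runs:

    # swap the bloated table for a freshly packed copy, in one transaction
    psql -d proddb -c "
        BEGIN;
        LOCK TABLE big_table IN ACCESS EXCLUSIVE MODE;
        CREATE TABLE new_big_table (LIKE big_table INCLUDING DEFAULTS);
        INSERT INTO new_big_table SELECT * FROM big_table;
        ALTER TABLE big_table RENAME TO old_big_table;
        ALTER TABLE new_big_table RENAME TO big_table;
        COMMIT;"
    # keep old_big_table until the data has been verified, then drop it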
On Mon, Aug 27, 2007 at 04:56:33PM -0500, Kevin Grittner wrote:
> > > > They're running version 8.1.4 on 4-way dell boxes
> > > > with 4Gig of memory on each box attached to RAID-10 disk arrays.
> >
> > What kind of disk hardware is this running on? A good raid 10 array with
> > write caching should be able to handle a 200G database fairly well
>
> What other details were you looking for?

How many drives? Write caching? 200G isn't *that* big for good drive
hardware, *IF* it's performing the way it should. You'd be surprised how
many arrays fall on their face even from a simple dd test.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
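A crude version of that dd test (the path and size are placeholders; the
test file should be at least a couple of times RAM, here roughly 8 GB
against 4 GB of memory, so the OS cache can't absorb it):

    # sequential write of 8 kB blocks onto the RAID-10 array, then read back;
    # dd reports throughput when it finishes
    dd if=/dev/zero of=/data/dd_test bs=8k count=1000000 && sync
    dd if=/data/dd_test of=/dev/null bs=8k
    rm /data/dd_test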
On Monday 27 August 2007 15:00:41 you wrote:
> For the really bloated table, you might need to go even higher than 250000
> for maintenance_work_mem.
> [snip]
> What kind of disk hardware is this running on? A good raid 10 array with
> write caching should be able to handle a 200G database fairly well; at
> least better than it is from what I'm hearing.

The memory settings are way low on all their db servers (less than 170Meg
for shared_buffers). I fixed this table by creating a new_** table, running
an INSERT INTO ... SELECT FROM, and renaming. I'm still working through the
memory settings and reviewing their other config settings, the filesystem
type/settings, and eventually a security audit. It's a new client and
they've been running Postgres for a few years on approx 8 db servers with no
DBA. The servers are 4-way Intel boxes (NOT dual-core) with 4G of memory,
running RAID-10 arrays.
On Monday 27 August 2007 16:04:39 Decibel! wrote:
> How many drives? Write caching? 200G isn't *that* big for good drive
> hardware, *IF* it's performing the way it should. You'd be surprised how
> many arrays fall on their face even from a simple dd test.

I haven't gotten that info yet; the key resources are too busy... I'll have
more info next week.

Thanks for the replies...