Thread: Recovering a database in danger of transaction wrap-around
I recently inherited a server with a PostgreSQL 8.1.8 database. The reason I inherited it was because "it wasn't working anymore". A quick look in the logfiles showed the following:

LOG: transaction ID wrap limit is 2147484146, limited by database "vsa"
WARNING: database "vsa" must be vacuumed within 998573 transactions
HINT: To avoid a database shutdown, execute a full-database VACUUM in "vsa".
FATAL: database is not accepting commands to avoid wraparound data loss in database "vsa"
HINT: Stop the postmaster and use a standalone backend to vacuum database "vsa".

I found out quickly enough what that means... I did as instructed, and fired up the standalone backend. I then started VACUUM. About four days later, the standalone backend terminated with the message:

WARNING: database "vsa" must be vacuumed within 997403 transactions
HINT: To avoid a database shutdown, execute a full-database VACUUM in "vsa".
WARNING: database "vsa" must be vacuumed within 997402 transactions
HINT: To avoid a database shutdown, execute a full-database VACUUM in "vsa".
WARNING: database "vsa" must be vacuumed within 997401 transactions
HINT: To avoid a database shutdown, execute a full-database VACUUM in "vsa".
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
CONTEXT: writing block 465 of relation 1663/16384/863912
-bash-3.00$

I used lsof to monitor which files the backend was actually working on. It took two of the four days for it to vacuum a single table with 43 one-gigabyte extents. I have one table with over 300 extents. I'm looking at a vacuum process which can ultimately take weeks (if not months) to complete. Is there an easier way of monitoring VACUUM's progress in a standalone backend?

As far as I can tell, absolutely no tuning was done to Postgres, and autovacuum was in its default "off" state. This explains why it wasn't working anymore. The server itself has 8 GB of RAM, but a very poor I/O channel. The physical size of the database on disk is 856 GB.

Bottom line. Is there *any* way of faking out the 1 million transaction limit which prevents the postmaster from running, long enough for me to use pg_dump to rescue the data? I found one article which described xidStopLimit, but I have no idea where to go to reset it, or whether changing it would have any beneficial effect.

Is there any way of rescuing the contents of this database, or do I need to accept the inevitable? If I restart VACUUM in the standalone backend, does it have to repeat all the work it did before the abend above?

One other possibility I came up with is to use COPY from the standalone backend on each table, but we're talking about 100+ tables for each of 15 schemas.

Thanks in advance for any suggestions, advice, or words of condolence you can offer,

--- Steve

___________________________________________________________________________________
Steven Rosenstein
IT Architect/Developer | IBM Virtual Server Administration
Voice/FAX: 845-689-2064 | Cell: 646-345-6978 | Tieline: 930-6001
Text Messaging: 6463456978 @ mobile.mycingular.com
Email: srosenst @ us.ibm.com

"Learn from the mistakes of others because you can't live long enough to make them all yourself."
-- Eleanor Roosevelt
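For reference, a hedged sketch of the per-table COPY escape hatch mentioned above. It assumes the 8.1 standalone backend is invoked as "postgres -D <datadir> <dbname>" and reads one statement per line on stdin; the data directory, output path, and table names are placeholders, not the real ones from this server.

    PGDATA=/vsa/pgsql/data            # assumed data directory
    OUT=/vsa/backups/copyout          # assumed scratch space with enough room
    mkdir -p "$OUT"
    # one server-side COPY per table; the table list here is illustrative
    for t in schema1.table_a schema1.table_b schema2.table_c; do
        echo "COPY $t TO '$OUT/$t.copy'"
    done | postgres -D "$PGDATA" vsa

The resulting flat files could later be reloaded with COPY ... FROM into a freshly initialized cluster, one table at a time.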
Steven Rosenstein <srosenst@us.ibm.com> writes:
> I did as instructed, and fired up the standalone backend. I then started
> VACUUM. About four days later, the standalone backend terminated with the
> message:
> WARNING: terminating connection because of crash of another server process
> DETAIL: The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT: In a moment you should be able to reconnect to the database and
> repeat your command.
> CONTEXT: writing block 465 of relation 1663/16384/863912

Ugh. Something sent the standalone backend a SIGQUIT signal. You need to find out what did that.

> I used lsof to monitor which files the backend was actually working on. It
> took two of the four days for it to vacuum a single table with 43
> one-gigabyte extents. I have one table with over 300 extents. I'm looking
> at a vacuum process which can ultimately take weeks (if not months) to
> complete.

Yipes. You are just using plain VACUUM, right, not VACUUM FULL? Have you checked that vacuum_cost_delay isn't enabled?

> Bottom line. Is there *any* way of faking out the 1 million transaction
> limit which prevents the postmaster from running, long enough for me to use
> pg_dump to rescue the data?

In 8.1 those limits are all hard-wired; you'd need to modify SetTransactionIdLimit() in src/backend/access/transam/varsup.c and recompile. Might be worth doing, if you think these tables have been bloated by a complete lack of vacuuming.

			regards, tom lane
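For what it's worth, a hedged way to double-check the vacuum_cost_delay question; the paths are placeholders, and the second form assumes the 8.1 standalone backend will take a query on stdin:

    grep -E '^[[:space:]]*vacuum_cost_delay' /vsa/pgsql/data/postgresql.conf
    # or ask the backend itself:
    echo "SHOW vacuum_cost_delay" | postgres -D /vsa/pgsql/data vsa

If it comes back 0 (the default), cost-based delay is not what is making this VACUUM slow.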
On Fri, Jan 25, 2008 at 02:10:06PM -0500, Tom Lane wrote:

> > I used lsof to monitor which files the backend was actually working on. It
> > took two of the four days for it to vacuum a single table with 43
> > one-gigabyte extents. I have one table with over 300 extents. I'm looking
> > at a vacuum process which can ultimately take weeks (if not months) to
> > complete.
>
> Yipes. You are just using plain VACUUM, right, not VACUUM FULL?
> Have you checked that vacuum_cost_delay isn't enabled?

pg_dump/pg_restore may be a lot faster here - we're in an emergency situation anyway, and after that the whole DB will be clean again: all indices rebuilt nicely, no bloat in the tables.

Tino.

--
www.craniosacralzentrum.de
www.spiritualdesign-chemnitz.de

Tino Schwarze * Lortzingstraße 21 * 09119 Chemnitz
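Roughly what that dump-and-reload could look like once the postmaster will accept connections again; the file names, and the idea of restoring into a freshly initialized (and properly tuned) cluster, are assumptions:

    pg_dump -Fc -f /vsa/backups/vsa.dump vsa
    # after re-initdb, or on a healthier machine:
    createdb vsa
    pg_restore -d vsa /vsa/backups/vsa.dump

The custom (-Fc) format is typically compressed and lets individual tables be restored selectively if something goes wrong partway through.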
I used plain old VACUUM. Do you think VACUUM FULL might be faster or more effective?

Here is the Resource Usage section from the config file (sorry for the mild spam). Everything looks like it is still in its default state. I'm sure upping shared_buffers and max_fsm_pages would make a difference, but it's been a while since I did any real tuning work on a Postgres db, and I'm not comfortable making changes to a sick database.

One other thing I should mention: we allocate 1 GB (out of 8 GB) of memory to shared memory (/proc/sys/kernel/shmmax=1073741824), yet when I look at shared memory (ipcs), the Postgres segments only consume about 12 MB. I'm sure that's a tuning parameter somewhere. Do you think increasing shared_buffers and max_fsm_pages might help?

#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------

# - Memory -

shared_buffers = 1000              # min 16 or max_connections*2, 8KB each
#temp_buffers = 1000               # min 100, 8KB each
#max_prepared_transactions = 5     # can be 0 or more
# note: increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
#work_mem = 1024                   # min 64, size in KB
#maintenance_work_mem = 16384      # min 1024, size in KB
#max_stack_depth = 2048            # min 100, size in KB

# - Free Space Map -

max_fsm_pages = 176928             # min max_fsm_relations*16, 6 bytes each
#max_fsm_relations = 1000          # min 100, ~70 bytes each

# - Kernel Resource Usage -

#max_files_per_process = 1000      # min 25
#preload_libraries = ''

# - Cost-Based Vacuum Delay -

#vacuum_cost_delay = 0             # 0-1000 milliseconds
#vacuum_cost_page_hit = 1          # 0-10000 credits
#vacuum_cost_page_miss = 10        # 0-10000 credits
#vacuum_cost_page_dirty = 20       # 0-10000 credits
#vacuum_cost_limit = 200           # 0-10000 credits
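On the shared-memory question above: the ~12 MB visible in ipcs follows directly from those settings, since 1000 shared_buffers at 8 KB each is only about 8 MB plus fixed overhead. If the 1 GB shmmax allowance were to be put to use during the rescue, these are the lines involved; the values below are purely illustrative and have not been tested on this server:

    shared_buffers = 100000            # ~800 MB of buffer cache at 8 KB per page
    maintenance_work_mem = 524288      # value in KB (512 MB); mainly speeds up VACUUM's index passes
    max_fsm_pages = 2000000            # 6 bytes each; needs to cover free space across an 856 GB cluster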
I am unable to run pg_dump. Even though I can successfully start the postmaster, every time I try to do anything I receive the warning that the postmaster will not allow any activity because we have crossed the minimum transaction threshold. pg_dump dies almost immediately:

Starting dump to /vsa/backups/db/20080125170401.vsa_pgsql_db_bak.gz...pg_dump: [archiver (db)] connection to database "vsa" failed: FATAL: database is not accepting commands to avoid wraparound data loss in database "vsa"
HINT: Stop the postmaster and use a standalone backend to vacuum database "vsa".
The backup completed with condition code = 0

Any other suggestions?
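As a side note, one hedged way to gauge how much headroom actually remains without being able to connect: pg_controldata reports the transaction counter as of the last checkpoint. The data-directory path is a placeholder and the exact output label can differ between versions:

    pg_controldata /vsa/pgsql/data | grep -i nextxid

Comparing that NextXID value with the wrap limit from the log (2147484146) shows roughly how many transactions are left to work with.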
Steven Rosenstein <srosenst@us.ibm.com> writes:
> I used plain old VACUUM. Do you think VACUUM FULL might be faster or more
> effective?

No.

I think you probably want to do a dump and reload, but first you have to get past the anti-wraparound check. One possibility I hadn't thought of before is to use a standalone backend to increment the pg_database.datfrozenxid values by a few thousand transactions. This would be a bad idea if you intended to keep using the DB, but if you're just trying to get to a state where you can run pg_dump, it seems acceptable.

			regards, tom lane
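Before editing anything, the current values can be inspected cheaply from the standalone backend; a hedged sketch, with the data directory path assumed, where age() reports how far each datfrozenxid lags behind the current transaction counter:

    echo "SELECT datname, datfrozenxid, datvacuumxid, age(datfrozenxid) FROM pg_database" | postgres -D /vsa/pgsql/data vsa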
Incrementing pg_database.datfrozenxid sounded like an excellent idea. Before I made any changes I dumped the contents of pg_database to see what it contained. Here is that dump:

backend> select * from pg_database;
WARNING: database "vsa" must be vacuumed within 997398 transactions
HINT: To avoid a database shutdown, execute a full-database VACUUM in "vsa".
	 1: datname        (typeid = 19, len = 64, typmod = -1, byval = f)
	 2: datdba         (typeid = 26, len = 4, typmod = -1, byval = t)
	 3: encoding       (typeid = 23, len = 4, typmod = -1, byval = t)
	 4: datistemplate  (typeid = 16, len = 1, typmod = -1, byval = t)
	 5: datallowconn   (typeid = 16, len = 1, typmod = -1, byval = t)
	 6: datconnlimit   (typeid = 23, len = 4, typmod = -1, byval = t)
	 7: datlastsysoid  (typeid = 26, len = 4, typmod = -1, byval = t)
	 8: datvacuumxid   (typeid = 28, len = 4, typmod = -1, byval = t)
	 9: datfrozenxid   (typeid = 28, len = 4, typmod = -1, byval = t)
	10: dattablespace  (typeid = 26, len = 4, typmod = -1, byval = t)
	11: datconfig      (typeid = 1009, len = -1, typmod = -1, byval = f)
	12: datacl         (typeid = 1034, len = -1, typmod = -1, byval = f)
	----
	 1: datname = "postgres"        (typeid = 19, len = 64, typmod = -1, byval = f)
	 2: datdba = "10"               (typeid = 26, len = 4, typmod = -1, byval = t)
	 3: encoding = "8"              (typeid = 23, len = 4, typmod = -1, byval = t)
	 4: datistemplate = "f"         (typeid = 16, len = 1, typmod = -1, byval = t)
	 5: datallowconn = "t"          (typeid = 16, len = 1, typmod = -1, byval = t)
	 6: datconnlimit = "-1"         (typeid = 23, len = 4, typmod = -1, byval = t)
	 7: datlastsysoid = "10792"     (typeid = 26, len = 4, typmod = -1, byval = t)
	 8: datvacuumxid = "2146484345" (typeid = 28, len = 4, typmod = -1, byval = t)
	 9: datfrozenxid = "1072742522" (typeid = 28, len = 4, typmod = -1, byval = t)
	10: dattablespace = "1663"      (typeid = 26, len = 4, typmod = -1, byval = t)
	----
	 1: datname = "vsa"             (typeid = 19, len = 64, typmod = -1, byval = f)
	 2: datdba = "10"               (typeid = 26, len = 4, typmod = -1, byval = t)
	 3: encoding = "0"              (typeid = 23, len = 4, typmod = -1, byval = t)
	 4: datistemplate = "f"         (typeid = 16, len = 1, typmod = -1, byval = t)
	 5: datallowconn = "t"          (typeid = 16, len = 1, typmod = -1, byval = t)
	 6: datconnlimit = "-1"         (typeid = 23, len = 4, typmod = -1, byval = t)
	 7: datlastsysoid = "10792"     (typeid = 26, len = 4, typmod = -1, byval = t)
	 8: datvacuumxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
	 9: datfrozenxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
	10: dattablespace = "1663"      (typeid = 26, len = 4, typmod = -1, byval = t)
	----
	 1: datname = "template1"       (typeid = 19, len = 64, typmod = -1, byval = f)
	 2: datdba = "10"               (typeid = 26, len = 4, typmod = -1, byval = t)
	 3: encoding = "8"              (typeid = 23, len = 4, typmod = -1, byval = t)
	 4: datistemplate = "t"         (typeid = 16, len = 1, typmod = -1, byval = t)
	 5: datallowconn = "t"          (typeid = 16, len = 1, typmod = -1, byval = t)
	 6: datconnlimit = "-1"         (typeid = 23, len = 4, typmod = -1, byval = t)
	 7: datlastsysoid = "10792"     (typeid = 26, len = 4, typmod = -1, byval = t)
	 8: datvacuumxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
	 9: datfrozenxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
	10: dattablespace = "1663"      (typeid = 26, len = 4, typmod = -1, byval = t)
	12: datacl = "{postgres=CT/postgres}" (typeid = 1034, len = -1, typmod = -1, byval = f)
	----
	 1: datname = "template0"       (typeid = 19, len = 64, typmod = -1, byval = f)
	 2: datdba = "10"               (typeid = 26, len = 4, typmod = -1, byval = t)
	 3: encoding = "8"              (typeid = 23, len = 4, typmod = -1, byval = t)
	 4: datistemplate = "t"         (typeid = 16, len = 1, typmod = -1, byval = t)
	 5: datallowconn = "f"          (typeid = 16, len = 1, typmod = -1, byval = t)
	 6: datconnlimit = "-1"         (typeid = 23, len = 4, typmod = -1, byval = t)
	 7: datlastsysoid = "10792"     (typeid = 26, len = 4, typmod = -1, byval = t)
	 8: datvacuumxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
	 9: datfrozenxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
	10: dattablespace = "1663"      (typeid = 26, len = 4, typmod = -1, byval = t)
	12: datacl = "{postgres=CT/postgres}" (typeid = 1034, len = -1, typmod = -1, byval = f)
	----

The database which contains all the working schemas and tables is the one called "vsa". If you look at datvacuumxid and datfrozenxid, they both seem perfectly reasonable in the vsa database. However, the same values in the "postgres" database are approaching what appear to be the hard limits.

Before I charge ahead and make any changes, I wanted to confirm that it is *datfrozenxid* in the "postgres" database which I should *increment* by a couple of thousand, and not *datvacuumxid* in the "postgres" database which should be *decremented* by a couple of thousand. My intuition says the value should be decreased so that the absolute value is more than 1,000,000 less than 2^31, but of course I might be wrong.

I have no idea what the "postgres" database is, where it came from, or why its transaction IDs are so out of skew. I don't think it was created at database creation time. The person responsible for installing Postgres left the company a few weeks ago and is not available to ask. I checked another server with a similar configuration: it has a "postgres" database too, but its values for datvacuumxid and datfrozenxid are the same as the other three databases, 499 each.
Steven Rosenstein <srosenst@us.ibm.com> writes:
>  1: datname = "postgres"        (typeid = 19, len = 64, typmod = -1, byval = f)
>  8: datvacuumxid = "2146484345" (typeid = 28, len = 4, typmod = -1, byval = t)
>  9: datfrozenxid = "1072742522" (typeid = 28, len = 4, typmod = -1, byval = t)
> ----
>  1: datname = "vsa"             (typeid = 19, len = 64, typmod = -1, byval = f)
>  8: datvacuumxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
>  9: datfrozenxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
> ----
>  1: datname = "template1"       (typeid = 19, len = 64, typmod = -1, byval = f)
>  8: datvacuumxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
>  9: datfrozenxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
> ----
>  1: datname = "template0"       (typeid = 19, len = 64, typmod = -1, byval = f)
>  8: datvacuumxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)
>  9: datfrozenxid = "499"        (typeid = 28, len = 4, typmod = -1, byval = t)

Apparently, "postgres" is the only one of these that has ever had a database-wide VACUUM done on it :-(. A look at an 8.1 database here confirms that 499 is what would be in those fields immediately after initdb, so it's never been changed.

> If you look at datvacuumxid and datfrozenxid, they both seem
> perfectly reasonable in the vsa database.

No, they aren't, at least not for an installation that's existed awhile.

> However, the same values in the
> "postgres" database are approaching what appear to be the hard limits.

"postgres" is approaching the wraparound point, which is by no means a hard limit. The internal transaction counter (which you could check on with pg_controldata) is presumably even closer to the wrap point.

> Before I charge ahead and make any changes, I wanted to confirm that it is
> *datfrozenxid* in the "postgres" database which I should *increment* by a
> couple of thousand, and not *datvacuumxid* in the "postgres" database which
> should be *decremented* by a couple of thousand.

No, you need to leave "postgres" alone and increment the other ones, to make it look like they got vacuumed sometime closer to current time.

> I have no idea what the "postgres" database is, where it came from, or why
> its transaction IDs are so out of skew. I don't think it was created at
> database creation time.

http://www.postgresql.org/docs/8.1/static/manage-ag-createdb.html

> The person responsible for installing Postgres left the
> company a few weeks ago and is not available to ask. I checked another
> server with a similar configuration: it has a "postgres" database too, but
> its values for datvacuumxid and datfrozenxid are the same as the other
> three databases, 499 each.

Then it's not being managed properly either ...

			regards, tom lane
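To make that concrete, a hedged sketch of the catalog tweak Tom describes, run from the standalone backend. The data directory path is a placeholder, and the literal value (a bump of roughly 100,000 XIDs over the current 499) is only illustrative: just enough to move the stop limit comfortably past the current counter so pg_dump can connect. Arithmetic isn't defined on xid columns, hence the quoted literal, and the cluster should still be dumped and reloaded immediately afterwards.

    postgres -D /vsa/pgsql/data vsa
    backend> UPDATE pg_database SET datfrozenxid = '100499' WHERE datname IN ('vsa', 'template1', 'template0')
    (exit with Ctrl-D so the change is committed and the backend shuts down cleanly)

After that, restarting the postmaster should pick up the new limits from pg_database and let pg_dump connect.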