Thread: Please HELP - URGENT - transaction wraparound error
Oh my god!....

DB is pg 7.4.6 on linux.

2005-10-27 05:55:55 WARNING:  some databases have not been vacuumed in 2129225822 transactions
HINT:  Better vacuum them within 18257825 transactions, or you may have a wraparound failure.

2005-10-28 05:56:58 WARNING:  some databases have not been vacuumed in over 2 billion transactions
DETAIL:  You may have already suffered transaction-wraparound data loss.

We have cron scripts that perform full-database vacuums:

    # vacuum template1 every sunday
    35 2 * * 7 /usr/local/pgsql/bin/vacuumdb --analyze --verbose template1

    # vacuum live DB every day
    35 5 * * * /usr/local/bin/psql -c "vacuum verbose analyze" -d bp_live -U postgres --output /home/postgres/cronscripts/live/vacuumfull.log

Questions:

1) Why do we have data corruption? I thought we were doing everything we needed to stop any wraparound... Are the pg docs inadequate, or did I misunderstand what needed to be done?

2) What can I do to recover the data? I have full daily backups from midnight each day using

    /usr/local/pgsql/bin/pg_dump $DATABASE > $BACKUPFILE

plus I have this database replicated using Slon 1.1.0 to another 7.4.6 database. I can failover to the slave server, but what do I need to do to rebuild the original database?

Should I failover now?!! And then start rebuilding the old master database (using slon, I presume)?

How do I stop this EVER happening again??!!!

Thanks for any help

John
On Sun, Oct 30, 2005 at 08:50:18AM +0000, John Sidney-Woollett wrote:
> Oh my god!....
>
> DB is pg 7.4.6 on linux

Firstly, check pg_database, it should tell you which databases need to be vacuumed. Any database you regularly vacuumed is fine, so maybe the corruption is in some other database you don't remember?

> 1) Why do we have data corruption? I thought we were doing everything we
> needed to stop any wraparound... Are the pg docs inadequate, or did I
> misunderstand what needed to be done?

You *may* have corruption. Anything you vacuumed recently should be fine.

> 2) What can I do to recover the data?

Check whether anything is lost first.

> How do I stop this EVER happening again??!!!

Have you read this:

http://www.postgresql.org/docs/7.4/static/maintenance.html#VACUUM-FOR-WRAPAROUND

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
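A quick way to run the pg_database check suggested above (a sketch; it uses the age() query that comes up later in the thread, with the psql path and user from the cron entries):

    /usr/local/bin/psql -d template1 -U postgres \
        -c "SELECT datname, age(datfrozenxid) FROM pg_database;"

Any database whose age is approaching 2 billion is at risk and needs a VACUUM.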
Martijn, thanks for the feedback.

I had a look around and couldn't see any data loss (but wasn't really sure where to start looking). I decided to switch over to the slave which is now our live database. The old master with the problem has already been re-inited (although I have a cold backup of the data dir), plus dump files that I can restore from.

I checked pg_database (on the new master) and I don't really understand what it is saying. Is the datvacuumxid vs datfrozenxid (3553547043) comparison the significant information? I see in our new database:

       datname    | datdba | encoding | datistemplate
    --------------+--------+----------+---------------
     bp_live      |      1 |        6 | f
     bp_test      |      1 |        6 | f
     template1    |      1 |        0 | t
     template0    |      1 |        0 | t

       datname    | datallowconn | datlastsysoid | datvacuumxid
    --------------+--------------+---------------+--------------
     bp_live      | t            |         17140 |    332321570
     bp_test      | t            |         17140 |    332265474
     template1    | t            |         17140 |    332241177
     template0    | f            |         17140 |          464

       datname    | datfrozenxid | datpath | datconfig
    --------------+--------------+---------+-----------
     bp_live      |   3553547043 |         |
     bp_test      |   3553490947 |         |
     template1    |   3553466653 |         |
     template0    |          464 |         |

       datname    |          datacl
    --------------+--------------------------
     bp_live      |
     bp_test      |
     template1    | {postgres=C*T*/postgres}
     template0    | {postgres=C*T*/postgres}

Are we going to get the same problem with this database?

What's also worrying me is that the warning message is in fact misleading!!??

    2005-10-28 05:56:58 WARNING:  some databases have not been vacuumed in over 2 billion transactions
    DETAIL:  You may have already suffered transaction-wraparound data loss.

And I'm wondering if I have in fact destroyed a perfectly good database and data set...

I read the link you gave (before) but found it hard to work out what you actually need to do to protect yourself. We DID vacuum the databases nightly, and template1 once a week. So I still don't understand why we got this error. Can someone explain in simple language?

Can someone also give me a detailed "you need to do this, and this and this..." explanation to prevent this happening again (either on our master or slave databases). For example, must you do a vacuum full instead of a vacuum analyze on a 7.4.x database to prevent wraparound issues?

BTW, for those not using Slony - you should check it out. It has saved my bacon three times this year! Due to:

1) server failure - hardware crash, needed BIOS flash, complete OS reinstall etc
2) disk full - corrupted pg data
3) transaction ID wraparound (today's problem)

Any further help that anyone can give is much appreciated.

Thanks

John

Martijn van Oosterhout wrote:
> On Sun, Oct 30, 2005 at 08:50:18AM +0000, John Sidney-Woollett wrote:
>
>>Oh my god!....
>>
>>DB is pg 7.4.6 on linux
>
> Firstly, check pg_database, it should tell you which databases need to
> be vacuumed. Any database you regularly vacuumed is fine, so maybe the
> corruption is in some other database you don't remember?
>
>>1) Why do we have data corruption? I thought we were doing everything we
>>needed to stop any wraparound... Are the pg docs inadequate, or did I
>>misunderstand what needed to be done?
>
> You *may* have corruption. Anything you vacuumed recently should be
> fine.
>
>>2) What can I do to recover the data?
>
> Check whether anything is lost first.
>
>>How do I stop this EVER happening again??!!!
>
> Have you read this:
>
> http://www.postgresql.org/docs/7.4/static/maintenance.html#VACUUM-FOR-WRAPAROUND
>
> Hope this helps,
John Sidney-Woollett <johnsw@wardbrook.com> writes:
> I decided to switch over to the slave which is now our live database.
> The old master with the problem has already been re-inited (although I
> have a cold backup of the data dir), plus dump files that I can restore
> from.

You panicked much too quickly and destroyed the evidence ... unless by "cold backup" you mean a filesystem backup, in which case what you should do is restore that and take a look at what's in its pg_database. I think there's no question that there is some omission in your vacuuming procedures, and you need to find out what it is.

			regards, tom lane
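One way to inspect the cold backup described above without touching the new master (a sketch only; the tarball name, restore path, and port are assumptions):

    # assume the cold backup is a tarball of the old pgsql/data directory
    tar -xzf /backups/pgdata-cold.tar.gz -C /tmp/restore
    # start a throwaway postmaster on a spare port and query its pg_database
    /usr/local/pgsql/bin/pg_ctl -D /tmp/restore/data -o "-p 5433" start
    /usr/local/bin/psql -p 5433 -d template1 -U postgres \
        -c "SELECT datname, datfrozenxid FROM pg_database;"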
Hi Tom

You're not wrong about panicking! This is the worst Sunday I've had in a while... No Sunday lunch or time with the kids... :(

This database supports a (normally 24/7) website and we couldn't tolerate any possibility of data corruption. I had to make a judgement call on preventing any/further data loss or corruption, and switching over to the slave seemed the safest thing to do (based on my ignorance of the wraparound problem).

I can restore the file system backup of pgsql/data to another database server and then get the info from pg_database. Or I can import a dump file from 15 minutes before I re-inited the database...

What exactly am I looking for though? We don't use OIDs when creating tables...

Could Slon 1.1.0 be causing a problem for us? It must be creating and deleting bucket loads of records as part of its regular activity...

What am I likely to have missed in my vacuuming? Because whatever I did wrong is going to break our current live database at some point soon.

Thanks

John

Tom Lane wrote:
> John Sidney-Woollett <johnsw@wardbrook.com> writes:
>
>>I decided to switch over to the slave which is now our live database.
>>The old master with the problem has already been re-inited (although I
>>have a cold backup of the data dir), plus dump files that I can restore
>>from.
>
> You panicked much too quickly and destroyed the evidence ... unless by
> "cold backup" you mean a filesystem backup, in which case what you
> should do is restore that and take a look at what's in its pg_database.
> I think there's no question that there is some omission in your vacuuming
> procedures, and you need to find out what it is.
>
> regards, tom lane
John Sidney-Woollett <johnsw@wardbrook.com> writes:
> I can restore the file system backup of pgsql/data to another database
> server and then get the info from pg_database. Or I can import a dump
> file from 15 minutes before I re-inited the database...

Importing a dump will tell you nothing at all, as all the data will be freshly loaded.

> What exactly am I looking for though?

    SELECT datname, age(datfrozenxid) FROM pg_database;

where the second column approaches 2 billion.

Alternatively, wait a few weeks and note which entries in your live database are increasing rather than staying near 1 billion.

			regards, tom lane
OK, I restored the pgsql/data to another server and started up postgres and this is what I got:

    SELECT datname, age(datfrozenxid) FROM pg_database;

       datname    |     age
    --------------+-------------
     mail_lxtreme | -2074187459
     bp_live      |  1079895636
     template1    |  1076578064
     template0    | -2074187459
    (4 rows)

mail_lxtreme is a test mail db and I don't care about it. So it could have been deleted without any worries...

Which databases are a problem? Is it template0 or bp_live and template1?

Thanks

John

Tom Lane wrote:
> John Sidney-Woollett <johnsw@wardbrook.com> writes:
>
>>I can restore the file system backup of pgsql/data to another database
>>server and then get the info from pg_database. Or I can import a dump
>>file from 15 minutes before I re-inited the database...
>
> Importing a dump will tell you nothing at all, as all the data will be
> freshly loaded.
>
>>What exactly am I looking for though?
>
> SELECT datname, age(datfrozenxid) FROM pg_database;
>
> where the second column approaches 2 billion.
>
> Alternatively, wait a few weeks and note which entries in your live
> database are increasing rather than staying near 1 billion.
>
> regards, tom lane
John Sidney-Woollett <johnsw@wardbrook.com> writes:
> OK, I restored the pgsql/data to another server and started up postgres
> and this is what I got:

>  SELECT datname, age(datfrozenxid) FROM pg_database;
>    datname    |     age
> --------------+-------------
>  mail_lxtreme | -2074187459
>  bp_live      |  1079895636
>  template1    |  1076578064
>  template0    | -2074187459
> (4 rows)

> mail_lxtreme is a test mail db and I don't care about it. So it could
> have been deleted without any worries...

> Which databases are a problem? Is it template0 or bp_live and template1?

mail_lxtreme is exactly the problem. You weren't vacuuming it...

(template0 is a special case and can be ignored.)

			regards, tom lane
Hmm. I'm pretty sure that database mail_lxtreme was unused (no connections/activity) - I didn't think that it would need to be vacuumed at all...

Just out of curiosity, would the wraparound error (for mail_lxtreme) actually have affected data in bp_live?

Could I just have deleted mail_lxtreme and then continued to use bp_live as though nothing had happened?

Or had database bp_live already been damaged by the wraparound?

Thanks for your great help/advice - it's much appreciated.

John

Tom Lane wrote:
> John Sidney-Woollett <johnsw@wardbrook.com> writes:
>
>>OK, I restored the pgsql/data to another server and started up postgres
>>and this is what I got:
>
>> SELECT datname, age(datfrozenxid) FROM pg_database;
>>   datname    |     age
>>--------------+-------------
>> mail_lxtreme | -2074187459
>> bp_live      |  1079895636
>> template1    |  1076578064
>> template0    | -2074187459
>>(4 rows)
>
>>mail_lxtreme is a test mail db and I don't care about it. So it could
>>have been deleted without any worries...
>
>>Which databases are a problem? Is it template0 or bp_live and template1?
>
> mail_lxtreme is exactly the problem. You weren't vacuuming it...
>
> (template0 is a special case and can be ignored.)
>
> regards, tom lane
On Sun, Oct 30, 2005 at 06:41:45PM +0000, John Sidney-Woollett wrote:
> Hmm. I'm pretty sure that database mail_lxtreme was unused (no
> connections/activity) - I didn't think that it would need to be vacuumed
> at all...

A database that is never used still needs to be vacuumed. The only exception is if you VACUUM FREEZE, which puts the entire database in a frozen state which will never need vacuuming. This is how template0 is configured. Of course, once you make changes...

> Just out of curiosity, would the wraparound error (for mail_lxtreme)
> actually have affected data in bp_live?

I doubt it but (thinking shared tables) I'll have to defer to someone more knowledgeable.

> Could I just have deleted mail_lxtreme and then continued to use bp_live
> as though nothing had happened?
>
> Or had database bp_live already been damaged by the wraparound?

Everything would probably have been fine.

BTW, I would have thought this message would have been appearing for the last billion transactions or so, didn't anyone notice?

To solve this forever, set up a cronjob for once a month (a sketch follows below):

    vacuumdb -a

This will vacuum every database, even if you don't know the names or where they came from. AIUI, when you vacuum a database whose transactions are over a billion transactions old it is automatically put in the "frozen" state. If someone had happened to run "vacuumdb -a" anytime in the last few months, you might never have noticed the wraparound...

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
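A sketch of that monthly cron job, in the style of the crontab entries earlier in the thread (the schedule and log path are assumptions):

    # once a month, vacuum every database in the cluster as a safety net
    0 3 1 * * /usr/local/pgsql/bin/vacuumdb -a --analyze > /home/postgres/cronscripts/vacuumall.log 2>&1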
John Sidney-Woollett <johnsw@wardbrook.com> writes:
> Just out of curiosity, would the wraparound error (for mail_lxtreme)
> actually have affected data in bp_live?
> Could I just have deleted mail_lxtreme and then continued to use bp_live
> as though nothing had happened?

No, and yes, which is why panicking was not warranted ;-)

Martijn's advice to be using "vacuumdb -a" every so often is well given, though.

You could also consider switching over to autovacuum, particularly as of 8.1. (I'm not sure how much I trust the contrib version that exists in 8.0, and 7.4's is definitely pretty buggy, but I believe 8.1's can be relied on to prevent this sort of thing.)

			regards, tom lane
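For reference, enabling the integrated autovacuum in 8.1 is a postgresql.conf change; a minimal sketch (in 8.1 autovacuum also depends on the row-level stats collector being enabled):

    # postgresql.conf (8.1) - sketch
    stats_start_collector = on   # autovacuum needs the stats collector
    stats_row_level = on         # ... with row-level statistics enabled
    autovacuum = on              # start the autovacuum daemon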
Martijn

Thanks for the answers/thoughts...

Vacuuming the databases hammers the server, so the vacuums are spread out at different times during the night/morning. Plus template1 is vacuumed once a week.

I guess I was unlucky to have missed the vacuum on that unused database (due to my misunderstanding), and not to have been scanning the serverlog more frequently (if at all recently!).

My solution is to create a nagios script that scans entries in serverlog looking for WARN or ERROR messages in the past xx minutes (a rough sketch is below). With this in place, I would have caught this error weeks ago before it bit me in the ass!

Stressful day, but learnt a lot...

Thanks to everyone for their input - great product and great support!

John

Martijn van Oosterhout wrote:
> On Sun, Oct 30, 2005 at 06:41:45PM +0000, John Sidney-Woollett wrote:
>
>>Hmm. I'm pretty sure that database mail_lxtreme was unused (no
>>connections/activity) - I didn't think that it would need to be vacuumed
>>at all...
>
> A database that is never used still needs to be vacuumed. The only
> exception is if you VACUUM FREEZE, which puts the entire database in a
> frozen state which will never need vacuuming. This is how template0 is
> configured. Of course, once you make changes...
>
>>Just out of curiosity, would the wraparound error (for mail_lxtreme)
>>actually have affected data in bp_live?
>
> I doubt it but (thinking shared tables) I'll have to defer to someone
> more knowledgeable.
>
>>Could I just have deleted mail_lxtreme and then continued to use bp_live
>>as though nothing had happened?
>>
>>Or had database bp_live already been damaged by the wraparound?
>
> Everything would probably have been fine.
>
> BTW, I would have thought this message would have been appearing for the
> last billion transactions or so, didn't anyone notice?
>
> To solve this forever, set up a cronjob for once a month:
>
> vacuumdb -a
>
> This will vacuum every database, even if you don't know the names or
> where they came from. AIUI, when you vacuum a database whose
> transactions are over a billion transactions old it is automatically put
> in the "frozen" state. If someone had happened to run "vacuumdb -a"
> anytime in the last few months, you might never have noticed the
> wraparound...
>
> Hope this helps,
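A minimal sketch of that kind of serverlog check (the script, log location, and state file are all hypothetical; exit code 2 is what nagios treats as CRITICAL):

    #!/bin/sh
    # hypothetical nagios check: alert on new WARNING/ERROR lines since last run
    LOG=/usr/local/pgsql/data/serverlog          # assumed log location
    STATE=/home/postgres/cronscripts/log.offset  # remembers how far we read

    LAST=$(cat "$STATE" 2>/dev/null || echo 0)
    LINES=$(wc -l < "$LOG" | tr -d ' ')
    [ "$LINES" -lt "$LAST" ] && LAST=0           # log was rotated; rescan
    echo "$LINES" > "$STATE"

    # examine only the lines appended since the previous run
    if tail -n +"$((LAST + 1))" "$LOG" | grep -E "WARNING|ERROR" >/dev/null; then
        echo "CRITICAL: new WARNING/ERROR entries in serverlog"
        exit 2
    fi
    echo "OK: no new warnings or errors"
    exit 0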
"Panic" - that's my middle name. ;) Had I known how to identify the database at fault, and that it would have had no effect on the other databases, then I would have handled this episode differently. In the event, things seem to be OK. Our old slave db is now acting as master and the old master rebuilt as the new slave ... courtesy of slon. I'd like to move to 8.1 but I'm waiting for a quiet period when there's less development/fire fighting so that I can test all the java components of our webapp and then manage the upgrade properly. Maybe suppressing other vacuums once a month, and running the "vacuumdb -a" option instead wouldn't be a bad idea... Many thanks for all your support and advice - you've been great help (and comfort). John Tom Lane wrote: > John Sidney-Woollett <johnsw@wardbrook.com> writes: > >>Just out of curiousity would the wraparound error (for mail_lxtreme) >>actually have affected data in bp_live? >>Could I just have deleted mail_lxtreme and then continued to use bp_live >>as though nothing had happened? > > > No, and yes, which is why panicking was not warranted ;-) > > Martijn's advice to be using "vacuumdb -a" every so often is well given, > though. > > You could also consider switching over to autovacuum, particularly as of > 8.1. (I'm not sure how much I trust the contrib version that exists in > 8.0, and 7.4's is definitely pretty buggy, but I believe 8.1's can be > relied on to prevent this sort of thing.) > > regards, tom lane
At 07:48 PM 10/30/2005 +0000, John Sidney-Woollett wrote:

>"Panic" - that's my middle name. ;)
>
>Had I known how to identify the database at fault, and that it would have
>had no effect on the other databases, then I would have handled this
>episode differently.

Wonder if it would be a good idea for the error messages to identify which databases might have lost data.

However, if you have a fair number of databases involved you might get a fair number of log messages. Still, I think I wouldn't mind 100 lines in the logs if I had 100 databases at risk... :)

Link.
Lincoln Yeoh said:
> At 07:48 PM 10/30/2005 +0000, John Sidney-Woollett wrote:
>
>>"Panic" - that's my middle name. ;)
>>
>>Had I known how to identify the database at fault, and that it would have
>>had no effect on the other databases, then I would have handled this
>>episode differently.
>
> Wonder if it would be a good idea for the error messages to identify which
> databases might have lost data.
>
> However, if you have a fair number of databases involved you might get a
> fair number of log messages. Still, I think I wouldn't mind 100 lines in
> the logs if I had 100 databases at risk...

Agreed!

John
On Mon, Oct 31, 2005 at 05:27:15PM -0000, John Sidney-Woollett wrote:
> > Wonder if it would be a good idea for the error messages to identify which
> > databases might have lost data.
> >
> > However if you have a fair number of databases involved you might get a
> > fair number of log messages. Still, I think I wouldn't mind 100 lines in
> > the logs if I had 100 databases at risk...

FWIW, the 8.1beta has the following code:

    ereport(WARNING,
            (errmsg("database \"%s\" must be vacuumed within %u transactions",
                    NameStr(oldest_datname),
                    (MaxTransactionId >> 1) - age),
             errhint("To avoid a database shutdown, execute a full-database VACUUM in \"%s\".",
                     NameStr(oldest_datname))));

Should be enough I think...
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.