Thread: Please HELP - URGENT - transaction wraparound error

Please HELP - URGENT - transaction wraparound error

From
John Sidney-Woollett
Date:
Oh my god!....

DB is pg 7.4.6 on linux

2005-10-27 05:55:55 WARNING:  some databases have not been vacuumed in
2129225822 transactions
HINT:  Better vacuum them within 18257825 transactions, or you may have
a wraparound failure.


2005-10-28 05:56:58 WARNING:  some databases have not been vacuumed in
over 2 billion transactions
DETAIL:  You may have already suffered transaction-wraparound data loss.

We have cronscripts that perform FULL vacuums

# vacuum template1 every sunday
35 2 * * 7 /usr/local/pgsql/bin/vacuumdb --analyze --verbose template1

# vacuum live DB every day
35 5 * * * /usr/local/bin/psql -c "vacuum verbose analyze" -d bp_live -U
postgres --output /home/postgres/cronscripts/live/vacuumfull.log

Questions:

1) Why do have we data corruption? I thought we were doing everything we
needed to stop any wraparound... Are the pg docs inadequate, or did I
misunderstand what needed to be done?

2) What can I do to recover the data?

I have full daily backups from midnight each day using
/usr/local/pgsql/bin/pg_dump $DATABASE > $BACKUPFILE

plus I have this database replicated using Slon 1.1.0 to another 7.4.6
database.

I can failover to the slave server, but what do I need to do to rebuild
the original database?

Should I failover now?!! And then start rebuilding the old master
database (using slon, I presume)?

How do I stop this EVER happening again??!!!

Thanks for help

John

Re: Please HELP - URGENT - transaction wraparound error

From
Martijn van Oosterhout
Date:
On Sun, Oct 30, 2005 at 08:50:18AM +0000, John Sidney-Woollett wrote:
> Oh my god!....
>
> DB is pg 7.4.6 on linux

Firstly, check pg_database, it should tell you which databases need to
be vacuumed. Any database you regularly vacuumed is fine so maybe the
corruption is in some other database you don't remember?

> 1) Why do have we data corruption? I thought we were doing everything we
> needed to stop any wraparound... Are the pg docs inadequate, or did I
> misunderstand what needed to be done?

You *may* have corruption. Anything you vacuumed recently should be
fine.

> 2) What can I do to recover the data?

Check whether anything is lost first.

> How do I stop this EVER happening again??!!!

Have you read this:

http://www.postgresql.org/docs/7.4/static/maintenance.html#VACUUM-FOR-WRAPAROUND

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: Please HELP - URGENT - transaction wraparound error

From
John Sidney-Woollett
Date:
Martin, thanks for the feedback.

I had a look around and couldn't see any data loss (but wasn't really
sure where to start looking).

I decided to switch over to the slave which is now our live database.
the old master with the problem has already been re-inited (although I
have a cold backup of the data dir), plus dump files that I can restore
from.

I checked pg_database (on the new master) and I don't really understand
what it is saying. Is the datvacuumxid vs 3553547043 the significant
information? I see in our new database:

    datname    | datdba | encoding | datistemplate
--------------+--------+----------+---------------
  bp_live      |      1 |        6 | f
  bp_test      |      1 |        6 | f
  template1    |      1 |        0 | t
  template0    |      1 |        0 | t


    datname    | datallowconn | datlastsysoid | datvacuumxid
--------------+--------------+---------------+--------------
  bp_live      | t            |         17140 |    332321570
  bp_test      | t            |         17140 |    332265474
  template1    | t            |         17140 |    332241177
  template0    | f            |         17140 |          464


    datname    | datfrozenxid | datpath | datconfig
--------------+--------------+---------+-----------
  bp_live      |   3553547043 |         |
  bp_test      |   3553490947 |         |
  template1    |   3553466653 |         |
  template0    |          464 |         |


    datname    |          datacl
--------------+--------------------------
  bp_live      |
  bp_test      |
  template1    | {postgres=C*T*/postgres}
  template0    | {postgres=C*T*/postgres}


Are we going to get the same problem with this database?

What's also worrying me is that the warning message is in fact
misleading!!??

2005-10-28 05:56:58 WARNING:  some databases have not been vacuumed in
over 2 billion transactions
DETAIL:  You may have already suffered transaction-wraparound data loss.

And I'm wondering if I have in fact destroyed a perfectly good database
and data set...

I read the link you gave (before) but found it hard to work out what you
actually need to do to protect yourself.

We DID vacuum the databases nightly, and template1 once a week. So I
still don't understand why we got this error. Can someone explain in
simple language?

Can someone also give me a detailed "you need to do this, and this and
this..." explanation to prevent this happening again (either on our
master or slave databases).

For example, must you do a vacuum full instead of a vacuum analyze on a
7.4.x database to prevent wraparound issues?

BTW, for those not using **Slony** - you should check it out. It has
saved my bacon three times this year! Due to:

1) server failure - hardware crash, needed BIOS flash, complete OS
reinstall etc
2) disk full - corrupted pg data
3) oid wraparound (today's problem)

Any further help that anyone can give is much appreciated.

Thanks

John

Martijn van Oosterhout wrote:
> On Sun, Oct 30, 2005 at 08:50:18AM +0000, John Sidney-Woollett wrote:
>
>>Oh my god!....
>>
>>DB is pg 7.4.6 on linux
>
>
> Firstly, check pg_database, it should tell you which databases need to
> be vacuumed. Any database you regularly vacuumed is fine so maybe the
> corruption is in some other database you don't remember?
>
>
>>1) Why do have we data corruption? I thought we were doing everything we
>>needed to stop any wraparound... Are the pg docs inadequate, or did I
>>misunderstand what needed to be done?
>
>
> You *may* have corruption. Anything you vacuumed recently should be
> fine.
>
>
>>2) What can I do to recover the data?
>
>
> Check whether anything is lost first.
>
>
>>How do I stop this EVER happening again??!!!
>
>
> Have you read this:
>
> http://www.postgresql.org/docs/7.4/static/maintenance.html#VACUUM-FOR-WRAPAROUND
>
> Hope this helps,

Re: Please HELP - URGENT - transaction wraparound error

From
Tom Lane
Date:
John Sidney-Woollett <johnsw@wardbrook.com> writes:
> I decided to switch over to the slave which is now our live database.
> the old master with the problem has already been re-inited (although I
> have a cold backup of the data dir), plus dump files that I can restore
> from.

You panicked much too quickly and destroyed the evidence ... unless by
"cold backup" you mean a filesystem backup, in which case what you
should do is restore that and take a look at what's in its pg_database.
I think there's no question that there is some omission in your vacuuming
procedures, and you need to find out what it is.

            regards, tom lane

Re: Please HELP - URGENT - transaction wraparound error

From
John Sidney-Woollett
Date:
Hi Tom

You're not wrong about panicking! This is the worst Sunday I've had in a
while... No sunday lunch or time with the kids... :(

This database supports a (normally 24/7) website and we couldn't
tolerate any possibility of data corruption. I had to make a judgement
call on preventing any/further data loss or corruption, and switching
over to the slave seemed the safest thing to do (based on my ignorance
of the wraparound problem).

I can restore the file system backup of pgsql/data to another database
server and then get the info from pg_database. Or I can import a dump
file from 15 minutes before I re-inited the database...

What exactly am I looking for though?

We don't use OIDs when creating tables...

Could Slon 1.1.0 be causing a problem for us? It must be creating and
deleting bucket loads of records as part of its regular activity...

What am I likely to have missed in my vacuuming? Because whatever I did
wrong is going to break our current live database at some point soon.

Thanks

John



Tom Lane wrote:
> John Sidney-Woollett <johnsw@wardbrook.com> writes:
>
>>I decided to switch over to the slave which is now our live database.
>>the old master with the problem has already been re-inited (although I
>>have a cold backup of the data dir), plus dump files that I can restore
>>from.
>
>
> You panicked much too quickly and destroyed the evidence ... unless by
> "cold backup" you mean a filesystem backup, in which case what you
> should do is restore that and take a look at what's in its pg_database.
> I think there's no question that there is some omission in your vacuuming
> procedures, and you need to find out what it is.
>
>             regards, tom lane

Re: Please HELP - URGENT - transaction wraparound error

From
Tom Lane
Date:
John Sidney-Woollett <johnsw@wardbrook.com> writes:
> I can restore the file system backup of pgsql/data to another database
> server and then get the info from pg_database. Or I can import a dump
> file from 15 minutes before I re-inited the database...

Importing a dump will tell you nothing at all, as all the data will be
freshly loaded.

> What exactly am I looking for though?

SELECT datname, age(datfrozenxid) FROM pg_database;

where the second column approaches 2 billion.

Alternatively, wait a few weeks and note which entries in your live
database are increasing rather than staying near 1 billion.

            regards, tom lane

Re: Please HELP - URGENT - transaction wraparound error

From
John Sidney-Woollett
Date:
OK, I restored the pgsql/data to another server and started up postgres

and this is what I got:

  SELECT datname, age(datfrozenxid) FROM pg_database;
    datname    |     age
--------------+-------------
  mail_lxtreme | -2074187459
  bp_live      |  1079895636
  template1    |  1076578064
  template0    | -2074187459
(4 rows)

mail_lxtreme is a test mail db and I don't care about it. So it could
have been deleted without any worries...

Which databases are a problem? Is it template0 or bp_live and template1?

Thanks

John

Tom Lane wrote:
> John Sidney-Woollett <johnsw@wardbrook.com> writes:
>
>>I can restore the file system backup of pgsql/data to another database
>>server and then get the info from pg_database. Or I can import a dump
>>file from 15 minutes before I re-inited the database...
>
>
> Importing a dump will tell you nothing at all, as all the data will be
> freshly loaded.
>
>
>>What exactly am I looking for though?
>
>
> SELECT datname, age(datfrozenxid) FROM pg_database;
>
> where the second column approaches 2 billion.
>
> Alternatively, wait a few weeks and note which entries in your live
> database are increasing rather than staying near 1 billion.
>
>             regards, tom lane

Re: Please HELP - URGENT - transaction wraparound error

From
Tom Lane
Date:
John Sidney-Woollett <johnsw@wardbrook.com> writes:
> OK, I restored the pgsql/data to another server and started up postgres
> and this is what I got:

>   SELECT datname, age(datfrozenxid) FROM pg_database;
>     datname    |     age
> --------------+-------------
>   mail_lxtreme | -2074187459
>   bp_live      |  1079895636
>   template1    |  1076578064
>   template0    | -2074187459
> (4 rows)

> mail_lxtreme is a test mail db and I don't care about it. So it could
> have been deleted without any worries...

> Which databases are a problem? Is it template0 or bp_live and template1?

mail_lxtreme is exactly the problem.  You weren't vacuuming it...

(template0 is a special case and can be ignored.)

            regards, tom lane

Re: Please HELP - URGENT - transaction wraparound error

From
John Sidney-Woollett
Date:
Hmm. I'm pretty sure that database mail_lxtreme was unused (no
connections/activity) - I didn't think that it would need to be vacuumed
at all...

Just out of curiousity would the wraparound error (for mail_lxtreme)
actually have affected data in bp_live?

Could I just have deleted mail_lxtreme and then continued to use bp_live
as though nothing had happened?

Or had database bp_live already been damaged by the wraparound?

Thanks for your great help/advice - it's much appreciated.

John

Tom Lane wrote:
> John Sidney-Woollett <johnsw@wardbrook.com> writes:
>
>>OK, I restored the pgsql/data to another server and started up postgres
>>and this is what I got:
>
>
>>  SELECT datname, age(datfrozenxid) FROM pg_database;
>>    datname    |     age
>>--------------+-------------
>>  mail_lxtreme | -2074187459
>>  bp_live      |  1079895636
>>  template1    |  1076578064
>>  template0    | -2074187459
>>(4 rows)
>
>
>>mail_lxtreme is a test mail db and I don't care about it. So it could
>>have been deleted without any worries...
>
>
>>Which databases are a problem? Is it template0 or bp_live and template1?
>
>
> mail_lxtreme is exactly the problem.  You weren't vacuuming it...
>
> (template0 is a special case and can be ignored.)
>
>             regards, tom lane

Re: Please HELP - URGENT - transaction wraparound error

From
Martijn van Oosterhout
Date:
On Sun, Oct 30, 2005 at 06:41:45PM +0000, John Sidney-Woollett wrote:
> Hmm. I'm pretty sure that database mail_lxtreme was unused (no
> connections/activity) - I didn't think that it would need to be vacuumed
> at all...

A database that is never used still needs to be vacuumed. The only
exception is if you VACUUM FREEZE which puts the entire database in a
frozen state which will never need vacuuming. This is how template0 is
configured. Ofcourse, once you make changes...

> Just out of curiousity would the wraparound error (for mail_lxtreme)
> actually have affected data in bp_live?

I doubt it but (thinking shared tables) I'll have to defer to someone
more knowledgable.

> Could I just have deleted mail_lxtreme and then continued to use bp_live
> as though nothing had happened?
>
> Or had database bp_live already been damaged by the wraparound?

Everything would probably have been fine.

BTW, I would have thought this message would have been appearing the
last billion transactions or so, didn't anyone notice?

To solve this forever, setup a cronjob for once a month:

vacuumdb -a

This will vacuum every database, even if you don't know the names or
where they came from. AIUI when you vacuum a database whose
transactions are over billion transactions old it automatically puts it
in "frozen" state. If someone had happened to run "vacuumdb -a" anytime
in the last few months, you might never have noticed the wraparound...

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: Please HELP - URGENT - transaction wraparound error

From
Tom Lane
Date:
John Sidney-Woollett <johnsw@wardbrook.com> writes:
> Just out of curiousity would the wraparound error (for mail_lxtreme)
> actually have affected data in bp_live?
> Could I just have deleted mail_lxtreme and then continued to use bp_live
> as though nothing had happened?

No, and yes, which is why panicking was not warranted ;-)

Martijn's advice to be using "vacuumdb -a" every so often is well given,
though.

You could also consider switching over to autovacuum, particularly as of
8.1.  (I'm not sure how much I trust the contrib version that exists in
8.0, and 7.4's is definitely pretty buggy, but I believe 8.1's can be
relied on to prevent this sort of thing.)

            regards, tom lane

Re: Please HELP - URGENT - transaction wraparound error

From
John Sidney-Woollett
Date:
Martijn

Thanks for the answers/thoughts...

Vacuumuming the databases hammers the server so the vacuums are spread
out at different times during the night/morning. Plus template1 is
vacuumed once a week.

I guess I was unlucky to have missed the vacuum on that unused database
(due to my misunderstanding), and not to have been scanning the
serverlog more frequently (if at all recently!).

My solution is to create a nagios script that scans entries in serverlog
loking for WARN or ERROR messages in the past xx minutes.

With this in place, I would have caught this error weeks ago before it
bit me in the ass!

Stressful day, but learnt a lot...

Thanks for everyone for their input - great product and great support!

John

Martijn van Oosterhout wrote:
> On Sun, Oct 30, 2005 at 06:41:45PM +0000, John Sidney-Woollett wrote:
>
>>Hmm. I'm pretty sure that database mail_lxtreme was unused (no
>>connections/activity) - I didn't think that it would need to be vacuumed
>>at all...
>
>
> A database that is never used still needs to be vacuumed. The only
> exception is if you VACUUM FREEZE which puts the entire database in a
> frozen state which will never need vacuuming. This is how template0 is
> configured. Ofcourse, once you make changes...
>
>
>>Just out of curiousity would the wraparound error (for mail_lxtreme)
>>actually have affected data in bp_live?
>
>
> I doubt it but (thinking shared tables) I'll have to defer to someone
> more knowledgable.
>
>
>>Could I just have deleted mail_lxtreme and then continued to use bp_live
>>as though nothing had happened?
>>
>>Or had database bp_live already been damaged by the wraparound?
>
>
> Everything would probably have been fine.
>
> BTW, I would have thought this message would have been appearing the
> last billion transactions or so, didn't anyone notice?
>
> To solve this forever, setup a cronjob for once a month:
>
> vacuumdb -a
>
> This will vacuum every database, even if you don't know the names or
> where they came from. AIUI when you vacuum a database whose
> transactions are over billion transactions old it automatically puts it
> in "frozen" state. If someone had happened to run "vacuumdb -a" anytime
> in the last few months, you might never have noticed the wraparound...
>
> Hope this helps,

Re: Please HELP - URGENT - transaction wraparound error

From
John Sidney-Woollett
Date:
"Panic" - that's my middle name. ;)

Had I known how to identify the database at fault, and that it would
have had no effect on the other databases, then I would have handled
this episode differently.

In the event, things seem to be OK. Our old slave db is now acting as
master and the old master rebuilt as the new slave ... courtesy of slon.

I'd like to move to 8.1 but I'm waiting for a quiet period when there's
less development/fire fighting so that I can test all the java
components of our webapp and then manage the upgrade properly.

Maybe suppressing other vacuums once a month, and running the "vacuumdb
-a" option instead wouldn't be a bad idea...

Many thanks for all your support and advice - you've been great help
(and comfort).

John


Tom Lane wrote:
> John Sidney-Woollett <johnsw@wardbrook.com> writes:
>
>>Just out of curiousity would the wraparound error (for mail_lxtreme)
>>actually have affected data in bp_live?
>>Could I just have deleted mail_lxtreme and then continued to use bp_live
>>as though nothing had happened?
>
>
> No, and yes, which is why panicking was not warranted ;-)
>
> Martijn's advice to be using "vacuumdb -a" every so often is well given,
> though.
>
> You could also consider switching over to autovacuum, particularly as of
> 8.1.  (I'm not sure how much I trust the contrib version that exists in
> 8.0, and 7.4's is definitely pretty buggy, but I believe 8.1's can be
> relied on to prevent this sort of thing.)
>
>             regards, tom lane

Re: Please HELP - URGENT - transaction wraparound

From
Lincoln Yeoh
Date:
At 07:48 PM 10/30/2005 +0000, John Sidney-Woollett wrote:

>"Panic" - that's my middle name. ;)
>
>Had I known how to identify the database at fault, and that it would have
>had no effect on the other databases, then I would have handled this
>episode differently.

Wonder if it would be a good idea for the error messages to identify which
databases might have lost data.

However if you have a fair number of databases involved you might get a
fair number of log messages. Still, I think I wouldn't mind 100 lines in
the logs if I had 100 databases at risk...

:)

Link.


Re: Please HELP - URGENT - transaction wraparound error

From
"John Sidney-Woollett"
Date:
Lincoln Yeoh said:
> At 07:48 PM 10/30/2005 +0000, John Sidney-Woollett wrote:
>
>>"Panic" - that's my middle name. ;)
>>
>>Had I known how to identify the database at fault, and that it would have
>>had no effect on the other databases, then I would have handled this
>>episode differently.
>
> Wonder if it would be a good idea for the error messages to identify which
> databases might have lost data.
>
> However if you have a fair number of databases involved you might get a
> fair number of log messages. Still, I think I wouldn't mind 100 lines in
> the logs if I had 100 databases at risk...
>

Agreed!

John


Re: Please HELP - URGENT - transaction wraparound error

From
Martijn van Oosterhout
Date:
On Mon, Oct 31, 2005 at 05:27:15PM -0000, John Sidney-Woollett wrote:
> > Wonder if it would be a good idea for the error messages to identify which
> > databases might have lost data.
> >
> > However if you have a fair number of databases involved you might get a
> > fair number of log messages. Still, I think I wouldn't mind 100 lines in
> > the logs if I had 100 databases at risk...

FWIW, the 8.1beta has the following code:

                ereport(WARNING,
                   (errmsg("database \"%s\" must be vacuumed within %u transactions",
                                   NameStr(oldest_datname),
                                   (MaxTransactionId >> 1) - age),
                        errhint("To avoid a database shutdown, execute a full-database VACUUM in \"%s\".",
                                        NameStr(oldest_datname))));


Should be enough I think...

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment