Thread: Database Disappeared
Some time ago (May 2004), we were making changes to a PostgreSQL server to allow a host to connect (pg_hba.conf). When we restarted the server and restarted the AOLserver process, there was only one table left; all the ACS tables were gone. While troubleshooting the problem, we found that the last known good dump (I pg_dump every four hours) was a month old (sometime in April), from when we made a change to the ttracker_tickets table. From that point in April until this failure, OpenACS worked fine, but the dumps were missing all ACS-related tables.

We have encountered this exact same problem again today. Originally, we thought it was due to a fault in our procedures for making the changes; however, this time no one was altering anything. This appears to have happened without any human intervention.

The following is a link to the last post we had with this same problem:
http://openacs.org/forums/message-view?message_id=185837

The following is a link to the post we made last Friday about the current problem:
http://openacs.org/forums/message-view?message_id=206092

We thought that this might be an AOLserver problem; however, through our investigation and comments from posts to the OpenACS board, we believe this is an error with PostgreSQL. We are hoping that someone might have some ideas or experience with this. Thanks!!
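For context, a four-hourly pg_dump like the one mentioned above would typically be driven by cron. The following is only an illustrative sketch; the database name ("openacs"), user, schedule, and paths are assumptions and are not taken from the post:

    # Hypothetical crontab entry: dump the OpenACS database every four hours.
    # Database name, user, and paths are assumptions for illustration only.
    0 */4 * * *  /usr/local/pgsql/bin/pg_dump -U nsadmin openacs > /var/backups/openacs.$(date +\%H).dump 2>> /var/log/pg_dump.err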
> We thought that this might be an AOLserver problem; however, through our
> investigation and comments from posts to the OpenACS board, we believe
> this is an error with PostgreSQL.
>
> We are hoping that someone might have some ideas or experience with
> this. Thanks!!

Are you vacuuming? How often?

Sincerely,

Joshua D. Drake

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
PostgreSQL support, programming, shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
"Roberts, Adam" <Adam.Roberts@echostar.com> writes: > The following is a link to the last post we had with this same problem: > http://openacs.org/forums/message-view?message_id=185837 Given the comment about 500M pg_clog, I would venture that you aren't doing adequate vacuuming and have therefore run into a transaction ID wraparound failure. 500Mb clog corresponds to 2 billion transactions which is right about where things would blow up if you'd not done the necessary database-wide vacuums; and the fact that clog hasn't gotten truncated is pretty strong evidence that you didn't. Please read the administrator's guide: http://www.postgresql.org/docs/7.4/static/maintenance.html (adjust for your PG version as needed) regards, tom lane
Tom,

Thanks so much for your reply. In fact, we were not vacuuming this particular db at all! While we have other PostgreSQL databases on this same host, this was a "recently" added db and it was never added to the individual db list to be vacuumed. We have since modified the cron job to perform a "vacuumdb -a -z" to avoid such an omission in the future.

However, I would like to make doubly sure that the behavior we saw is explainable by a transaction ID wraparound failure and that we don't have something else lurking. We originally discovered the problem last Friday because the nightly pg_dumpall did not export any rows from any of the tables associated with this instance. (From our failure in May, we learned to monitor this backup for problems.) Also, when connecting to the instance, either from the OpenACS application or via psql, selecting, adding, deleting, and modifying data in the db's tables all behaved completely normally. But if we attempted to execute a "\d" or "\df" from psql, connected either as postgres or as the db owner (nsadmin), we would get "No relations found."

So, in order to preserve the data as much as possible, we performed a "COPY ... TO" statement for each individual table, based on our known list of tables, so that we could re-import the data into a completely rebuilt instance (we dropped and recreated the instance in order to restore all functionality). The behavior remained consistent even after bouncing the postmaster and AOLserver.

So, my main question is: is it reasonable to say that a transaction ID wraparound failure could create a situation in which you can use/manipulate user data tables if you refer to them directly, but if you try to use a utility (such as pg_dump) or an internal psql query like \d or \df, it appears that you have no data tables?

Thanks again for your help - we are reviewing all routine maintenance tasks thoroughly for our environment.

On Mon, 2004-10-18 at 16:23, Tom Lane wrote:
> "Roberts, Adam" <Adam.Roberts@echostar.com> writes:
> > The following is a link to the last post we had with this same problem:
> > http://openacs.org/forums/message-view?message_id=185837
>
> Given the comment about 500MB of pg_clog, I would venture that you aren't
> doing adequate vacuuming and have therefore run into a transaction ID
> wraparound failure. 500MB of clog corresponds to 2 billion transactions,
> which is right about where things would blow up if you'd not done the
> necessary database-wide vacuums; and the fact that clog hasn't gotten
> truncated is pretty strong evidence that you didn't.
>
> Please read the administrator's guide:
> http://www.postgresql.org/docs/7.4/static/maintenance.html
> (adjust for your PG version as needed)
>
> regards, tom lane
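For reference, the revised cron job and the per-table rescue described in the message above might look roughly as follows. This is a sketch under assumptions: the schedule, paths, and database name are illustrative, and only the ttracker_tickets table name comes from the thread.

    # Cluster-wide vacuum/analyze, so a newly created database can't be missed.
    # Schedule, binary path, and log path are assumptions.
    30 2 * * *  /usr/local/pgsql/bin/vacuumdb -a -z -U postgres >> /var/log/vacuumdb.log 2>&1

    # Per-table rescue along the lines described above; the database name is an
    # example. \copy writes a client-side file, so it also works when connected
    # as a non-superuser such as the db owner (nsadmin).
    psql -U nsadmin openacs -c "\copy ttracker_tickets to '/tmp/ttracker_tickets.copy'"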
"Roberts, Adam" <Adam.Roberts@echostar.com> writes: > So, my main question is, is it reasonable to say that a trans id > wraparound failure could create a situation in which you could > use/manipulate user data tables if you refer to the data tables directly > but if you tried to use a util (such as pgdump) or an internal psql > query like \d or \df it would appear that you have no data tables? Yeah, it is possible, because the system's internal catalog fetches use SnapshotNow rules, which only look to see if a row's inserting/deleting transaction(s) committed or not; they don't apply any comparison to the current transaction ID. So what you've got is a situation where the tables' pg_class rows have wrapped around and become invisible to SQL queries, but the system's internal operations are still happy. regards, tom lane