Thread: Database Disappeared

Database Disappeared

From
"Roberts, Adam"
Date:
Some time ago (May 2004), we were making changes to a Postgresql server
to allow a host to connect (pg_hba.conf).  When we restarted the server,
and restarted the AOLserver process, there was only one table left. All
the ACS tables were gone.

As we were troubleshooting the problem, we found the last known good
dump (I pg_dump every four hours) was a month previous (sometime in
April), when we made a change to ttracker_tickets's table.

From that time in April up unitl we last had this problem, openacs
worked fine, but the dumps were missing all ACS related tables.

We have encountered this exact same problem again today.  Originally, we
thought it was due to a fault in our procedures in making the changes,
however, this time, no one was altering anything.  This appears to have
happened sans any human intervention.

The following is a link to the last post we had with this same problem:
http://openacs.org/forums/message-view?message_id=185837

The following is a link to the post we made last friday about the
current problem:
http://openacs.org/forums/message-view?message_id=206092


We thought that this might be an AOLserver problem, however, through our
investigation and comments from posts to the OpenACS board, we believe
this is an error with PostgresSQL.


We are hoping that someone might have some ideas or any experience with
this.  Thanks!!


Re: Database Disappeared

From
"Joshua D. Drake"
Date:
>We thought that this might be an AOLserver problem, however, through our
>investigation and comments from posts to the OpenACS board, we believe
>this is an error with PostgresSQL.
>
>
>We are hoping that someone might have some ideas or any experience with
>this.  Thanks!!
>
>
>
Are you vacuuming? How often?

Sincerely,

Joshua D. Drake




>---------------------------(end of broadcast)---------------------------
>TIP 4: Don't 'kill -9' the postmaster
>
>


--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL


Re: Database Disappeared

From
Tom Lane
Date:
"Roberts, Adam" <Adam.Roberts@echostar.com> writes:
> The following is a link to the last post we had with this same problem:
> http://openacs.org/forums/message-view?message_id=185837

Given the comment about 500M pg_clog, I would venture that you aren't
doing adequate vacuuming and have therefore run into a transaction ID
wraparound failure.  500Mb clog corresponds to 2 billion transactions
which is right about where things would blow up if you'd not done the
necessary database-wide vacuums; and the fact that clog hasn't gotten
truncated is pretty strong evidence that you didn't.

Please read the administrator's guide:
http://www.postgresql.org/docs/7.4/static/maintenance.html
(adjust for your PG version as needed)

            regards, tom lane

Re: Database Disappeared

From
"Roberts, Adam"
Date:
Tom,
Thanks so much for your reply.  In fact we were not vacuuming this
particular db at all! While we have other postgres db's on this same
host, this was a "recently" added db and it was never added to the
individual db list to be vacuumed.  We have since modified the cronjob
to perform a "vacuumdb -a -z" to avoid such an omission in the future.

However, I would like to make doubly sure that the behavior we saw is
explainable by a trans id wraparound failure and that we don't have
something else lurking.  We originally discovered the problem last
Friday because the nightly pg_dumpall did not export any rows from any
of the tables associated with this instance. From our failure in May, we
learned to monitor this backup for problems.  Also, when connecting to
the instance, (either by the OpenACS application or via psql) selecting,
add/del/modifying data from tables in the db all acted completely
normal.  But, if we attempted to execute a "\d" or "\df" from psql
connected either as postgres or as the db owner (nsadmin), we would get
a "No relations found." error.  So, in order to preserve the data as
much as possible, we actually performed a "copy to..." stmt for each
individual table based on our known list of tables so that we could
re-import the data back into a completely rebuilt instance (we dropped
and recreated the instance in order to restore all functionality).  The
behavior remained consistent even after bouncing the postmaster and
aolserver.

So, my main question is, is it reasonable to say that a trans id
wraparound failure could create a situation in which you could
use/manipulate user data tables if you refer to the data tables directly
but if you tried to use a util (such as pgdump) or an internal psql
query like \d or \df it would appear that you have no data tables?

Thanks again for your help - we are reviewing all routine maint tasks
thoroughly for our environment.


On Mon, 2004-10-18 at 16:23, Tom Lane wrote:
> "Roberts, Adam" <Adam.Roberts@echostar.com> writes:
> > The following is a link to the last post we had with this same
problem:
> > http://openacs.org/forums/message-view?message_id=185837
>
> Given the comment about 500M pg_clog, I would venture that you aren't
> doing adequate vacuuming and have therefore run into a transaction ID
> wraparound failure.  500Mb clog corresponds to 2 billion transactions
> which is right about where things would blow up if you'd not done the
> necessary database-wide vacuums; and the fact that clog hasn't gotten
> truncated is pretty strong evidence that you didn't.
>
> Please read the administrator's guide:
> http://www.postgresql.org/docs/7.4/static/maintenance.html
> (adjust for your PG version as needed)
>
>                       regards, tom lane
>

Re: Database Disappeared

From
Tom Lane
Date:
"Roberts, Adam" <Adam.Roberts@echostar.com> writes:
> So, my main question is, is it reasonable to say that a trans id
> wraparound failure could create a situation in which you could
> use/manipulate user data tables if you refer to the data tables directly
> but if you tried to use a util (such as pgdump) or an internal psql
> query like \d or \df it would appear that you have no data tables?

Yeah, it is possible, because the system's internal catalog fetches use
SnapshotNow rules, which only look to see if a row's inserting/deleting
transaction(s) committed or not; they don't apply any comparison to the
current transaction ID.  So what you've got is a situation where the
tables' pg_class rows have wrapped around and become invisible to SQL
queries, but the system's internal operations are still happy.

            regards, tom lane