Slony uninstall info/warning - Mailing list pgsql-general

From John Sidney-Woollett
Subject Slony uninstall info/warning
Date
Msg-id 4211C7C5.2070903@wardbrook.com
Whole thread Raw
Responses Re: Slony uninstall info/warning
List pgsql-general
Hopefully this will prevent data loss or problems for others using slony
1.0.5 and pg 7.4.6...

We just got bitten by something we didn't foresee when completely
uninstalling a slony replication cluster from the master and slave...

MAKE SURE YOU STOP YOUR APPLICATION RUNNING AGAINST YOUR MASTER DATABASE
WHEN REMOVING THE WHOLE SLONY CLUSTER, or at least re-cycle all your
open connections after the event!

The connections appear to "remember" or refer to objects which are
removed by the uninstall node script. And you get lots of errors as a
result (and possible data loss??)...

Question: Why do our database objects still refer to removed slony
objects after they are removed?

John Sidney-Woollett

More info...

Our system is a web application which runs against a postgres 7.4.6
database. Tomcat is restarted at 5am each day.

Last Friday afternoon we uninstalled the slony cluster (1 master + 1
slave) so that we could add a new second slave. (I find it easier to
uninstall the cluster and then reapply a new setup, subscribe script
rather than trying to add the single node.)

The cluster was successfully removed, and then rebuilt with 1 master and
2 slave nodes.

However, we didn't stop and start our web application which uses Tomcat
connection pool and continued to run against the master (during the
uninstall and rebuild). Everything appeared fine.

Only today while checking something else did I notice lots of
ERROR:  relation with OID 1036711 does not exist
errors in the postgres serverlog

In our tomcat application we also saw lots of errors like
org.postgresql.util.PSQLException: ERROR: relation with OID 1036711 does
not exist
at
org.postgresql.util.PSQLException.parseServerError(PSQLException.java:139)
at org.postgresql.core.QueryExecutor.executeV3(QueryExecutor.java:154)
at org.postgresql.core.QueryExecutor.execute(QueryExecutor.java:101)
at org.postgresql.core.QueryExecutor.execute(QueryExecutor.java:43)
...

Basically these errors started when the cluster was uninstalled and
continued until the web application was restarted (at 5am the following
morning).

I'm not sure what object had OID 1036711 - maybe the slony replication
schema?? Is there a way to find out after the event?

I do have daily full backups of the master each day going back 1 month -
I could load one into another database and lookup the OID, if someone
could explain how to do that. And if that would be useful?

But the net effect is that we got bizarre failures in our application,
and large gaps in primary key values (using sequences) in certain tables
within the database. On the whole the app seemed to be behaving fine,
with the odd error that we put down to user error... (bad assumption, I
guess in hindsight)

At the moment I'm trying to figure out exactly what kind of damage our
data has suffered.

If anyone has any ideas or suggestions as to what went wrong or what to
check for, I'd appreciate hearing.

Thanks

John

pgsql-general by date:

Previous
From: "Rey Arqueza"
Date:
Subject: Re: /usr/sbin/useradd is needed by postgresql-server-8.0.1-PGDG
Next
From: Konstantin Danilov
Date:
Subject: How to view the list of tables?