Thread: another failover testing question

From
"David Parker"
Date:
Something that we end up doing sometimes in our failover testing is removing slony replication from an "active" (data provider) server. Because this involves removing triggers from tables, we end up with currently connected clients getting a bunch of "OID 123 not found" errors, where the OID is that of the recently removed trigger.
 
Is there any way short of cycling all client connections to have the server processes clean that information out of their cache when an object disappears like this from the database?
 
(I'm posting here rather than the slony list because it seems like a general question....)
 
Thanks.

- DAP
----------------------------------------------------------------------------------
David Parker    Tazz Networks    (401) 709-5130
 

 

Re: another failover testing question

From
Tom Lane
Date:
"David Parker" <dparker@tazznetworks.com> writes:
> Something that we end up doing sometimes in our failover testing is
> removing slony replication from an "active" (data provider) server.
> Because this involves removing triggers from tables, we end up with
> currently connected clients getting a bunch of "OID 123 not found"
> errors, where the OID is that of the recently removed trigger.

> Is there any way short of cycling all client connections to have the
> server processes clean that information out of their cache when an
> object disappears like this from the database?

AFAICS, there already *is* adequate interlocking for this.  What PG
version are you testing, and can you provide a self-contained test case?

            regards, tom lane

Re: another failover testing question

From
"David Parker"
Date:
Sorry, neglected the version yet again: 7.4.5. What happens is that we
have active connections accessing tables that are being replicated by
slony. Then somebody does an uninstall of slony, which removes the slony
trigger from those tables. Then we start getting the OID error.

If this should indeed not be an issue in 7.4.5, I will try to come up
with a test case independent of a slony install.

Thanks.

- DAP

>-----Original Message-----
>From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>Sent: Thursday, May 26, 2005 4:30 PM
>To: David Parker
>Cc: postgres general
>Subject: Re: [GENERAL] another failover testing question
>
>"David Parker" <dparker@tazznetworks.com> writes:
>> Something that we end up doing sometimes in our failover testing is
>> removing slony replication from an "active" (data provider) server.
>> Because this involves removing triggers from tables, we end up with
>> currently connected clients getting a bunch of "OID 123 not found"
>> errors, where the OID is that of the recently removed trigger.
>
>> Is there any way short of cycling all client connections to have the
>> server processes clean that information out of their cache when an
>> object disappears like this from the database?
>
>AFAICS, there already *is* adequate interlocking for this.
>What PG version are you testing, and can you provide a
>self-contained test case?
>
>            regards, tom lane
>

Re: another failover testing question

From
Tom Lane
Date:
"David Parker" <dparker@tazznetworks.com> writes:
> Sorry, neglected the version yet again: 7.4.5. What happens is that we
> have active connections accessing tables that are being replicated by
> slony. Then somebody does an uninstall of slony, which removes the slony
> trigger from those tables. Then we start getting the OID error.
> If this should indeed not be an issue in 7.4.5, I will try to come up
> with a test case independent of a slony install.

It should not be ... at least, assuming that Slony is using the standard
DROP TRIGGER operation, rather than playing directly with the system
catalogs ...

            regards, tom lane

Re: another failover testing question

From
"David Parker"
Date:
>It should not be ... at least, assuming that Slony is using
>the standard DROP TRIGGER operation, rather than playing
>directly with the system catalogs ...

AFAICS, the slony uninstall command is not doing anything exotic, though
it DOES do a little bit of fiddling with pg_catalog to RESTORE
previously disabled triggers. Otherwise it is using plain vanilla DROP
TRIGGER.

I found a slony list thread from a few months ago that discussed this
issue: http://archives.postgresql.org/pgsql-general/2005-02/msg00813.php

The discussion there centered around cached plans causing the "no
relation with OID" problem. The area of our code that experiences these
problems is calling libpq - we have a wrapper for it that plugs into our
Tcl environment - but it is not using prepared statements, and the
commands it is executing are not calls to stored procedures, etc.

I cannot repro this problem simply using psql, so it must have something
to do with the way we are using libpq, but I have no idea what object(s)
we are holding onto that reference slony OIDs.

- DAP

Re: another failover testing question

From
"David Parker"
Date:
I know better what is happening now. I had the scenario slightly wrong.

Slony creates a trigger on all replicated tables that calls into a
shared library. The _Slony_I_logTrigger function in this library
establishes a saved plan for inserts into its transaction log table
sl_log_1. I can create the missing OID error with:

1) configure replication
2) establish a client connection, perform operations on replicated
tables
3) remove replication (drops sl_log_1 table)
4) operations on replicated tables on client connection are still fine
5) re-configure replication (re-creates sl_log_1 table)
6) now the OID error appears in the client connection. The OID refers
to the previous version of the sl_log_1 table

I was pawing through our code to figure out where we might be saving a
prepared statement, and was forgetting that the slony1_funcs library
does this. This saved plan is executed with SPI_execp, and the
documentation states:

"If one of the objects (a table, function, etc.) referenced by the
prepared plan is dropped during the session then the results of
SPI_execp for this plan will be unpredictable."
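
The six steps above can be modeled with a small toy simulation. This is
only a sketch of the mechanism, not PostgreSQL or Slony code: the Catalog
and Session classes, the OID numbers, and the error wording are all
hypothetical stand-ins. The one assumption it encodes is that the saved
plan remembers the OID that sl_log_1 had when the plan was first built.

```python
class Catalog:
    """Toy system catalog: maps table names to OIDs (hypothetical model)."""

    def __init__(self):
        self._next_oid = 16384   # arbitrary starting value for the sketch
        self.by_name = {}
        self.by_oid = {}

    def create_table(self, name):
        oid = self._next_oid
        self._next_oid += 1
        self.by_name[name] = oid
        self.by_oid[oid] = []    # rows stored under the table's OID
        return oid

    def drop_table(self, name):
        oid = self.by_name.pop(name)
        del self.by_oid[oid]


class Session:
    """Toy backend session that keeps one saved plan for its lifetime."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.saved_plan_oid = None   # stands in for the plan saved by the trigger

    def log_trigger(self, row):
        # Build the plan once, resolving the table name to an OID,
        # then reuse it for every later call in this session.
        if self.saved_plan_oid is None:
            self.saved_plan_oid = self.catalog.by_name["sl_log_1"]
        oid = self.saved_plan_oid
        if oid not in self.catalog.by_oid:
            # Illustrative wording only
            raise LookupError(f"relation with OID {oid} does not exist")
        self.catalog.by_oid[oid].append(row)

    def write(self, row, replicated):
        # The log trigger fires only while replication is configured.
        if replicated:
            self.log_trigger(row)


def reproduce():
    catalog = Catalog()
    old_oid = catalog.create_table("sl_log_1")    # 1) configure replication
    session = Session(catalog)
    session.write("row-a", replicated=True)       # 2) client works, plan is saved
    catalog.drop_table("sl_log_1")                # 3) remove replication
    session.write("row-b", replicated=False)      # 4) trigger is gone, still fine
    new_oid = catalog.create_table("sl_log_1")    # 5) re-configure replication
    try:
        session.write("row-c", replicated=True)   # 6) saved plan uses the old OID
    except LookupError as exc:
        return old_oid, new_oid, str(exc)
    return old_oid, new_oid, None
```

In this model step 4 succeeds only because the trigger no longer fires, so
the stale plan is never touched until replication is configured again.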

I'm pretty sure I understand the problem now (corrections appreciated),
but I'm left with the operational question of how I get around this
issue. Is there any way short of PQreset to get a postgres process to
refresh its saved plans? I can generally avoid the
drop-replication/re-configure replication thing happening in our
procedures, but I can't prevent it completely....
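
For reference, the PQreset-style workaround mentioned above could look
roughly like the sketch below on the client side. Everything here is an
assumption for illustration: conn.execute and conn.reset are hypothetical
wrapper methods (reset playing the role of PQreset), FakeConnection exists
only to make the example runnable, and matching the error by its text is a
heuristic rather than a documented interface.

```python
import re

# Heuristic: treat any error that mentions an OID as a possible stale plan.
STALE_OID = re.compile(r"OID \d+")


def execute_with_reset(conn, sql, max_resets=1):
    """Run sql; on an OID-style error, reset the connection and retry."""
    resets = 0
    while True:
        try:
            return conn.execute(sql)
        except RuntimeError as exc:
            if resets >= max_resets or not STALE_OID.search(str(exc)):
                raise
            conn.reset()   # new backend process, so no saved plans survive
            resets += 1


class FakeConnection:
    """Stand-in for a real connection wrapper, for demonstration only."""

    def __init__(self):
        self.stale = True
        self.reset_count = 0

    def execute(self, sql):
        if self.stale:
            raise RuntimeError("relation with OID 16384 does not exist")
        return "ok"

    def reset(self):
        self.stale = False
        self.reset_count += 1
```

A reset discards all session state, including any open transaction, so a
retry like this is only safe for statements issued outside a transaction
block.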

- DAP


Re: another failover testing question

From
Tom Lane
Date:
"David Parker" <dparker@tazznetworks.com> writes:
> I can create the missing OID error with:

> 1) configure replication
> 2) establish a client connection, perform operations on replicated
> tables
> 3) remove replication (drops sl_log_1 table)
> 4) operations on replicated tables on client connection are still fine
> 5) re-configure replication (re-creates sl_log_1 table)
> 6) now the OID error appears in the client connection. The OID refers
> to the previous version of the sl_log_1 table

> I was pawing through our code to figure out where we might be saving a
> prepared statement, and was forgetting that the slony1_funcs library
> does this.

I think this is essentially a bug in the Slony library --- it ought to
provide a way to flush its internally cached plan(s).

In the longer term there may be infrastructure for automatic rebuilding
of invalidated plans, but I wouldn't hold my breath waiting for this.
(Even if it existed now, most likely the Slony code would have to change
to take advantage of it ...)
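
As a rough illustration of what "a way to flush its internally cached
plan(s)" could mean, here is a toy plan cache with an explicit flush hook.
It is not Slony's API and not SPI code: PlanCache, demo, the query text,
and the OID values are hypothetical, and a "plan" is reduced to the OID
that was resolved at prepare time.

```python
class PlanCache:
    """Toy cache of prepared plans with an explicit flush hook."""

    def __init__(self, prepare):
        self._prepare = prepare   # callable: query text -> plan
        self._plans = {}

    def get(self, query):
        # Prepare on first use, then hand back the cached plan.
        if query not in self._plans:
            self._plans[query] = self._prepare(query)
        return self._plans[query]

    def flush(self):
        # Drop every cached plan so the next get() prepares again.
        self._plans.clear()


def demo():
    current_oid = {"sl_log_1": 16384}   # arbitrary OIDs for the sketch

    def prepare(query):
        # A "plan" here is just the OID resolved when the plan is built.
        return current_oid["sl_log_1"]

    cache = PlanCache(prepare)
    query = "INSERT INTO sl_log_1 ..."

    first = cache.get(query)
    current_oid["sl_log_1"] = 16385     # table dropped and re-created
    stale = cache.get(query)            # still the old OID
    cache.flush()
    fresh = cache.get(query)            # re-prepared against the new table
    return first, stale, fresh
```

Something still has to decide when to call flush(); in this sketch that
decision is left to the caller.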

            regards, tom lane