Thread: 7.2.3 vacuum bug

7.2.3 vacuum bug

From
Rod Taylor
Date:
ERROR:  RelationClearRelation: relation 11584078 deleted while still in
use 

I've been unable to come up with a test case that will cause the
problem, seems to be timing related.  The queries that are currently
running when these errors occur do a lot or work with temp tables that
are frequently truncated.

--  Rod Taylor



Re: 7.2.3 vacuum bug

From
Neil Conway
Date:
Rod Taylor <rbt@rbt.ca> writes:
> ERROR:  RelationClearRelation: relation 11584078 deleted while still in
> use 

I was going to report a similar error that arises in a different
situation:

client 1:

CREATE TABLE a (b int);
BEGIN;
DROP TABLE a;
-- wait

client 2:

SELECT * FROM a;

client 1:

COMMIT;

Now, client 2 will receive "RelationClearRelation: relation 25172
deleted while still in use", rather than "Relation "a" does not
exist", as you might expect. Not sure if it's the same bug, or just a
different problem...

Cheers,

Neil

-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC



Re: 7.2.3 vacuum bug

From
Rod Taylor
Date:
On Wed, 2002-10-30 at 15:38, Neil Conway wrote:
> Rod Taylor <rbt@rbt.ca> writes:
> > ERROR:  RelationClearRelation: relation 11584078 deleted while still in
> > use 
> 
> I was going to report a similar error that arises in a different
> situation:

Probably a different look at the same problem.

--  Rod Taylor



Re: 7.2.3 vacuum bug

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> client 1:

> CREATE TABLE a (b int);
> BEGIN;
> DROP TABLE a;
> -- wait

> client 2:

> SELECT * FROM a;

> client 1:

> COMMIT;

> Now, client 2 will receive "RelationClearRelation: relation 25172
> deleted while still in use", rather than "Relation "a" does not
> exist", as you might expect.

But relation "a" *does* exist at the start of client 2's operation.
While I'm not here to defend the exact phrasing of this error message,
it does seem to me that it's appropriate to give a different error
message than what appears when the table wasn't found at all.

An example of why the two cases shouldn't be folded together: suppose
that client 2's schema search path is "myschema, public", and that
client 1 creates/drops myschema.a while there is also a public.a.
client 2 will locate myschema.a as the meaning of "a", and one way or
another it is going to error out when myschema.a gets dropped from
underneath it --- it will not (and shouldn't IMHO) go back and repeat
the schema search to find public.a.  But a user who gets a "Relation "a"
does not exist" error message in such a scenario would be justifiably
confused.
        regards, tom lane


Re: 7.2.3 vacuum bug

From
Neil Conway
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> But relation "a" *does* exist at the start of client 2's operation.
> While I'm not here to defend the exact phrasing of this error message,
> it does seem to me that it's appropriate to give a different error
> message than what appears when the table wasn't found at all.

Ok, fair enough -- I agree that we should treat the two cases
differently. But one thing I think we should do in any case is improve
the wording of the error message.

Cheers,

Neil

-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC



Re: 7.2.3 vacuum bug

From
Tom Lane
Date:
Rod Taylor <rbt@rbt.ca> writes:
> ERROR:  RelationClearRelation: relation 11584078 deleted while still in
> use 

> I've been unable to come up with a test case that will cause the
> problem, seems to be timing related.  The queries that are currently
> running when these errors occur do a lot or work with temp tables that
> are frequently truncated.

Hm.  vacuum.c tries to avoid this class of problem:
   /*    * Race condition -- if the pg_class tuple has gone away since the    * last time we saw it, we don't need to
vacuumit.    */   if (!SearchSysCacheExists(RELOID,                             ObjectIdGetDatum(relid),
            0, 0, 0))   {       CommitTransactionCommand(true);       return true;            /* okay 'cause no data
there*/   }
 
   ...
   onerel = relation_open(relid, lmode);

but on reflection it's clear that this doesn't really prevent a race
condition.  If the table is already exclusive-locked by a DROP TABLE
that hasn't committed yet (eg, the implicit DROP that happens when temp
tables are cleared out at backend exit), then the syscache lookup will
go fine, but the relation_open() routine blocks waiting for lock and
eventually fails.

What would probably work better is to first lock the relation OID,
then see if we can open the relation or not.

Thinking further, it's really kinda bogus that LockRelation() works on
an already-opened Relation; if possible we should acquire the lock
before attempting to create a relcache entry.  (We only need to know the
OID and the relisshared status before we can make a locktag, so it'd be
possible to acquire the lock using only the contents of the pg_class row.)
Not sure how much code restructuring might be involved to make this
happen, but it'd be worth thinking about for 7.4.
        regards, tom lane


Re: 7.2.3 vacuum bug

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> Ok, fair enough -- I agree that we should treat the two cases
> differently. But one thing I think we should do in any case is improve
> the wording of the error message.

Got a suggestion?
        regards, tom lane


Re: 7.2.3 vacuum bug

From
"scott.marlowe"
Date:
On Thu, 31 Oct 2002, Tom Lane wrote:

> Neil Conway <neilc@samurai.com> writes:
> > Ok, fair enough -- I agree that we should treat the two cases
> > differently. But one thing I think we should do in any case is improve
> > the wording of the error message.
> 
> Got a suggestion?

Change:  RelationClearRelation: relation 25172 deleted while still in use
to: RelationClearRelation: a relation (id: 25172) was deleted while still 
in use



Re: 7.2.3 vacuum bug

From
Rod Taylor
Date:
Found another:

ERROR:  cannot find attribute 2 of relation pg_temp_12100_0

On Thu, 2002-10-31 at 11:33, scott.marlowe wrote:
> On Thu, 31 Oct 2002, Tom Lane wrote:
> 
> > Neil Conway <neilc@samurai.com> writes:
> > > Ok, fair enough -- I agree that we should treat the two cases
> > > differently. But one thing I think we should do in any case is improve
> > > the wording of the error message.
> > 
> > Got a suggestion?
> 
> Change:  RelationClearRelation: relation 25172 deleted while still in use
> to: RelationClearRelation: a relation (id: 25172) was deleted while still 
> in use
> 
> 
--  Rod Taylor



Re: 7.2.3 vacuum bug

From
Tom Lane
Date:
Rod Taylor <rbt@rbt.ca> writes:
> Found another:
> ERROR:  cannot find attribute 2 of relation pg_temp_12100_0

Can you reproduce that?

It could be that this just represents someone's temp table deletion
committing while VACUUM is partway through trying to build a relcache
entry to open the relation.  If so, it is only another manifestation
of the should-lock-before-relation-open problem.
        regards, tom lane


Re: 7.2.3 vacuum bug

From
Rod Taylor
Date:
On Thu, 2002-10-31 at 13:03, Tom Lane wrote:
> Rod Taylor <rbt@rbt.ca> writes:
> > Found another:
> > ERROR:  cannot find attribute 2 of relation pg_temp_12100_0
> 
> Can you reproduce that?
> 
> It could be that this just represents someone's temp table deletion
> committing while VACUUM is partway through trying to build a relcache
> entry to open the relation.  If so, it is only another manifestation
> of the should-lock-before-relation-open problem.

Yes, but not easily (very timing dependent), takes a lot of worker
processes to throw it. So it's likely a part of the locking issue.

--  Rod Taylor



Re: 7.2.3 vacuum bug

From
Bruce Momjian
Date:
Is this a TODO?

---------------------------------------------------------------------------

Tom Lane wrote:
> Rod Taylor <rbt@rbt.ca> writes:
> > ERROR:  RelationClearRelation: relation 11584078 deleted while still in
> > use 
> 
> > I've been unable to come up with a test case that will cause the
> > problem, seems to be timing related.  The queries that are currently
> > running when these errors occur do a lot or work with temp tables that
> > are frequently truncated.
> 
> Hm.  vacuum.c tries to avoid this class of problem:
> 
>     /*
>      * Race condition -- if the pg_class tuple has gone away since the
>      * last time we saw it, we don't need to vacuum it.
>      */
>     if (!SearchSysCacheExists(RELOID,
>                               ObjectIdGetDatum(relid),
>                               0, 0, 0))
>     {
>         CommitTransactionCommand(true);
>         return true;            /* okay 'cause no data there */
>     }
> 
>     ...
> 
>     onerel = relation_open(relid, lmode);
> 
> but on reflection it's clear that this doesn't really prevent a race
> condition.  If the table is already exclusive-locked by a DROP TABLE
> that hasn't committed yet (eg, the implicit DROP that happens when temp
> tables are cleared out at backend exit), then the syscache lookup will
> go fine, but the relation_open() routine blocks waiting for lock and
> eventually fails.
> 
> What would probably work better is to first lock the relation OID,
> then see if we can open the relation or not.
> 
> Thinking further, it's really kinda bogus that LockRelation() works on
> an already-opened Relation; if possible we should acquire the lock
> before attempting to create a relcache entry.  (We only need to know the
> OID and the relisshared status before we can make a locktag, so it'd be
> possible to acquire the lock using only the contents of the pg_class row.)
> Not sure how much code restructuring might be involved to make this
> happen, but it'd be worth thinking about for 7.4.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: 7.2.3 vacuum bug

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Is this a TODO?

Yes.  Maybe

* Acquire lock on a relation before building a relcache entry for it

I'm not quite sure yet how this should interact with the case where
you already have a relcache entry, but certainly the existing behavior
of "build the whole entry and then acquire lock" is not good.
        regards, tom lane


Re: 7.2.3 vacuum bug

From
Bruce Momjian
Date:
Added to TODO.

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Is this a TODO?
> 
> Yes.  Maybe
> 
> * Acquire lock on a relation before building a relcache entry for it
> 
> I'm not quite sure yet how this should interact with the case where
> you already have a relcache entry, but certainly the existing behavior
> of "build the whole entry and then acquire lock" is not good.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073