Fixing Simms' vacuum problems - Mailing list pgsql-hackers

From Tom Lane
Subject Fixing Simms' vacuum problems
Date
Msg-id 15133.937072666@sss.pgh.pa.us
Responses Re: Fixing Simms' vacuum problems  (Michael Simms <grim@argh.demon.co.uk>)
Re: [HACKERS] Fixing Simms' vacuum problems  (The Hermit Hacker <scrappy@hub.org>)
Re: [HACKERS] Fixing Simms' vacuum problems  (Tatsuo Ishii <t-ishii@sra.co.jp>)
List pgsql-hackers
Michael Simms was kind enough to give me login privileges on his system
to poke at his problems with vacuum running concurrently with table
create/drop operations.  I am not sure why his setup shows the problem
more readily than mine does, but crashes certainly occur very easily
there, whereas it often takes me many tries to provoke one.

Anyway, I am now convinced that his symptoms are indeed explained by the
locking and cache-invalidation problems we have been discussing.  I saw
a number of different failures, but they all seemed to trace back to one
of two common themes:

(1) The non-vacuuming backend crashes because of accessing a
system-relation tuple that isn't in the same place anymore: the tuple
is found in the local syscache, but the item location recorded there is
stale because vacuum has moved the tuple, and the non-vacuum process
hasn't noticed the SI update message for it yet.

(2) The vacuuming backend can fail because of trying to vacuum a
relation that's already been deleted.  This can be blamed on the known
bug that DROP TABLE releases its exclusive lock on the target table
before end of transaction.
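
To make failure (2) concrete, here is a toy model in C. It is only a sketch and not PostgreSQL source: the names ToyTable, toy_drop, toy_commit and toy_vacuum are invented for illustration, and it assumes that the table's storage disappears at drop time while its catalog row stays visible to other backends until the dropping transaction commits.

```c
#include <stdbool.h>

/* Possible outcomes when a vacuum reaches the table in this toy model. */
typedef enum { VAC_OK, VAC_WAIT, VAC_SKIP, VAC_FAIL } VacResult;

/* Hypothetical stand-in for one user table; not a PostgreSQL structure. */
typedef struct {
    bool catalog_visible;  /* catalog row still visible to other backends */
    bool storage_exists;   /* underlying storage still present            */
    bool locked;           /* exclusive lock held by the dropper          */
} ToyTable;

/* DROP TABLE: take the lock and remove the storage (assumed to happen
 * immediately).  release_early = true mimics the bug described above,
 * where the lock is given up before end of transaction. */
static void toy_drop(ToyTable *t, bool release_early)
{
    t->locked = true;
    t->storage_exists = false;
    if (release_early)
        t->locked = false;
}

/* End of the dropping transaction: the catalog row goes away and any
 * lock still held is released. */
static void toy_commit(ToyTable *t)
{
    t->catalog_visible = false;
    t->locked = false;
}

/* What a concurrent vacuum sees when it reaches this table. */
static VacResult toy_vacuum(const ToyTable *t)
{
    if (t->locked)
        return VAC_WAIT;   /* blocked until the dropper finishes        */
    if (!t->catalog_visible)
        return VAC_SKIP;   /* table is gone cleanly, nothing to do      */
    if (!t->storage_exists)
        return VAC_FAIL;   /* catalog says it exists, storage does not  */
    return VAC_OK;
}
```

In this model, calling toy_vacuum between toy_drop and toy_commit gives VAC_FAIL when the lock was released early and VAC_WAIT when it was held; after toy_commit the vacuum simply gets VAC_SKIP.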

I expect there are also failures due to the lack-of-lock problems that
Hiroshi recently identified, but I didn't happen to see any of those in
the limited number of cases that I watched with the debugger.

So, it looks like a solution involves two components: first, being more
careful to lock system relations appropriately, and second, being sure
that SI messages are seen soon enough.  I think the read-SI-messages-
at-lock-time code that's already in place for 6.6 will be sufficient for
the second point, if we are religious about acquiring appropriate locks.
(BTW, I think that in most cases an appropriate lock on a system table
will be less strong than AccessExclusiveLock --- Vadim, do you agree?)
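
As a rough illustration of why reading SI messages at lock time closes the hole in (1), here is a toy model in C. It is a sketch under simplifying assumptions, not the actual 6.6 code: ToyShared, ToyBackend and all the toy_ functions are hypothetical names, the "heap" is a four-slot array, the cache holds a single entry, and the lock itself is omitted so that only the message processing is shown.

```c
#include <stdbool.h>

#define TOY_NSLOTS    4
#define TOY_QUEUE_MAX 8

/* Shared state: a tiny "heap" plus an invalidation message queue. */
typedef struct {
    int slots[TOY_NSLOTS];     /* tuple id stored in each slot, 0 = empty */
    int queue[TOY_QUEUE_MAX];  /* ids of tuples that have been moved      */
    int queue_len;
} ToyShared;

/* Per-backend state: a one-entry cache and a read position in the queue. */
typedef struct {
    int  tuple_id;   /* id of the cached tuple                 */
    int  slot;       /* item location recorded in the cache    */
    bool valid;      /* false once an invalidation has arrived */
    int  msgs_seen;  /* queue entries already consumed         */
} ToyBackend;

static int toy_find_slot(const ToyShared *sh, int id)
{
    for (int i = 0; i < TOY_NSLOTS; i++)
        if (sh->slots[i] == id)
            return i;
    return -1;
}

/* Vacuum moves a tuple to an empty slot and queues an invalidation. */
static bool toy_vacuum_move(ToyShared *sh, int id, int to_slot)
{
    int from = toy_find_slot(sh, id);

    if (from < 0 || to_slot < 0 || to_slot >= TOY_NSLOTS ||
        sh->slots[to_slot] != 0 || sh->queue_len >= TOY_QUEUE_MAX)
        return false;
    sh->slots[to_slot] = id;
    sh->slots[from] = 0;
    sh->queue[sh->queue_len++] = id;
    return true;
}

/* Consume unread messages, dropping the cache entry if it is named. */
static void toy_process_messages(const ToyShared *sh, ToyBackend *be)
{
    while (be->msgs_seen < sh->queue_len) {
        if (be->valid && sh->queue[be->msgs_seen] == be->tuple_id)
            be->valid = false;
        be->msgs_seen++;
    }
}

/* "Lock" the relation: a real lock request would go here; the point of
 * the sketch is that pending messages are read as part of locking. */
static void toy_lock_relation(const ToyShared *sh, ToyBackend *be)
{
    toy_process_messages(sh, be);
}

/* Fetch via the cache; returns true if the recorded location is good. */
static bool toy_fetch(const ToyShared *sh, ToyBackend *be, int id)
{
    if (!(be->valid && be->tuple_id == id)) {
        int slot = toy_find_slot(sh, id);

        if (slot < 0)
            return false;
        be->tuple_id = id;
        be->slot = slot;
        be->valid = true;
    }
    return sh->slots[be->slot] == id;
}
```

In this model, a toy_fetch issued after toy_vacuum_move but without an intervening toy_lock_relation follows the stale location and fails, while the same fetch made after toy_lock_relation reloads the entry and finds the tuple in its new slot.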

Once we have the changes, the next question is whether we want to risk
back-patching them into 6.5.2.  I can see several ways that we could
proceed:

1. Back-patch into REL6_5, and postpone 6.5.2 release for a while
   for beta-testing.
2. Put out 6.5.2 now (since it already has several other useful fixes),
   then back-patch, and release 6.5.3 after a beta-testing interval.
3. Leave these changes out of 6.5.*, and try to get 6.6 out the door
   soon instead.

I am not eager to hurry 6.6 along --- I have a lot of half-done work
in the planner/optimizer that I'd like to finish for 6.6.  Perhaps
choice #2 is the way to go.  Comments?
        regards, tom lane

