Re: Concurrent VACUUM: first results - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Concurrent VACUUM: first results
Date
Msg-id 1964.943758264@sss.pgh.pa.us
In response to Concurrent VACUUM: first results  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses RE: [HACKERS] Re: Concurrent VACUUM: first results  ("Hiroshi Inoue" <Inoue@tpf.co.jp>)
List pgsql-hackers
I have committed the code change to remove pg_vlock locking from VACUUM.
It turns out the problems I was seeing initially were all due to minor
bugs in the lock manager and vacuum itself.

> 1. You can run concurrent "VACUUM" this way, but concurrent "VACUUM
> ANALYZE" blows up.  The problem seems to be that "VACUUM ANALYZE"'s
> first move is to delete all available rows in pg_statistic.

The real problem was that VACUUM ANALYZE tried to delete those rows
*while it was outside of any transaction*.  If there was a concurrent
VACUUM inserting tuples into pg_statistic, the new VACUUM would end up
calling XactLockTableWait() with an invalid XID, which caused a failure
inside lock.c --- and the failure path neglected to release the spinlock
on the lock table.  This was compounded by lmgr.c not bothering to check
the return code from LockAcquire().  So the lock apparently succeeded,
and then all the backends would die with "stuck spinlock" next time they
tried to do any lockmanager operations.

I have fixed the simpler aspects of the problem by adding missing
SpinRelease() calls to lock.c, making lmgr.c test for failure, and
altering VACUUM to not do the bogus row deletion.  But I suspect that
there is more to this that I don't understand.  Why does calling
XactLockTableWait() with an already-committed XID cause the following
code in lock.c to trigger?  Is this evidence of a logic bug in lock.c,
or at least of inadequate checks for bogus input?
       /*
        * Check the xid entry status, in case something in the ipc
        * communication doesn't work correctly.
        */
       if (!((result->nHolding > 0) && (result->holders[lockmode] > 0)))
       {
           XID_PRINT_AUX("LockAcquire: INCONSISTENT ", result);
           LOCK_PRINT_AUX("LockAcquire: INCONSISTENT ", lock, lockmode);
           /* Should we retry ? */
           SpinRelease(masterLock);   <<<<<<<<<<<< just added by me
           return FALSE;
       }

> 3. I tried running VACUUMs in parallel with the regress tests, and saw
> a lot of messages like
> NOTICE:  Rel tenk1: TID 1/31: InsertTransactionInProgress 29737 - can't shrink relation

Hiroshi pointed out that this was probably due to copy.c releasing the
lock prematurely on the table that is the destination of a COPY.  With
that fixed, I get many fewer of these messages, and they're all for
system relations --- which is to be expected.  Since we don't hold locks
for system relations until xact commit, it's possible for VACUUM to see
uncommitted tuples when it is vacuuming a system relation.  So I think
an occasional message like the above is OK as long as it mentions a
system relation.

I have been running multiple concurrent vacuums in parallel with the
regress tests, and things seem to mostly work.  Quite a few regress
tests erratically "fail" under this load because they emit results in
different orders than the expected output shows --- not too surprising
if a VACUUM has come by and reordered a table.

I am still seeing occasional glitches; for example, one vacuum failed
with

NOTICE:  FlushRelationBuffers(onek, 24): block 40 is referenced (private 0, global 1)
FATAL 1:  VACUUM (vc_rpfheap): FlushRelationBuffers returned -2
pqReadData() -- backend closed the channel unexpectedly.

I believe that these errors are not specifically caused by concurrent
vacuums, but are symptoms of locking-related bugs that we still have
to flush out and fix (cf. discussions on pg_hackers around 9/19/99).
So I've gone ahead and committed the change to allow concurrent
vacuums.
        regards, tom lane

