RE: [HACKERS] Re: Concurrent VACUUM: first results - Mailing list pgsql-hackers

From Hiroshi Inoue
Subject RE: [HACKERS] Re: Concurrent VACUUM: first results
Date
Msg-id 001601bf3a01$47d0ae60$2801007e@cadzone.tpf.co.jp
In response to Re: Concurrent VACUUM: first results  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Re: Concurrent VACUUM: first results  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
> 
> I have committed the code change to remove pg_vlock locking from VACUUM.
> It turns out the problems I was seeing initially were all due to minor
> bugs in the lock manager and vacuum itself.
> 
> > 1. You can run concurrent "VACUUM" this way, but concurrent "VACUUM
> > ANALYZE" blows up.  The problem seems to be that "VACUUM ANALYZE"'s
> > first move is to delete all available rows in pg_statistic.
> 
> The real problem was that VACUUM ANALYZE tried to delete those rows
> *while it was outside of any transaction*.  If there was a concurrent
> VACUUM inserting tuples into pg_statistic, the new VACUUM would end up
> calling XactLockTableWait() with an invalid XID, which caused a failure

Hmm, what I could see here was always LockRelation(.., RowExclusiveLock).
But the cause may be the same.
We cannot get the xid of a backend that is not running a *transaction*,
because its proc->xid is set to 0 (InvalidTransactionId). So the blocking
transaction cannot find an XidLookupEnt in xidTable corresponding to the
not-running *transaction* when it calls LockResolveConflicts() in
LockReleaseAll(), and therefore cannot GrantLock() to that XidLookupEnt.
As a result, LockAcquire() from a not-running *transaction* always fails
once it is blocked.
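
To make the failure mode above concrete, here is a minimal toy sketch. It is not the lock.c code: ToyXidTable, toy_xid_find() and toy_grant_to_waiter() are hypothetical names, and the sketch simply assumes a table keyed by xid in which 0 (the invalid xid) can never be used as a key. Under that assumption, a releaser can never locate the entry of a waiter whose xid is 0, so the grant never happens.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; this is not the PostgreSQL API. */
typedef unsigned int ToyXid;
#define TOY_INVALID_XID 0u
#define TOY_TABLE_SIZE 4

typedef struct {
    ToyXid xid;      /* lookup key */
    int    granted;  /* set to 1 once the lock has been granted */
} ToyXidEntry;

typedef struct {
    ToyXidEntry entries[TOY_TABLE_SIZE];
    int         used;
} ToyXidTable;

/* Register an entry; assumption: an invalid xid cannot be a key. */
static ToyXidEntry *toy_xid_insert(ToyXidTable *t, ToyXid xid)
{
    if (xid == TOY_INVALID_XID || t->used >= TOY_TABLE_SIZE)
        return NULL;
    t->entries[t->used].xid = xid;
    t->entries[t->used].granted = 0;
    return &t->entries[t->used++];
}

/* Look up an entry by xid; returns NULL when no entry matches. */
static ToyXidEntry *toy_xid_find(ToyXidTable *t, ToyXid xid)
{
    if (xid == TOY_INVALID_XID)
        return NULL;
    for (int i = 0; i < t->used; i++)
        if (t->entries[i].xid == xid)
            return &t->entries[i];
    return NULL;
}

/* Called by the releasing side: returns 1 if the waiter was granted. */
static int toy_grant_to_waiter(ToyXidTable *t, ToyXid waiter_xid)
{
    ToyXidEntry *e = toy_xid_find(t, waiter_xid);
    if (e == NULL)
        return 0;    /* no entry found, so nothing can be granted */
    e->granted = 1;
    return 1;
}
```

In this sketch a waiter registered under a real xid is granted normally, while a waiter whose xid is TOY_INVALID_XID can never be found and stays blocked.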

> I have fixed the simpler aspects of the problem by adding missing
> SpinRelease() calls to lock.c, making lmgr.c test for failure, and
> altering VACUUM to not do the bogus row deletion.  But I suspect that
> there is more to this that I don't understand.  Why does calling
> XactLockTableWait() with an already-committed XID cause the following

That seems strange.  Isn't it waiting for a tuple that is being deleted
by vc_updstats() in vc_vacone()?

> code in lock.c to trigger?  Is this evidence of a logic bug in lock.c,
> or at least of inadequate checks for bogus input?
> 
>         /*
>          * Check the xid entry status, in case something in the ipc
>          * communication doesn't work correctly.
>          */
>         if (!((result->nHolding > 0) && (result->holders[lockmode] > 0)))
>         {
>             XID_PRINT_AUX("LockAcquire: INCONSISTENT ", result);
>             LOCK_PRINT_AUX("LockAcquire: INCONSISTENT ", lock, lockmode);
>             /* Should we retry ? */
>             SpinRelease(masterLock);   <<<<<<<<<<<< just added by me
>             return FALSE;
>         }
>

This is the third time I have ended up at this check, and each time it
was caused by other bugs.
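
The SpinRelease() added in the quoted code follows a general rule: every return path taken while a lock is held must release it first. The following is only a sketch of that rule under stated assumptions. ToySpin, ToyHolder and toy_verify_grant() are hypothetical stand-ins, not the real spinlock or lock.c structures; only the nHolding and holders[] field names are borrowed from the quoted check.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a spinlock: records only whether it is held. */
typedef struct {
    int held;
} ToySpin;

#define TOY_MAX_MODES 8

/* Hypothetical holder record with the two fields the quoted check tests. */
typedef struct {
    int nHolding;
    int holders[TOY_MAX_MODES];
} ToyHolder;

static void toy_spin_acquire(ToySpin *s)
{
    assert(s->held == 0);   /* a double acquire would hang a real spinlock */
    s->held = 1;
}

static void toy_spin_release(ToySpin *s)
{
    assert(s->held == 1);
    s->held = 0;
}

/*
 * Verify the holder entry while holding the master lock.
 * Both the failure path and the success path release the lock
 * before returning, so no caller is left with a stuck lock.
 */
static bool toy_verify_grant(ToySpin *master, const ToyHolder *h, int lockmode)
{
    toy_spin_acquire(master);

    if (lockmode < 0 || lockmode >= TOY_MAX_MODES ||
        !(h->nHolding > 0 && h->holders[lockmode] > 0))
    {
        toy_spin_release(master);   /* the release that is easy to forget */
        return false;
    }

    toy_spin_release(master);
    return true;
}
```

Without the release on the failure path, the next toy_spin_acquire() on the same lock would trip its assertion, which is the toy equivalent of every later backend hanging on masterLock.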

Regards,

Hiroshi Inoue
Inoue@tpf.co.jp


