Re: foreign key locks, 2nd attempt - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: foreign key locks, 2nd attempt
Date
Msg-id 1331060319-sup-2769@alvh.no-ip.org
Whole thread Raw
In response to Re: foreign key locks, 2nd attempt  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: foreign key locks, 2nd attempt  (Simon Riggs <simon@2ndQuadrant.com>)
Re: foreign key locks, 2nd attempt  (Robert Haas <robertmhaas@gmail.com>)
Re: foreign key locks, 2nd attempt  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
Excerpts from Simon Riggs's message of lun mar 05 16:34:10 -0300 2012:

> It does however, illustrate my next review comment which is that the
> comments and README items are sorely lacking here. It's quite hard to
> see how it works, let along comment on major design decisions. It
> would help myself and others immensely if we could improve that.

Here's a first attempt at a README illustrating this.  I intend this to
be placed in src/backend/access/heap/README.tuplock; the first three
paragraphs are stolen from the comment in heap_lock_tuple, so I'd remove
those from there, directing people to this new file instead.  Is there
something that you think should be covered more extensively (or at all)
here?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Locking tuples
--------------

Because the shared-memory lock table is of finite size, but users could
reasonably want to lock large numbers of tuples, we do not rely on the
standard lock manager to store tuple-level locks over the long term.  Instead,
a tuple is marked as locked by setting the current transaction's XID as its
XMAX, and setting additional infomask bits to distinguish this usage from the
more normal case of having deleted the tuple.  When multiple transactions
concurrently lock a tuple, a MultiXact is used; see below.

When it is necessary to wait for a tuple-level lock to be released, the basic
delay is provided by XactLockTableWait or MultiXactIdWait on the contents of
the tuple's XMAX.  However, that mechanism will release all waiters
concurrently, so there would be a race condition as to which waiter gets the
tuple, potentially leading to indefinite starvation of some waiters.  The
possibility of share-locking makes the problem much worse --- a steady stream
of share-lockers can easily block an exclusive locker forever.  To provide
more reliable semantics about who gets a tuple-level lock first, we use the
standard lock manager.  The protocol for waiting for a tuple-level lock is
really
    LockTuple()    XactLockTableWait()    mark tuple as locked by me    UnlockTuple()

When there are multiple waiters, arbitration of who is to get the lock next
is provided by LockTuple().  However, at most one tuple-level lock will
be held or awaited per backend at any time, so we don't risk overflow
of the lock table.  Note that incoming share-lockers are required to
do LockTuple as well, if there is any conflict, to ensure that they don't
starve out waiting exclusive-lockers.  However, if there is not any active
conflict for a tuple, we don't incur any extra overhead.

We provide four levels of tuple locking strength: SELECT FOR KEY UPDATE is
super-exclusive locking (used to delete tuples and more generally to update
tuples modifying the values of the columns that make up the key of the tuple);
SELECT FOR UPDATE is a standards-compliant exclusive lock; SELECT FOR SHARE
implements shared locks; and finally SELECT FOR KEY SHARE is a super-weak mode
that does not conflict with exclusive mode, but conflicts with SELECT FOR KEY
UPDATE.  This last mode implements a mode just strong enough to implement RI
checks, i.e. it ensures that tuples do not go away from under a check, without
blocking when some other transaction that want to update the tuple without
changing its key.

The conflict table is:
               KEY UPDATE        UPDATE        SHARE        KEY SHARE
KEY UPDATE       conflict        conflict      conflict      conflict
UPDATE           conflict        conflict      conflict
SHARE            conflict        conflict
KEY SHARE        conflict

When there is a single locker in a tuple, we can just store the locking info
in the tuple itself.  We do this by storing the locker's Xid in XMAX, and
setting hint bits specifying the locking strength.  There is one exception
here: since hint bit space is limited, we do not provide a separate hint bit
for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
that case.  (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
presumably more commonly used due to being the standards-mandated locking
mechanism, or heavily used by the RI code, so we want to provide fast paths
for those.)

MultiXacts
----------

A tuple header provides very limited space for storing information about tuple
locking and updates: there is room only for a single Xid and a small number of
hint bits.  Whenever we need to store more than one lock, we replace the first
locker's Xid with a new MultiXactId.  Each MultiXact provides extended locking
data; it comprises an array of Xids plus some flags bits for each one.  The
flags are currently used to store the locking strength of each member
transaction.  (The flags also distinguish a pure locker from an actual
updater.)

In earlier PostgreSQL releases, a MultiXact always meant that the tuple was
locked in shared mode by multiple transactions.  This is no longer the case; a
MultiXact may contain an update or delete Xid.  (Keep in mind that tuple locks
in a transaction do not conflict with other tuple locks in the same
transaction, so it's possible to have otherwise conflicting locks in a
MultiXact if they belong to the same transaction).

Note that each lock is attributed to the subtransaction that acquires it.
This means that a subtransaction that aborts is seen as though it releases the
locks it acquired; concurrent transactions can then proceed without having to
wait for the main transaction to finish.  It also means that a subtransaction
can upgrade to a stronger lock level than an earlier transaction had, and if
the subxact aborts, the earlier, weaker lock is kept.

The possibility of having an update within a MultiXact means that they must
persist across crashes and restarts: a future reader of the tuple needs to
figure out whether the update committed or aborted.  So we have a requirement
that pg_multixact needs to retain pages of its data until we're certain that
the MultiXacts in them are no longer of interest.

VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.
This works in the same way that pg_clog segments are removed: we have a
pg_class column that stores the earliest multixact that could possibly be
stored in the table; the minimum of all such values is stored in a pg_database
column.  VACUUM computes the minimum across all pg_database values, and
removes pg_multixact segments older than the minimum.

Hint Bits
---------

The following hint bits are applicable:

- HEAP_XMAX_INVALID Any tuple with this hint bit set does not have a valid value stored in XMAX.

- HEAP_XMAX_IS_MULTI This bit is set if the tuple's Xmax is a MultiXactId (as opposed to a regular TransactionId).

- HEAP_XMAX_LOCK_ONLY This bit is set when the XMAX is a locker only; that is, if it's a multixact, it does not contain
anupdate among its members.  It's set when the XMAX is a plain Xid that locked the tuple, as well. 

- HEAP_XMAX_KEYSHR_LOCK
- HEAP_XMAX_EXCL_LOCK These bits indicate the strength of the lock acquired; they are useful when the XMAX is not a
MultiXactId. If it's a multi, the info is to be found in the member flags.  If HEAP_XMAX_IS_MULTI is not set and
HEAP_XMAX_LOCK_ONLYis set, then one of these *must* be set as well. Note there is no hint bit for a SELECT FOR SHARE
lock. Also there is no separate bit for a SELECT FOR KEY UPDATE lock; this is implemented by the
HEAP_UPDATE_KEY_REVOKEDbit. 

- HEAP_UPDATE_KEY_REVOKED This bit lives in t_infomask2.  If set, indicates that a transaction updated this tuple and
changedthe key values, or a transaction deleted the tuple. It's set regardless of whether the XMAX is a TransactionId
ora MultiXactId. 

We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
is set.


pgsql-hackers by date:

Previous
From: Marko Kreen
Date:
Subject: Re: [9.2] Confusion over CacheRegisterSyscacheCallback
Next
From: Simon Riggs
Date:
Subject: Re: foreign key locks, 2nd attempt