Again I come back to this topic. Now I have a better hold on an idea
that is hopefully implementable and doesn't have expensive performance
Forget the idea of using the regular lock manager directly on tuples.
It's folly because we can't afford to have that many locks. Instead we
will lock on the transactions that are involved in a shared row lock.
We will have a cluster-wide counter, call it MultiXactId for the sake of
this explanation. When someone wants to share lock an (unlocked) tuple,
he creates a new MultiXactId, a bit is set in t_infomask
(HEAP_XMAX_FOR_SHARE, say), and the tuple's Xmax is set to the new
We will have two additional SLRU areas; one will contain offsets to the
other, so that the second can store variable length arrays. In the
first (call it pg_offsets) we will use the MultiXactId as key, and the
"last offset" of the second as value. The second SLRU area (call it
pg_multixact_members) is accessed using the offset from pg_offsets, and
contains the members of the respective MultiXactId.
So, returning to the locker: it grabbed the MultiXactId; it records an
offset of (previous offset + 1) in pg_offsets, and its own TransactionId
in that offset of pg_multixact_members.
This was a locker of an unlocked tuple. But if it finds that the tuple
is already locked, it gets the MultiXactId from the tuple, goes to
pg_offsets and from there to pg_multixact_members; it then knows what
transactions are locking this tuple. It can check which of those are
still running and discard those that aren't. So at this point it has a
list of TransactionIds, to which it adds itself, and stores it as a new
MultiXactId following the procedure outlined above.
Possible optimization: keep a list of the MultiXactIds the current
TransactionId is member of (this can be done in local memory). If at
some point it is going to create a new MultiXactId first check this list
to see if the same members can be found, and reuse the MultiXactId if
so. Thus we save a MultiXactId (useful trick for when somebody wants to
lock a whole table, for example).
Sleeping is done using XactLockTableWait() on each of the members of the
MultiXactId. If it wakes after sleeping on all of them only to find
that the MultiXactId has changed, it will have to sleep again. This can
cause starvation if more shared lockers come while it is sleeping. Not
sure how to solve this (and I think it's an important problem to solve.)
One way would be locking the MultiXactId itself so that nobody can
Note that we do this only for shared locks. For exclusive lock, using
the TransactionId as is done in the current code is OK.
At transaction start, the current value of the current MultiXactId
variable is saved. For cleanup, we can truncate the pg_offsets table at
the MultiXactId previous to the youngest one saved; and
pg_multixact_members, at whatever pg_offsets has in the truncating
Because we can't reuse MultiXactIds at system crash (else we risk taking
an Id which is already stored in some tuple), we need to XLog it. Not
at the locking operation, because we don't want to log that one (too
expensive.) We can log the current value of MultiXactId counter once in
a while; say, one each (say) 1000 acquisitions. Following a crash, the
value is restored to the last one logged + 1000. (I think this can be a
problem if we log one acquisition, then write some tuples, and then
crash, without flushing the acquisition logged. Maybe a solution is to
force a flush after logging an acquisition.)
Comments are welcome.
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"El miedo atento y previsor es la madre de la seguridad" (E. Burke)