Shared row locking, revisited - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | Shared row locking, revisited |
Date | |
Msg-id | 20050407021731.GA25320@dcc.uchile.cl Whole thread Raw |
List | pgsql-hackers |
Hackers, Again I come back to this topic. Now I have a better hold on an idea that is hopefully implementable and doesn't have expensive performance penalties. Forget the idea of using the regular lock manager directly on tuples. It's folly because we can't afford to have that many locks. Instead we will lock on the transactions that are involved in a shared row lock. We will have a cluster-wide counter, call it MultiXactId for the sake of this explanation. When someone wants to share lock an (unlocked) tuple, he creates a new MultiXactId, a bit is set in t_infomask (HEAP_XMAX_FOR_SHARE, say), and the tuple's Xmax is set to the new MultiXactId. We will have two additional SLRU areas; one will contain offsets to the other, so that the second can store variable length arrays. In the first (call it pg_offsets) we will use the MultiXactId as key, and the "last offset" of the second as value. The second SLRU area (call it pg_multixact_members) is accessed using the offset from pg_offsets, and contains the members of the respective MultiXactId. So, returning to the locker: it grabbed the MultiXactId; it records an offset of (previous offset + 1) in pg_offsets, and its own TransactionId in that offset of pg_multixact_members. This was a locker of an unlocked tuple. But if it finds that the tuple is already locked, it gets the MultiXactId from the tuple, goes to pg_offsets and from there to pg_multixact_members; it then knows what transactions are locking this tuple. It can check which of those are still running and discard those that aren't. So at this point it has a list of TransactionIds, to which it adds itself, and stores it as a new MultiXactId following the procedure outlined above. Possible optimization: keep a list of the MultiXactIds the current TransactionId is member of (this can be done in local memory). If at some point it is going to create a new MultiXactId first check this list to see if the same members can be found, and reuse the MultiXactId if so. Thus we save a MultiXactId (useful trick for when somebody wants to lock a whole table, for example). Sleeping is done using XactLockTableWait() on each of the members of the MultiXactId. If it wakes after sleeping on all of them only to find that the MultiXactId has changed, it will have to sleep again. This can cause starvation if more shared lockers come while it is sleeping. Not sure how to solve this (and I think it's an important problem to solve.) One way would be locking the MultiXactId itself so that nobody can change it. Note that we do this only for shared locks. For exclusive lock, using the TransactionId as is done in the current code is OK. At transaction start, the current value of the current MultiXactId variable is saved. For cleanup, we can truncate the pg_offsets table at the MultiXactId previous to the youngest one saved; and pg_multixact_members, at whatever pg_offsets has in the truncating position. Because we can't reuse MultiXactIds at system crash (else we risk taking an Id which is already stored in some tuple), we need to XLog it. Not at the locking operation, because we don't want to log that one (too expensive.) We can log the current value of MultiXactId counter once in a while; say, one each (say) 1000 acquisitions. Following a crash, the value is restored to the last one logged + 1000. (I think this can be a problem if we log one acquisition, then write some tuples, and then crash, without flushing the acquisition logged. Maybe a solution is to force a flush after logging an acquisition.) Comments are welcome. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "El miedo atento y previsor es la madre de la seguridad" (E. Burke)
pgsql-hackers by date: