Shared row locking - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | Shared row locking |
Date | |
Msg-id | 20041217005414.GA6180@surnet.cl |
List | pgsql-hackers |
Hi, I've been thinking about how to do shared row locking. These are some very preliminary ideas on the issue. Please comment, particularly if any part of it sounds unworkable or too incomplete.

There are several problems to be solved here: the grammar, the internal SelectStmt representation, how to store and share the info between backends, how to clean up at transaction end, and how to clean up at backend crash.

The Grammar
===========

The SQL spec does not say anything in this respect (that I can find). It only talks of "FOR UPDATE" and "FOR READ ONLY". However, because the FK code uses SPI to do the locking, we definitely have to expose the functionality through SQL. So I think we need a new clause, which I propose to be "FOR SHARE".

The Parser and SelectStmt
=========================

The parser uses the for_update_clause and opt_for_update_clause nonterminals. I assume it's best to change them to a (new) locking_clause, which can in turn be for_update_clause or a (new) for_share_clause.

SelectStmt currently has a forUpdate field (a List of the to-be-locked tables, or an empty list meaning all of them). We could simply add another list, say forShare, or use a common list and a flag saying whether it's one or the other. I prefer adding a new list. (Same with the Query node.) A rough sketch of the node change is included after the storage proposals below.

How to Store the Info
=====================

This is the really interesting part. I have two ideas, one mine (btree) and the other Bruce's (bitmap). Rough sketches of both follow the two descriptions.

Using a B-tree
--------------

When a backend wants to lock a tuple, it sets a bit in its infomask. Then it inserts into a btree in a special tablespace, using RelationId-BlockNumber-OffsetNumber as the key and BackendId-TransactionId as the value; actually, an array with a single element containing those two values.

When a backend wants to lock a tuple that is already locked, it goes to the btree and inserts itself into the array. To do this, it inserts a new index item (the enlarged array) and deletes the previous one. No other backend can try to insert simultaneously (which would cause an ugly race condition), because we hold an exclusive lock on the tuple's heap page's buffer.

At transaction end, nothing special happens (tuples are not unlocked explicitly).

When someone wants to know whether the tuple is locked (to mark it FOR UPDATE, or to delete it), it checks the infomask. If that says it's locked, it goes to check the btree. If the array contains only BackendId-TransactionId pairs that are no longer running, then the tuple is not locked and can be deleted/marked (and the btree can be cleaned up). Else, it will have to wait, using XactLockTableWait, for the first transaction in the array that is still running. We can be sure that no one will try to share-lock the tuple while we check the btree, because we hold an exclusive lock on the tuple's heap page's buffer.

Note that to check whether a transaction is running we need to lock SInvalLock. To minimize the time we hold it, we save the BackendId so we don't have to scan the whole shmInvalBuffer->procState array, only the item we need to look at. Another possibility would be to use stock TransactionIdIsInProgress and save the extra 4 bytes of storage.

At server restart, the btree is created empty (or just deleted). There is one btree per database.

Using a Bitmap
--------------

First we create a counter, the shared lock row counter. Then we create a file like pg_clog, in which each counter slot has a bit for every backend. When we want to share-lock a row, we increment the counter, put that counter value on the row, and set our backend's bit in the new file. We also store that counter value in our backend-local memory. On commit we go through that local memory list and reset all our bits for those counters. When a counter slot has all zeros, it can be recycled like we do with pg_clog.
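Before the open problems, a few rough sketches to make the above concrete; none of this is real backend code. Here is the SelectStmt change, assuming the same convention as forUpdate (a List of relation names, an empty list meaning all of them); every other field is elided, so this is not the actual parsenodes.h definition.

```c
/* Sketch only -- everything except forUpdate/forShare is elided. */
typedef struct SelectStmt
{
	NodeTag		type;
	/* ... target list, FROM/WHERE clauses, sort/limit, etc. ... */
	List	   *forUpdate;		/* FOR UPDATE: relation names; empty = all */
	List	   *forShare;		/* proposed FOR SHARE list, same convention */
} SelectStmt;
```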
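Here is a possible entry layout for the btree idea. The type names are made up, and the helper uses the stock TransactionIdIsInProgress variant rather than the BackendId shortcut; the usual backend headers are assumed for the basic types.

```c
/*
 * Illustrative sketch only, not backend code.  Key = which tuple is locked;
 * value = growable array of lockers.
 */
#include "postgres.h"
/* plus the usual headers for BlockNumber, OffsetNumber, BackendId, etc. */

typedef struct RowLockTag			/* btree key */
{
	Oid			relid;				/* RelationId */
	BlockNumber	blkno;
	OffsetNumber offnum;
} RowLockTag;

typedef struct RowLocker
{
	BackendId	backendId;			/* saved so we need not scan all of procState */
	TransactionId xid;
} RowLocker;

typedef struct RowLockEntry			/* btree value: grows by one per new locker */
{
	int			nlockers;
	RowLocker	lockers[1];			/* variable-length array */
} RowLockEntry;

/*
 * Would be called with the tuple's heap page buffer exclusively locked, by
 * someone who wants to delete or FOR UPDATE-mark the tuple.  Returns an xid
 * to wait on (via XactLockTableWait), or InvalidTransactionId if all lockers
 * are gone and the entry can be cleaned up.
 */
static TransactionId
RowLockBlockingXid(RowLockEntry *entry)
{
	int			i;

	for (i = 0; i < entry->nlockers; i++)
	{
		if (TransactionIdIsInProgress(entry->lockers[i].xid))
			return entry->lockers[i].xid;
	}
	return InvalidTransactionId;
}
```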
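And the bitmap idea's bit arithmetic: mapping a counter value plus a backend number to a bit in a pg_clog-style file. The names, the 8 kB page size, and the contiguous packing are assumptions, not existing backend symbols.

```c
#include <stdint.h>

/*
 * Illustrative only: locate the bit for (counter value, backend id) in a
 * pg_clog-style bitmap file.  Each counter slot uses max_backends bits,
 * packed contiguously; pages are assumed to be 8 kB.
 */
#define ROWLOCK_PAGE_BITS	(8192 * 8)

typedef struct RowLockBitPos
{
	uint64_t	pageno;				/* page of the file */
	uint32_t	byteno;				/* byte within the page */
	uint32_t	bitno;				/* bit within the byte */
} RowLockBitPos;

static RowLockBitPos
rowlock_bit_position(uint64_t counter, unsigned backend_id, unsigned max_backends)
{
	uint64_t	bitpos = counter * max_backends + backend_id;
	RowLockBitPos pos;

	pos.pageno = bitpos / ROWLOCK_PAGE_BITS;
	pos.byteno = (uint32_t) ((bitpos % ROWLOCK_PAGE_BITS) / 8);
	pos.bitno = (uint32_t) (bitpos % 8);
	return pos;
}
```

One consequence visible from this arithmetic: each counter value costs roughly max_connections/8 bytes of file space until it can be recycled, which is where the large-max_connections question below comes from.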
Problems and random comments
============================

There is a possibility of starvation, if somebody wants to lock a tuple exclusively and shared lockers keep arriving all the time. Not sure how to solve this.

The wakeup mechanism is not discussed ... is there a special need for something beyond what we can do with XactLockTable{Insert,Wait}?

Thanks to tablespaces, it's very easy to create special Relations that can be dealt with by the standard buffer and storage managers, etc.

The btree idea:

- does not need crash recovery. Maybe we could use a stripped-down version of nbtree. This could cause a maintainability nightmare.
- can't hold more than 300 or so simultaneous lockers (because of the value length, limited to 1/3 of a page). I doubt this is a real problem.
- could have problems (excessive storage requirements) in the long run because of empty or almost-empty pages.

The bitmap idea:

- seems to have lower overhead.
- can use the same lazy cleanup mechanism described for the btree idea (in which case we don't need the list in local memory).
- What can happen in the presence of large max_connections settings? Is this a real problem?

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Oh, oh, the Galacian girls will do it for pearls,
and those of Arrakis for water!
But if you seek ladies who burn like flames,
try a daughter of Caladan!"  (Gurney Halleck)