Thread: Notes on lock table spilling
Hackers, I'm seeing what can I do about spilling the lock table to disk I welcome comments on the ideas outlined here. If anyone sees a showstopper please let me know. Notes on Spilling the Lock Table to Disk ---------------------------------------- The problem being solved here is the inherent limitation of the lock table to grow in size beyond whatever fits into shared memory. This has not being a problem until now, but will be as soon as we implement locking tuples using it rather than marking the tuples on disk. At the same time solving this problem will remove the max_locks_per_transaction limit. On this new design there are two hash tables in shared memory. The first one is the heavy, complete table holding all information about locks (the table we have now). In this table, the key is LOCKTAG (4 bytes) and the value is LOCK (132 bytes total on my system). The second table is a "placeholder table", with LOCKTAG as key and a pointer to where the entry can be found on disk as value (8 bytes total on my system). This place on disk would be based on the slru.c code that is already used in pg_clog and pg_subtrans. A would-be locker which needs to check whether the lock has been taken, first searches in the main hash table. If the lock is found, the second table needs not be scanned. On the other hand, if the lock is not found, it checks a boolean that indicates whether there is something on the second table, and searches that if needed. If it finds the lock there, it must be extracted, deleted from the secondary table and put into the main table. One problem is that the current shmem allocator is unable to return memory to the shared memory pool. So if we fill the shared memory with the primary table, there won't be space for the secondary table when we need it. A way around this is preallocating space for both tables (or maybe only the second). Another would be creating a freelist for shmem, but it hasn't been worth the trouble all these years ... Memory Requirements ------------------- The current default settings allow for 6400 locks (64 locks per transaction, 100 max connections). This is about 850 kB on my system, or around 103 BLCKSZ pages. With the new design we would use 8 BLCKSZ pages for the SLRU subsystem, 64 kB for the secondary table (which allows for ~ 8192 placeholders) and 640 kB for the main table (which allows for ~ 5000 locks). So it would use the same amount of memory as now, and it could hold around 13000 locks. Or if we could use 128 kB for the secondary table and 576 kB for the main table, and it will hold 20000 locks. This is without increasing the current memory requirements; on a normal system we could use three or four times that amount without much problem. Maybe it can be made configurable. SLRU Handling ------------- On handling the SLRU system, we can keep a bitmap at the start of each page indicating which entries are used. Thus we don't have to scan the whole page to look for an empty place to put a new entry. We can keep track of pages with free space with a single page number in memory and another page number at the start of each page (which records what the page number in memory was when an item was deleted from this page). So the page number in memory always has space, and when it fills, its pointer goes to the page that was previously being filled. We still need a way to compact this, so the log can be truncated. It's not clear to me how to do that however, because we have to move the entry on disk and at the same time change the pointer in the secondary hash table. Strategy -------- One further question is what locks to move out of the main table. It seems the easiest way is using LRU. I think we need to use two LRU queues, one for tuple locks and other for the rest (and moving tuple locks first). Thus heavy deletes don't block the rest of the operations. Further Problems ---------------- We have a problem as soon as somebody tries to delete a lot of rows from a big table. We cannot possibly extend the memory requirements forever, so we need to spill to disk without having an in-shared-memory index. This is bad, because performance will suck. One idea would be to store the remaining tuple locks on a file on disk, one file per relation. This way, a transaction doing a heavy DELETE does not need to make other concurrent transactions come to a halt (but it will cause trouble to other transactions working on the same table). I'm open to better ideas on this. The locking strategy for SLRU and the secondary hash table handling should be explained in more detail. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) FOO MANE PADME HUM
Alvaro Herrera <alvherre@dcc.uchile.cl> writes: > We have a problem as soon as somebody tries to delete a lot of rows from > a big table. We cannot possibly extend the memory requirements forever, > so we need to spill to disk without having an in-shared-memory index. Yes. I'm not sure that I see the point of the in-memory index at all... there is some intermediate regime where it would improve performance, but it surely does not solve the basic problem that shared memory is finite. Maybe something involving lossy storage would work? Compare recent discussions about lossy bitmaps generated from index scans. regards, tom lane
On Thu, Mar 31, 2005 at 12:19:08AM -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@dcc.uchile.cl> writes: > > We have a problem as soon as somebody tries to delete a lot of rows from > > a big table. We cannot possibly extend the memory requirements forever, > > so we need to spill to disk without having an in-shared-memory index. > > Yes. I'm not sure that I see the point of the in-memory index at all... > there is some intermediate regime where it would improve performance, > but it surely does not solve the basic problem that shared memory is > finite. My idea was that it would help eliminate I/O. But probably you are right that it may be the wrong idea; probably it's better to have an on-disk index for the on-disk storage of LOCK (we don't want sequential scanning of an on-disk lock array, do we?). If it's cached in memory, then no I/O would be needed until we started recording a lot of locks, thus achieving the same effect with simpler code and better degradation. I'm thinking in sketching some sort of simple btree on top of slru pages. Nothing concrete yet. > Maybe something involving lossy storage would work? Compare recent > discussions about lossy bitmaps generated from index scans. Hmm. Some problems come to mind: - How to decide when to use lossy storage. Perhaps if the executor (or rather, the planner) thinks it is about to acquirelots of tuple locks, then it could hint the lock manager. - How to unlock? (or: how to know when the lock is released for sure?) I think this can be solved by counting lockers/unlockers. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "Nadie esta tan esclavizado como el que se cree libre no siendolo" (Goethe)
On Wed, 2005-03-30 at 18:09 -0400, Alvaro Herrera wrote: > I'm seeing what can I do about spilling the lock table to disk > I welcome comments on the ideas outlined here. If anyone sees a > showstopper please let me know. Perhaps a little delayed, but yes, I have major reservations about the whole concept of spilling the lock table to disk. If you implement this, I would very much like a switch to be able to turn it off, somehow. All else you've done about locking sounds great and I don't want my comments here to detract from my general appreciation of your work and the project as a whole. IIRC there is not another major system that spills locks to disk and there's a big reason: performance is very poor. Other systems accept some limitations in order to avoid that. Oracle takes locks and holds them within the block itself, which then need to have block cleanup performed on them next time they're read. DB2 has a lock table which doesn't grow in size or spill to disk, ever. [It does do lock escalation, which isn't too great either - but this will only be granted if no other users conflict, so its not too bad a solution.] DB2 even goes to great lengths to avoid this by offering additional locking modes of Cursor Stability (CS) - which only locks the rows currently being viewed or on which a cursor is currently placed. DB2 would call locking everything Repeatable Read mode (RR) which performs so badly in most situations people don't use it - and thats even without spilling to disk. Oracle limits the number of blocks that will have copies of them held in the Undo tablespace, so there is an effective limit on number of locks that can be taken there also. I would be more inclined to implement a rolling window lock similar to CS, so you never run out of lock table space. That can give you issues, sure, but not performance ones. That idea may have flaws also: I do not wish at this stage to back an alternative, just to say that the spilling to disk idea sounds like it has big problems, IMHO. If you did go down the route of spilling to disk, I'd would like to make very sure that only the person that asked for hundreds of locks would be effected by all those locks. i.e. only their locks were spilled to disk, so the lock cache is effectively "scan resistant". Other users with modest locking requests would be uneffected. Not sure how you'd do that with the SLRU code as it is. There are no easy answers to this conundrum, but I'm not sure spilling to disk is the answer. Best Regards, Simon Riggs
> On Wed, 2005-03-30 at 18:09 -0400, Alvaro Herrera wrote: > > I'm seeing what can I do about spilling the lock table to disk > > I welcome comments on the ideas outlined here. If anyone sees a > > showstopper please let me know. > > Perhaps a little delayed, but yes, I have major reservations about the > whole concept of spilling the lock table to disk. If you implement this, > I would very much like a switch to be able to turn it off, somehow. me too. I very much like the current behavior, so that a transaction is aborted when lock table space is exhausted. This is pretty flexible because you can always manipulate the lock table size. However, Alvaro's stuff would still be nice when doing massive database changes that are not performance sensitive. By the way Alvaro, are you still planning to reorganize the lock union? Merlin
Simon Riggs <simon@2ndquadrant.com> writes: > DB2 even goes to great lengths to avoid this by offering additional > locking modes of Cursor Stability (CS) - which only locks the rows > currently being viewed or on which a cursor is currently placed. DB2 > would call locking everything Repeatable Read mode (RR) which performs > so badly in most situations people don't use it - and thats even without > spilling to disk. Oracle limits the number of blocks that will have > copies of them held in the Undo tablespace, so there is an effective > limit on number of locks that can be taken there also. I think you're mixing up different things. DB2's Cursor Stability vs Repeatable Read modes are more akin to READ COMMITTED vs SERIALIZABLE. Postgres doesn't need to lock records at all for simply ensuring consistent snapshots. The locks he's working on are the ones used to prevent updates to parent records while an in-progress transaction is inserting or updating a child record in a table with a foreign key dependence. That said, it would be interesting to have an option to set on foreign keys to indicate that no locks are necessary. Perhaps in the form of an option on the parent table that would prevent any primary keys from ever being deleted. -- greg
On Mon, Apr 04, 2005 at 07:08:11PM +0100, Simon Riggs wrote: > IIRC there is not another major system that spills locks to disk and > there's a big reason: performance is very poor. Other systems accept > some limitations in order to avoid that. Oracle takes locks and holds > them within the block itself, which then need to have block cleanup > performed on them next time they're read. DB2 has a lock table which > doesn't grow in size or spill to disk, ever. We can have the in-memory hash table grow to be "enough for most applications"; say, have it store 10000 locks without resorting to spilling. As a secondary storage there would be the in-memory buffers for the spill area; comfortably another 5000 locks. Because there is no need to have the lock table persist across system crashes (except in case we implement 2PC), there's no need to fsync() or XLog this information. So you would not touch disk until you have acquired 15000 locks or so. This doesn't really address your concern, because for spilling logic to work we may need to evict any transaction's lock, not just our own. Thus we can't guarantee that only the heavy locker's locks are spilled (spilt?) Now, I talked to Bruce via IM the other day and he doesn't like the idea of using the regular lock manager to lock tuples. It's too heavy, he says. So he proposed using "phantom Xids" instead. (Some of you may remember the phantom CommandIds back in the subtransactions day; these are quite different, so they have none of the problems the others had.) This idea is explained below. In that chat, we also commented on a third possible solution: using a lock manager process, instead of using shared memory for locks. I'm not sure if this is really workable; there needs to be some means of communication beyond simple signals for this to be effective, I think. Bruce didn't seems very fond of this idea; at least he seemed way more attracted to the phantoms. So, we now have four ideas for solving the shared-row locking problem: A. Using the lock manager to lock tuples. This requires some sort of spill-to-disk logic. B. Not using the lock manager to lock tuples. This can be done by using Bruce's Phantom Xids idea. C. Using a hybrid approach. This can be done using the unused space in heap pages, as proposed by Paul Tillotson, and fallingback to the lock manager. Thus we still need some spilling logic. I'm still not clear how would this work giventhe current page layout and XLog requirements. D. Using a lock manager process. With a separate process, we don't need any of this, because it can use the amount of memoryit sees fit. We may have some communicational overhead which may make the idea unfeasible. But of course, we wouldretain a shared memory area for "caching" (so we wouldn't need to touch the process until we are holding lots of locks.)Does anyone thinks this is an idea worth pursuing? Or rather, does anyone see huge showstoppers here? Using Phantom Xids ================== The idea here is to use an approach similar to what we use now: mark the tuples with an Xid when it is locked. A phantom Xid is a sort-of Xid, with multiple real Xids associated to it. So we mark the tuple with the regular Xid the first time the share lock is acquired; if a second transaction wants to lock the tuple, it creates a new phantom Xid which "contains" the original Xid in the tuple and its own Xid, insert it into the phantom Xid table, and mark the tuple with that as Xmax. Thus, the operations we need to implement this are: 1. Create new phantom Xid with two real Xids 2. Find the phantom Xid which has some combination of real Xids 3. Add a real Xid to a phantom Xid 4. Cleanup the whole thing Bruce proposes using two hash tables for this: one with the phantom Xid as the key, and other with XOR(real xids) as key. Thus we could implement 1, 2 and 3 easily. Cleanup would happen when somebody else visits the phantom Xid and finds it populated with non-running transactions (it can clean the Xmax marking as well, if the phantom Xid ends up empty), and by VACUUM so that we don't have problems with Xid wraparound. Comments? -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "La vida es para el que se aventura"
On Mon, Apr 04, 2005 at 02:50:30PM -0400, Merlin Moncure wrote: > > Perhaps a little delayed, but yes, I have major reservations about > > the whole concept of spilling the lock table to disk. If you > > implement this, I would very much like a switch to be able to turn > > it off, somehow. > me too. I very much like the current behavior, so that a transaction is > aborted when lock table space is exhausted. This is pretty flexible > because you can always manipulate the lock table size. Hmm. I'll keep that in consideration (if we come to implement lock table spilling.) > By the way Alvaro, are you still planning to reorganize the lock union? Not right now. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) Este mail se entrega garantizadamente 100% libre de sarcasmo.
On Mon, 2005-04-04 at 16:27 -0400, Alvaro Herrera wrote: > On Mon, Apr 04, 2005 at 07:08:11PM +0100, Simon Riggs wrote: > > > IIRC there is not another major system that spills locks to disk and > > there's a big reason: performance is very poor. Other systems accept > > some limitations in order to avoid that. Oracle takes locks and holds > > them within the block itself, which then need to have block cleanup > > performed on them next time they're read. DB2 has a lock table which > > doesn't grow in size or spill to disk, ever. > > We can have the in-memory hash table grow to be "enough for most > applications"; say, have it store 10000 locks without resorting to > spilling. As a secondary storage there would be the in-memory buffers > for the spill area; comfortably another 5000 locks. Because there is no > need to have the lock table persist across system crashes (except in > case we implement 2PC), there's no need to fsync() or XLog this > information. So you would not touch disk until you have acquired 15000 > locks or so. This doesn't really address your concern, because for > spilling logic to work we may need to evict any transaction's lock, not > just our own. Thus we can't guarantee that only the heavy locker's > locks are spilled (spilt?) Well, thanks but still not happy. I'd like to find a way to penalise only the heavy locker (and anybody unlucky enough to request rows that have been locked by them). One idea: if we had an in-memory hash, then an on-disk b-tree we'd be able to limit the number of locks any backend could have in memory but allow locks to spill to disk for them. > So, we now have four ideas for solving the shared-row locking > problem: Five > > A. Using the lock manager to lock tuples. > This requires some sort of spill-to-disk logic. I'm worried about extra LWlock contention which we really don't need. > B. Not using the lock manager to lock tuples. > This can be done by using Bruce's Phantom Xids idea. > > C. Using a hybrid approach. > This can be done using the unused space in heap pages, as proposed by > Paul Tillotson, and falling back to the lock manager. Thus we still > need some spilling logic. I'm still not clear how would this work > given the current page layout and XLog requirements. Well, that might work. Each lock is associated with a transaction, so we can test the lock applicability the same way we read row visibility. Setting a lock would not alter the dirty flag on the block. If a block does get written, we can "vacuum" the old locks next time the lock table is used. If the on-block lock table is full, we know we need to check the on-disk lock structure. > D. Using a lock manager process. With a separate process, we don't need > any of this, because it can use the amount of memory it sees fit. We > may have some communicational overhead which may make the idea > unfeasible. But of course, we would retain a shared memory area for > "caching" (so we wouldn't need to touch the process until we are holding > lots of locks.) Does anyone thinks this is an idea worth pursuing? Or > rather, does anyone see huge showstoppers here? (Another process: hmmm, getting crowded isn't it?) E. Implement a lock mode that doesn't lock all the rows that have been touched, just the current rolling window: less locks, less need to spill to disk, so less need to find really good solution. Perhaps its time to revisit what the benefits are of doing *any* of the above are again. Maybe there's a different way to do just 80% of those requirements with much less pain all round. Best Regards, Simon Riggs
Alvaro Herrera <alvherre@dcc.uchile.cl> writes: > Using Phantom Xids > ================== > The idea here is to use an approach similar to what we use now: mark the > tuples with an Xid when it is locked. A phantom Xid is a sort-of Xid, > with multiple real Xids associated to it. So we mark the tuple with the > regular Xid the first time the share lock is acquired; if a second > transaction wants to lock the tuple, it creates a new phantom Xid which > "contains" the original Xid in the tuple and its own Xid, insert it into > the phantom Xid table, and mark the tuple with that as Xmax. That sounds like a fancy way to describe "make a linked list of lockers". -- greg
On Mon, Apr 04, 2005 at 11:32:47PM -0400, Greg Stark wrote: > > Alvaro Herrera <alvherre@dcc.uchile.cl> writes: > > > Using Phantom Xids > > ================== > > The idea here is to use an approach similar to what we use now: mark the > > tuples with an Xid when it is locked. A phantom Xid is a sort-of Xid, > > with multiple real Xids associated to it. So we mark the tuple with the > > regular Xid the first time the share lock is acquired; if a second > > transaction wants to lock the tuple, it creates a new phantom Xid which > > "contains" the original Xid in the tuple and its own Xid, insert it into > > the phantom Xid table, and mark the tuple with that as Xmax. > > That sounds like a fancy way to describe "make a linked list of lockers". Yeah, that's the idea :-) However I was trying to describe how it would be stored. We have room for only one Xid in the tuple, so in order to store multiple lockers, we invent the concept of an Xid that's tied to multiple transactions. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) Maybe there's lots of data loss but the records of data loss are also lost. (Lincoln Yeoh)