FlexLocks - Mailing list pgsql-hackers
From: Robert Haas
Subject: FlexLocks
Msg-id: CA+Tgmob4YE_k5dpO0T07PNf1SOKPybo+wj4m4FryOS7Z4_yOzg@mail.gmail.com
List: pgsql-hackers
I've been noodling around with various methods of reducing ProcArrayLock contention and (after many false starts) I think I've finally found one that works really well. I apologize in advance if this makes your head explode; I think that the design I have here is solid, but it represents a significant and invasive overhaul of the LWLock system - I think for the better, but you'll have to be the judge.

I'll start with the performance numbers (from that good ol' 32-core Nate Boley system). I built from commit f1585362856d4da17113ba2e4ba46cf83cba0cf2, with and without the attached patches, and then ran pgbench on logged and unlogged tables with various numbers of clients, with shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit = off, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, wal_writer_delay = 20ms. The numbers below are (as usual) the median of three five-minute runs at scale factor 100. The lines starting with "m" and a number are that number of clients on unpatched master, and the lines starting with "f" are that number of clients with this patch set. The really big win here is unlogged tables at 32 clients, where throughput has *doubled* and now scales *better than linearly* as compared with the single-client results.

== Unlogged Tables ==

m01 tps = 679.737639 (including connections establishing)
f01 tps = 668.275270 (including connections establishing)
m08 tps = 4771.757193 (including connections establishing)
f08 tps = 4867.520049 (including connections establishing)
m32 tps = 10736.232426 (including connections establishing)
f32 tps = 21303.295441 (including connections establishing)
m80 tps = 7829.989887 (including connections establishing)
f80 tps = 19835.231438 (including connections establishing)

== Permanent Tables ==

m01 tps = 634.424125 (including connections establishing)
f01 tps = 633.450405 (including connections establishing)
m08 tps = 4544.781551 (including connections establishing)
f08 tps = 4556.298219 (including connections establishing)
m32 tps = 9902.844302 (including connections establishing)
f32 tps = 11028.745881 (including connections establishing)
m80 tps = 7467.437442 (including connections establishing)
f80 tps = 11909.738232 (including connections establishing)

A couple of other interesting things are buried in these numbers:

1. Permanent tables don't derive nearly as much benefit as unlogged tables. I believe that this is because, for permanent tables, the major bottleneck is WALInsertLock. Fixing ProcArrayLock squeezes out a healthy 10%, but we'll have to make significant progress on WALInsertLock to get anywhere close to linear scaling.

2. The drop-off between 32 clients and 80 clients is greatly reduced with this patch set; indeed, for permanent tables, tps increased slightly between 32 and 80 clients. I believe the small decrease for unlogged tables is likely due to the fact that, by 80 clients, WALInsertLock starts to become a contention point, due to the need to insert the commit records.

In terms of the actual patches, it's been bugging me for a while that the LWLock code contains a lot of infrastructure that's not easily reusable by other parts of the system. So the first of the two attached patches, flexlock-v1.patch, separates the LWLock code into an upper layer and a lower layer.
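To make the shape of that split concrete, here's a rough sketch. The struct and field names below are illustrative simplifications of my own, not the exact definitions from the patch:

    #include "postgres.h"
    #include "storage/proc.h"   /* PGPROC */
    #include "storage/spin.h"   /* slock_t */

    /*
     * Lower layer: only the state that every lock type needs - a spinlock
     * and a wait queue.  Each lock type embeds this and adds its own state.
     */
    typedef struct FlexLock
    {
        char        locktype;   /* which upper-layer implementation this is */
        slock_t     mutex;      /* spinlock protecting the fields below */
        bool        releaseOK;  /* ok to wake up waiters? */
        PGPROC     *head;       /* head of queue of waiting backends */
        PGPROC     *tail;       /* tail of that queue */
    } FlexLock;

    /*
     * Upper layer: an ordinary LWLock is just the common infrastructure
     * plus shared/exclusive holder counts, much as today.
     */
    typedef struct LWLockSketch
    {
        FlexLock    flex;
        char        exclusive;  /* # of exclusive holders (0 or 1) */
        int         shared;     /* # of shared holders */
    } LWLockSketch;

    /*
     * A specialized lock type, like the ProcArrayLock one in the second
     * patch, can carry extra state - here, a hypothetical list of backends
     * whose end-of-transaction cleanup is deferred to the lock releaser.
     */
    typedef struct ProcArrayLockSketch
    {
        FlexLock    flex;
        char        exclusive;
        int         shared;
        PGPROC     *endingList;
    } ProcArrayLockSketch;

Everything after the embedded FlexLock is per-lock-type state; the wait queue handling and error-handling machinery only needs the common fields.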
The lower layer I called "FlexLocks", and it's designed to allow a variety of locking implementations to be built on top of it, reusing as much of the basic infrastructure as I could figure out how to make reusable without hurting performance too much. LWLocks become the anchor client of the FlexLock system; in essence, most of flexlock.c is code that was removed from lwlock.c.

The second patch, which adds procarraylock.c, uses that infrastructure to define a new type of FlexLock specifically for ProcArrayLock. It basically works like a regular LWLock, except that it has a special operation to optimize ProcArrayEndTransaction(). In the uncontended case, instead of acquiring and releasing the lock, it just grabs the spinlock, observes that there is no contention, clears the critical PGPROC fields (which isn't noticeably slower than updating the state of the lock would be), and releases the spinlock. There's then no need to reacquire the spinlock to "release" the lock; we're done. In the contended case, the backend wishing to end its transaction adds itself to a queue of ending transactions. When ProcArrayLock is released, the last person out clears the PGPROC structures for all the waiters and wakes them all up; they don't need to reacquire the lock, because the work they wished to perform while holding it is already done. Thus, in the *worst* case, ending transactions only need to acquire the spinlock protecting ProcArrayLock half as often (once instead of twice), and in the best case - where, today, backends keep retrying only to repeatedly fail to get the lock - it's far better than that.

Of course, there are ways that this could be implemented without the FlexLock stuff, if people don't like this solution. Myself, I find it quite elegant (though there are certainly arguable points in there where the code could probably be improved), but then again, I wrote it.

For what it's worth, I believe that there are other places where the FlexLock infrastructure could be helpful. In this case, the new ProcArrayLock is very specific to what ProcArrayLock actually does, and can't really be reused for anything else. But I've had a thought that we might want to have a type of FlexLock that contains an LSN. The lock holder advances the LSN and can then release everyone who was waiting for a value <= that LSN, without them needing to reacquire the lock. This could be useful for things like WALWriteLock and sync rep. Also, I think there might be interesting applications for buffer locks, perhaps by having a lock type that manages both content locks and pins. Alternatively, if we want to support CRCs, it might be useful to have a third buffer lock mode in between shared and exclusive. SX would conflict with itself and with exclusive but not with shared, and would be required to write out the page or set hint bits, but not just to examine tuples; this could be used to ensure that the page doesn't change (thus invalidating the CRC) while the write is in progress. I'm not necessarily saying that any of these particular things are what we want to do, just throwing out the idea that we may want a variety of lock types that are similar to lightweight locks but with subtly different behavior, yet with common infrastructure for error handling and wait queue management.

Anyway, this is all up for discussion, argument, etc. - but here are the patches. Comments, ideas, thoughts, code review, and/or testing are appreciated.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
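P.S. If anyone wants a quick mental model of the ProcArrayEndTransaction() optimization without reading the patch, here's a heavily simplified sketch, building on the ProcArrayLockSketch layout above. The function names, the reuse of lwWaitLink as the queue link, and the exact set of PGPROC fields touched are all illustrative; the real code in the patch differs in the details:

    #include "postgres.h"
    #include "access/transam.h"     /* InvalidTransactionId */
    #include "storage/pg_sema.h"    /* PGSemaphoreUnlock */
    #include "storage/proc.h"       /* PGPROC */
    #include "storage/spin.h"       /* SpinLockAcquire/SpinLockRelease */

    /* Called by a backend ending its transaction. */
    static void
    sketch_clear_xact(ProcArrayLockSketch *lock, PGPROC *proc)
    {
        SpinLockAcquire(&lock->flex.mutex);

        if (lock->exclusive == 0 && lock->shared == 0)
        {
            /*
             * Uncontended: clear our advertised xid right here, under the
             * spinlock (ProcArrayEndTransaction() normally resets a few more
             * fields than shown).  The lock state never changes, so there is
             * nothing to release afterwards - one spinlock cycle, not two.
             */
            proc->xid = InvalidTransactionId;
            proc->xmin = InvalidTransactionId;
            SpinLockRelease(&lock->flex.mutex);
            return;
        }

        /*
         * Contended: queue ourselves and go to sleep.  Whoever releases
         * ProcArrayLock will do our cleanup and wake us, so we never take
         * the lock at all.
         */
        proc->lwWaitLink = lock->endingList;    /* borrowed field, for the sketch */
        lock->endingList = proc;
        SpinLockRelease(&lock->flex.mutex);

        /* ... block on proc->sem until the releaser wakes us ... */
    }

    /* Called by the "last one out" when ProcArrayLock is released. */
    static void
    sketch_wake_ending_xacts(ProcArrayLockSketch *lock)
    {
        PGPROC     *head;
        PGPROC     *p;
        PGPROC     *next;

        SpinLockAcquire(&lock->flex.mutex);
        head = lock->endingList;
        lock->endingList = NULL;

        /* Clear everyone's fields while we still hold the spinlock... */
        for (p = head; p != NULL; p = p->lwWaitLink)
        {
            p->xid = InvalidTransactionId;
            p->xmin = InvalidTransactionId;
        }
        SpinLockRelease(&lock->flex.mutex);

        /* ...then wake them; they have nothing further to do with the lock. */
        for (p = head; p != NULL; p = next)
        {
            next = p->lwWaitLink;
            PGSemaphoreUnlock(&p->sem);
        }
    }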