Thread: dynamic shared memory and locks
One of the things that you might want to do with dynamic shared memory is store a lock in it. In fact, my bet is that almost everything that uses dynamic shared memory will want to do precisely that, because, of course, it's dynamic *shared* memory, which means that it is concurrently accessed by multiple processes, which tends to require locking. Typically, what you're going to want are either spinlocks (for very short critical sections) or lwlocks (for longer ones). It doesn't really make sense to talk about storing heavyweight locks in dynamic shared memory, because we're talking about storing locks with the data structures that they protect, and heavyweight locks are used to protect database or shared objects, not shared memory structures. Of course, someone might think of trying to provide a mechanism for the heavyweight lock manager to overflow to dynamic shared memory, but that's a different thing altogether and not what I'm talking about here.

Right now, storing spinlocks in dynamic shared memory *almost* works, but there are problems with --disable-spinlocks. In that configuration, we use semaphores to simulate spinlocks. Every time someone calls SpinLockInit(), it's going to allocate a new semaphore which will never be returned to the operating system, so you're pretty quickly going to run out. There are a couple of things we could do about this:

1. Decide we don't care. If you compile with --disable-spinlocks, and then you try to use dynamic shared memory, it's going to leak semaphores until none remain, and then start failing from there until the postmaster is restarted. If you don't like that, provide a working spinlock implementation for your platform.

2. Forbid the use of dynamic shared memory when compiling with --disable-spinlocks. This is a more polite version of #1. It seems likely to me that nearly every piece of code that uses dynamic shared memory will require locking. Instead of letting people allocate dynamic shared memory segments anyway and then having them start failing shortly after postmaster startup, we could just head the problem off at the pass by denying the request for dynamic shared memory in the first place. Dynamic shared memory allocation can always fail (e.g. because we're out of memory) and also has an explicit off switch that will make all requests fail (dynamic_shared_memory_type=none), so any code that uses dynamic shared memory has to be prepared for a failure at that point, whereas a failure in SpinLockInit() might be more surprising.

3. Provide an inverse for SpinLockInit, say SpinLockDestroy, and require all code written for dynamic shared memory to invoke this function on every spinlock before the shared memory segment is destroyed. I initially thought that this could be done using the on_dsm_detach infrastructure, but it turns out that doesn't really work. The on_dsm_detach infrastructure is designed to make sure that you *release* all of your locks when detaching - i.e. those hooks get invoked for each process that detaches. For this, you'd need an on_dsm_final_detach callback that gets called only for the very last detach (and after prohibiting any other processes from attaching).
I can certainly engineer all that, but it's a decent amount of extra work for everyone who wants to use dynamic shared memory to write the appropriate callback, and because few people actually use --disable-spinlocks, I think those callbacks will tend to be rather lightly tested and thus a breeding ground for marginal bugs that nobody's terribly excited about fixing.

4. Drop support for --disable-spinlocks.

For what it's worth, my vote is currently for #2. I can't think of many interesting things to do with dynamic shared memory without having at least spinlocks, so I don't think we'd be losing much. #1 seems needlessly unfriendly, #3 seems like a lot of work for not much, and #4 seems excessive at least as a solution to this particular problem, though there may be other arguments for it. What do others think?

I think we're also going to want to be able to create LWLocks in dynamic shared memory: some critical sections won't be short enough or safe enough to be protected by spinlocks. At some level this seems easy: change LWLockAcquire and friends to accept an LWLock * rather than an LWLockId, and similarly change held_lwlocks[] to hold LWLock pointers rather than LWLockIds. One tricky point is that you'd better not try to detach a shared memory segment while you're holding lwlocks inside that segment, but I think just making that a coding rule shouldn't cause any great problem. Conversely, you'd better release all lwlocks in the segment before detaching it, but this seems mostly OK: throwing an error will call LWLockReleaseAll before doing the resource manager cleanups that will unmap the dynamic shared memory segment. There may be corner cases I haven't thought about, though.

A bigger problem is that I think we want to avoid having a large amount of notational churn. The obvious way to do that is to get rid of the LWLockId array and instead declare each fixed LWLock separately as e.g. LWLock *ProcArrayLock. However, creating a large number of new globals that will need to be initialized in every new EXEC_BACKEND process seems irritating. So maybe a better idea is to do something like this:

#define BufFreelistLock (&fixedlwlocks[0])
#define ShmemIndexLock (&fixedlwlocks[1])
...
#define AutoFileLock (&fixedlwlocks[36])
#define NUM_FIXED_LWLOCKS 37

Comments, suggestions?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
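A minimal sketch, in C, of the fixed-LWLock macro scheme described above. The array name fixedlwlocks and the exact lock list are taken from the sketch in the message; everything else here is an illustrative assumption, not the API that was eventually committed:

    /* Illustrative only: all fixed LWLocks live in one array in the main
     * shared memory segment, and the old symbolic names become pointers into
     * it, so existing call sites don't change. */
    typedef struct LWLock LWLock;       /* real definition stays in lwlock.c/.h */

    extern LWLock *fixedlwlocks;        /* single base pointer set at shmem init */

    #define BufFreelistLock   (&fixedlwlocks[0])
    #define ShmemIndexLock    (&fixedlwlocks[1])
    /* ... one macro per fixed lock ... */
    #define AutoFileLock      (&fixedlwlocks[36])
    #define NUM_FIXED_LWLOCKS 37

    /* Only the one base pointer needs re-initialization in an EXEC_BACKEND
     * child, rather than 37 separate LWLock * globals. */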
On 2014-01-05 12:56:05 -0500, Robert Haas wrote: > Right now, storing spinlocks in dynamic shared memory *almost* works, > but there are problems with --disable-spinlocks. In that > configuration, we use semaphores to simulate spinlocks. Every time > someone calls SpinLockInit(), it's going to allocate a new semaphore > which will never be returned to the operating system, so you're pretty > quickly going to run out. There are a couple of things we could do > about this: > 4. Drop support for --disable-spinlocks. I very strongly vote 4). I think we're going to hit this more and more often and it's a facility that benefits almost nobody. Just about every new platform will be/is on gcc or clang and you can just duplicate the compiler provided generic implementation we have for arm for there. The atomics implementation make this an automatic fallback if there's no compiler specific variant around. > I think we're also going to want to be able to create LWLocks in > dynamic shared memory: some critical sections won't be short enough or > safe enough to be protected by spinlocks. Agreed. > At some level this seems easy: change LWLockAcquire and friends to > accept an LWLock * rather than an LWLockId, and similarly change > held_lwlocks[] to hold LWLock pointers rather than LWLockIds. My primary reason isn't dsm TBH but wanting to embed the buffer lwlocks in the bufferdesc, on the same cacheline as the buffer headers spinlock. All the embedded ones can be allocated without padding, while the relatively low number of non-embedded ones can be padded to the full cacheline size. > A bigger problem is that I think we > want to avoid having a large amount of notational churn. The obvious > way to do that is to get rid of the LWLockId array and instead declare > each fixed LWLock separately as e.g. LWLock *ProcArrayLock. However, > creating a large number of new globals that will need to be > initialized in every new EXEC_BACKEND process seems irritating. So > maybe a better idea is to do something like this: > #define BufFreelistLock (&fixedlwlocks[0]) > #define ShmemIndexLock (&fixedlwlocks[1]) > ... > #define AutoFileLock (&fixedlwlocks[36]) > #define NUM_FIXED_LWLOCKS 37 > > Comments, suggestions? My idea here was to just have two APIs, a legacy one that works like the current one, and a new one that locks the lwlocks passed in by a pointer. After all, LWLockAssign() which you use in extensions currently returns a LWLockId. Seems ugly to turn that into a pointer. But perhaps your idea is better anyway, no matter the hackery of turning LWLockId into a pointer. Greetings, Andres Freund
Robert Haas <robertmhaas@gmail.com> writes: > For what it's worth, my vote is currently for #2. I can't think of > many interesting to do with dynamic shared memory without having at > least spinlocks, so I don't think we'd be losing much. #1 seems > needlessly unfriendly, #3 seems like a lot of work for not much, and > #4 seems excessive at least as a solution to this particular problem, > though there may be other arguments for it. What do others think? I agree with this position. There may be some good reason to drop --disable-spinlocks altogether in future, but DSM isn't a sufficient excuse. > I think we're also going to want to be able to create LWLocks in > dynamic shared memory: some critical sections won't be short enough or > safe enough to be protected by spinlocks. At some level this seems > easy: change LWLockAcquire and friends to accept an LWLock * rather > than an LWLockId, and similarly change held_lwlocks[] to hold LWLock > pointers rather than LWLockIds. I seem to recall that there was some good reason for keeping all the LWLocks in an array, back when the facility was first designed. I'm too lazy to research the point right now, but you might want to go back and look at the archives around when lwlock.c was written. > creating a large number of new globals that will need to be > initialized in every new EXEC_BACKEND process seems irritating. This might've been the good reason, but not sure --- I think LWLocks predate our Windows support. In the end, though, that decision was taken before we were concerned about being able to add new LWLocks on the fly, which is what this is really about (whether they're stored in DSM or not is a secondary point). The pressure for that has gotten strong enough that it may be time to change the tradeoff. regards, tom lane
On 2014-01-05 14:06:52 -0500, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > For what it's worth, my vote is currently for #2. I can't think of > > many interesting to do with dynamic shared memory without having at > > least spinlocks, so I don't think we'd be losing much. #1 seems > > needlessly unfriendly, #3 seems like a lot of work for not much, and > > #4 seems excessive at least as a solution to this particular problem, > > though there may be other arguments for it. What do others think? > > I agree with this position. There may be some good reason to drop > --disable-spinlocks altogether in future, but DSM isn't a sufficient > excuse. Agreed that DSM isn't sufficient cause. The reasons for removing it for future reasons I see are: * It's not tested at all and it has been partially broken for significants of time. Afair Heikki just noticed the breakageaccidentally when adding enough new spinlocks recently. * It's showed up as problematic in a couple of patches while adding not much value (at least dsm, atomic ops, afair someothers) * It's slow enough that it's not of a practical value. * Implementing simple support for spinlocks on realistic platforms isn't hard these days due to compiler intrinsics. * The platforms that don't have a barrier implementation will rely on spinlocks. And for correctness those spinlocks shouldemploy barriers. That might be more of an argument for getting rid of that fallback tho. > > I think we're also going to want to be able to create LWLocks in > > dynamic shared memory: some critical sections won't be short enough or > > safe enough to be protected by spinlocks. At some level this seems > > easy: change LWLockAcquire and friends to accept an LWLock * rather > > than an LWLockId, and similarly change held_lwlocks[] to hold LWLock > > pointers rather than LWLockIds. > > I seem to recall that there was some good reason for keeping all the > LWLocks in an array, back when the facility was first designed. > I'm too lazy to research the point right now, but you might want to > go back and look at the archives around when lwlock.c was written. Your proposal is at http://www.postgresql.org/message-id/1054.1001520600@sss.pgh.pa.us - afaics there hasn't been much discussion about implementation details at that level. > In the end, though, that decision was taken before we were concerned > about being able to add new LWLocks on the fly, which is what this is > really about (whether they're stored in DSM or not is a secondary point). > The pressure for that has gotten strong enough that it may be time to > change the tradeoff. I personally find the sharing of a cacheline between a buffer headers spinlock and the lwlock much more interesting than DSM... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jan 5, 2014 at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I seem to recall that there was some good reason for keeping all the
> LWLocks in an array, back when the facility was first designed.
> I'm too lazy to research the point right now, but you might want to
> go back and look at the archives around when lwlock.c was written.

To some extent it's an orthogonal question. It's true that there isn't much point in using LWLock * rather than LWLockId to refer to LWLocks unless we wish to be able to store them outside the shared memory segment, but the reverse is not true: just because we have the ability to move things outside the main array in the general case doesn't make it a good idea in any particular case. Andres's email seems to indicate that he sees performance advantages in moving buffer locks elsewhere, and I have a sneaking suspicion that we'll find that it's more convenient to move some other things around as well, but that's policy, not mechanism.

Very little of the core LWLock machinery actually cares how the locks are stored, so the attached patch to make it use LWLock * rather than LWLockId as a handle is pretty straightforward. The only real problem I see here is that we occasionally *print out* LWLockIds as a way of identifying lwlocks. This is not a great system, but printing out pointers is almost certainly much worse (e.g. because of ASLR). The cases where this is currently an issue are:

- You try to release a lwlock you haven't acquired. We elog(ERROR) the ID.
- You define LWLOCK_STATS. The resulting reports print the lock ID.
- You define LOCK_DEBUG and set trace_lwlocks=true. We print the lock ID in the trace messages.
- You compile with --enable-dtrace. We pass the lock IDs to the dtrace probes.

In the attached patch I handled the first case by printing the pointer (which I don't feel too bad about) and the remaining three cases by leaving them broken. I wouldn't shed a tear about ripping out trace_lwlocks, but LWLOCK_STATS is useful and I bet somebody is using --enable-dtrace, so we probably need to fix those cases at least.

I suppose one option is to make LWLOCK_STATS and the dtrace probes only look at locks in the main array and just ignore everything else. But that's kind of crappy, especially if we're going to soon move buffer locks out of the main array.

Another idea is to include some identifying information in the lwlock. For example, each lwlock could have a char *name in it, and we could print the name. In theory, this could be a big step forward in terms of usability, because it'd spare us all needing to remember that 4 == ProcArrayLock. But it's awkward for buffer locks, of which there might be a great many, and we surely don't want to allocate a dynamically-generated string in shared memory for each one. You could do a bit better by making the identifying information a string plus an integer, because then all the buffer locks could set the string to a static constant like "buffer content lock" and the integer to the buffer number, and similarly for lock manager partition locks and so on. This is appealing, but would increase the size of LWLockPadded from 16 bytes to 32 on 64-bit platforms where slock_t is four bytes or less, which I'm not that excited about even though it would reduce cache line contention in some cases.

Yet a third idea is to try to reverse-engineer a name for a given lwlock from the pointer address.
If it's an offset into the main array, this is easy enough to do, and even if we ended up with several arrays (like one for bufmgr locks) it wouldn't be too hard to write code to figure out which array contains it and emit the appropriate string. The only real problem that I see with this is that it might cause a performance hit. A performance hit when running with trace_lwlocks or LWLOCK_STATS is not really a problem, but people won't like it if --enable-dtrace slows things down.

Preferences, other ideas?

None of these ideas are a complete solution for LWLOCK_STATS. In the other three cases noted above, we only need an identifier for the lock "instantaneously", so that we can pass it off to the logger or dtrace or whatever. But LWLOCK_STATS wants to hold on to data about the locks that were visited until the end of the session, and it does that using an array that is *indexed* by lwlockid. I guess we could replace that with a hash table. Ugh. Any suggestions?

(Incidentally, while developing this patch I found a bug in the current code: lock.c iterates over all PGPROCs from 0 to ProcGlobal->allProcCount and takes the backendLock for each, but if max_prepared_transactions>0 then the PGPROCs for prepared transactions do not have a backendLock and so we take and release BufFreelistLock - i.e. 0 - a number of times equal to max_prepared_transactions. That's not cool.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
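For reference, a sketch of the "string plus an integer" variant discussed above, to make the size tradeoff concrete; the field names are invented for illustration and the existing fields are only summarized:

    /* Illustrative only: identifying info embedded directly in each lock.
     * On 64-bit platforms the extra pointer + int is what pushes LWLockPadded
     * from 16 bytes to 32, as noted in the message above. */
    typedef struct LWLock
    {
        /* ... existing fields: mutex, shared/exclusive counts, wait queue ... */
        const char *id_string;      /* static constant, e.g. "buffer content lock" */
        int         id_instance;    /* e.g. buffer number, or index in main array */
    } LWLock;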
On 01/05/2014 07:56 PM, Robert Haas wrote:
> Right now, storing spinlocks in dynamic shared memory *almost* works,
> but there are problems with --disable-spinlocks. In that
> configuration, we use semaphores to simulate spinlocks. Every time
> someone calls SpinLockInit(), it's going to allocate a new semaphore
> which will never be returned to the operating system, so you're pretty
> quickly going to run out. There are a couple of things we could do
> about this:

5. Allocate a fixed number of semaphores, and multiplex them to emulate any number of spinlocks. For example, allocate 100 semaphores, and use the address of the spinlock modulo 100 to calculate which semaphore to use to emulate that spinlock.

That assumes that you never hold more than one spinlock at a time, otherwise you can get deadlocks. I think that assumption holds currently, because acquiring two spinlocks at a time would be bad on performance grounds anyway.

- Heikki
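A minimal sketch of the mapping Heikki describes, assuming a pool of 100 pre-created semaphores; the function name is illustrative:

    #include <stdint.h>

    typedef int slock_t;                 /* what the semaphore emulation uses */

    #define NUM_SPINLOCK_SEMAPHORES 100  /* fixed pool created at startup */

    /* Pick which pre-created semaphore emulates this spinlock, purely from
     * its address; two spinlocks may share a semaphore, which is safe only
     * if no process ever holds two spinlocks at once. */
    static int
    spinlock_semaphore_index(volatile slock_t *lock)
    {
        return (int) ((uintptr_t) lock % NUM_SPINLOCK_SEMAPHORES);
    }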
On 2014-01-06 10:35:59 +0200, Heikki Linnakangas wrote: > On 01/05/2014 07:56 PM, Robert Haas wrote: > >Right now, storing spinlocks in dynamic shared memory *almost* works, > >but there are problems with --disable-spinlocks. In that > >configuration, we use semaphores to simulate spinlocks. Every time > >someone calls SpinLockInit(), it's going to allocate a new semaphore > >which will never be returned to the operating system, so you're pretty > >quickly going to run out. There are a couple of things we could do > >about this: > > 5. Allocate a fixed number of semaphores, and multiplex them to emulate any > number of spinlocks. For example, allocate 100 semaphores, and use the > address of the spinlock modulo 100 to calculate which semaphore to use to > emulate that spinlock. > > That assumes that you never hold more than one spinlock at a time, otherwise > you can get deadlocks. I think that assumptions holds currently, because > acquiring two spinlocks at a time would be bad on performance grounds > anyway. I think that's a pretty dangerous assumption - and one which would only break without just about anybody noticing because nobody tests --disable-spinlock builds regularly. We could improve on that by adding cassert checks against nested spinlocks, but that'd be a far too high price for little benefit imo. I am not convinced by the performance argument - there's lots of spinlock that are rarely if ever contended. If you lock two such locks in a consistent nested fashion there won't be a performance problem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
* Robert Haas (robertmhaas@gmail.com) wrote: > Another idea is to include some identifying information in the lwlock. That was my immediate reaction to this issue... > For example, each lwlock could have a char *name in it, and we could > print the name. In theory, this could be a big step forward in terms > of usability, because it'd spare us all needing to remember that 4 == > ProcArrayLock. But it's awkward for buffer locks, of which there > might be a great many, and we surely don't want to allocate a > dynamically-generated string in shared memory for each one. You could > do a bit better by making the identifying information a string plus an > integer, because then all the buffer locks could set the string to a > static constant like "buffer content lock" and the integer to the > buffer number, and similarly for lock manager partition locks and so > on. This is appealing, but would increase the size of LWLockPadded > from 16 bytes to 32 on 64-bit platforms where slock_t is four bytes or > less, which I'm not that excited about even though it would reduce > cache line contention in some cases. I'm not thrilled with this either but at the same time, it may well be worth it. > Yet a third idea is to try to reverse-engineer a name for a given > lwlock from the pointer address. If it's an offset into the main > array, this is easy enough to do, and even if we ended up with several > arrays (like one for bufmgr locks) it wouldn't be too hard to write > code to figure out which array contains it and emit the appropriate > string. The only real problem that I see with this is that it might > cause a performance hit. A performance hit when running with > trace_lwlocks or LWLOCK_STATS is not really a problem, but people > won't like if --enable-dtrace slow things up. This seems like an interesting idea- but this and my other thought (having a defined array of strings) seem like they might fall over in the face of DSMs. > Preferences, other ideas? My preference would generally be "use more memory rather than CPU time"; so I'd vote for adding identifying info rather than having the code hunt through arrays to try and find it. > None of these ideas are a complete solution for LWLOCK_STATS. In the > other three cases noted above, we only need an identifier for the lock > "instantaneously", so that we can pass it off to the logger or dtrace > or whatever. But LWLOCK_STATS wants to hold on to data about the > locks that were visited until the end of the session, and it does that > using an array that is *indexed* by lwlockid. I guess we could > replace that with a hash table. Ugh. Any suggestions? Yeah, that's not fun. No good suggestions here offhand. Thanks, Stephen
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-01-06 10:35:59 +0200, Heikki Linnakangas wrote: >> That assumes that you never hold more than one spinlock at a time, otherwise >> you can get deadlocks. I think that assumptions holds currently, because >> acquiring two spinlocks at a time would be bad on performance grounds >> anyway. > I think that's a pretty dangerous assumption I think it's a fine assumption. Code that could possibly do that should never get within hailing distance of being committed, because you are only supposed to have short straight-line bits of code under a spinlock. Taking another spinlock breaks both the "straight line code" and the "no loops" aspects of that coding rule. regards, tom lane
On 2014-01-06 09:59:49 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-01-06 10:35:59 +0200, Heikki Linnakangas wrote: > >> That assumes that you never hold more than one spinlock at a time, otherwise > >> you can get deadlocks. I think that assumptions holds currently, because > >> acquiring two spinlocks at a time would be bad on performance grounds > >> anyway. > > > I think that's a pretty dangerous assumption > > I think it's a fine assumption. Code that could possibly do that should > never get within hailing distance of being committed, because you are only > supposed to have short straight-line bits of code under a spinlock. > Taking another spinlock breaks both the "straight line code" and the "no > loops" aspects of that coding rule. I think it's fair to say that no such code should be accepted - but I personally don't trust the review process to prevent such cases and not all spinlock usages are obvious (e.g. LockBufferHdr). And having rules that only cause breakage in configurations that nobody ever uses doesn't inspire much trust in that configuration. If we go that way, we should add an Assert preventing recursive spinlock acquiration... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 6, 2014 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@2ndquadrant.com> writes: >> On 2014-01-06 10:35:59 +0200, Heikki Linnakangas wrote: >>> That assumes that you never hold more than one spinlock at a time, otherwise >>> you can get deadlocks. I think that assumptions holds currently, because >>> acquiring two spinlocks at a time would be bad on performance grounds >>> anyway. > >> I think that's a pretty dangerous assumption > > I think it's a fine assumption. Code that could possibly do that should > never get within hailing distance of being committed, because you are only > supposed to have short straight-line bits of code under a spinlock. > Taking another spinlock breaks both the "straight line code" and the "no > loops" aspects of that coding rule. OK, so we could do this. But should we? If we're going to get rid of --disable-spinlocks for other reasons, then there's not much point in spending the time to make it work with dynamic shared memory segments that contain locks. In fact, even if we *aren't* going to get rid of it, it's not 100% clear to me that there's much point in supporting that intersection: a big point of the point of dsm is to enable parallelism, and the point of parallelism is to be fast, and if you want things to be fast, you should probably have working spinlocks. Of course, there are some other applications for dynamic shared memory as well, so this isn't a watertight argument, and in a quick test on my laptop, the performance of --disable-spinlocks wasn't horrible, so this argument isn't watertight. I guess the question boils down to: why are we keeping --disable-spinlocks around? If we're expecting that people might really use it for serious work, then it needs to remain and it needs to work with dynamic shared memory. If we're expecting that people might use it while porting PostgreSQL to a new platform but only as an aid to get things up and running, then it needs to remain, but it's probably OK for dynamic shared memory not to work when this option is in use. If nobody uses it at all, then we can just forget the whole thing. I guess my bottom line here is that if --disable-spinlocks is important to somebody, then I'm willing to expend some effort on keeping it working. But if it's just inertia, then I'd rather not spend a lot of time on it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I guess the question boils down to: why are we keeping > --disable-spinlocks around? If we're expecting that people might > really use it for serious work, then it needs to remain and it needs > to work with dynamic shared memory. If we're expecting that people > might use it while porting PostgreSQL to a new platform but only as an > aid to get things up and running, then it needs to remain, but it's > probably OK for dynamic shared memory not to work when this option is > in use. If nobody uses it at all, then we can just forget the whole > thing. I think we can eliminate the first of those. Semaphores for spinlocks were a performance disaster fifteen years ago, and the situation has surely only gotten worse since then. I do, however, believe that --disable-spinlocks has some use while porting to a new platform. (And I don't believe the argument advanced elsewhere in this thread that everybody uses gcc; much less that gcc's atomic intrinsics are of uniformly high quality even on oddball architectures.) Heikki's idea has some attraction for porting work whether or not you believe that DSM needs to work during initial porting. By default, PG will try to create upwards of ten thousand spinlocks just for buffer headers. It's likely that that will fail unless you whack around the kernel's SysV parameters. Being able to constrain the number of semaphores to something sane would probably be useful. Having said all that, I'm not personally going to take the time to implement it, and I don't think the DSM patch should be required to either. Somebody who actually needs it can go make it work. But I think Heikki's idea points the way forward, so I'm now against having the DSM code reject --disable-spinlocks. I think benign neglect is the way for now. regards, tom lane
On Mon, Jan 6, 2014 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think we can eliminate the first of those. Semaphores for spinlocks
> were a performance disaster fifteen years ago, and the situation has
> surely only gotten worse since then. I do, however, believe that
> --disable-spinlocks has some use while porting to a new platform.
> (And I don't believe the argument advanced elsewhere in this thread
> that everybody uses gcc; much less that gcc's atomic intrinsics are
> of uniformly high quality even on oddball architectures.)
>
> Heikki's idea has some attraction for porting work whether or not
> you believe that DSM needs to work during initial porting. By
> default, PG will try to create upwards of ten thousand spinlocks
> just for buffer headers. It's likely that that will fail unless
> you whack around the kernel's SysV parameters. Being able to
> constrain the number of semaphores to something sane would probably
> be useful.
>
> Having said all that, I'm not personally going to take the time to
> implement it, and I don't think the DSM patch should be required to
> either. Somebody who actually needs it can go make it work.

Well, I took a look at this and it turns out not to be very hard, so here's a patch. Currently, we allocate 3 semaphores per shared buffer and a bunch of others, but the 3 per shared buffer dominate, so you end up with ~49k spinlocks for the default of 128MB shared_buffers. I chose to peg the number of semaphores at 1024, which is quite small compared to the current allocation, but the number of spinlock acquisitions that can be in progress at any given time is limited by the number of running backends. Even allowing for the birthday paradox, that should be enough to run at least a few dozen backends without suffering serious problems due to the multiplexing - especially because in real workloads, contention is usually concentrated around a small number of spinlocks that are unlikely to all be mapped to the same underlying semaphore.

I'm happy enough with this way forward. Objections?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
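One plausible shape for that multiplexing, sketched in C: the slock_t simply stores which pool semaphore backs it, handed out round-robin at init time. The function name mirrors the existing semaphore-based fallback, but the details here are an illustrative assumption, not a description of the attached patch:

    #define NUM_SPINLOCK_SEMAPHORES 1024   /* pool created once at startup */

    typedef int slock_t;                   /* holds a pool index, not a flag */

    static void
    s_init_lock_sema(volatile slock_t *lock)
    {
        static unsigned int counter = 0;

        /* Hand out pool slots round-robin; two spinlocks sharing a slot is
         * harmless as long as no backend ever holds two spinlocks at once. */
        *lock = (slock_t) (counter++ % NUM_SPINLOCK_SEMAPHORES);
    }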
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-01-05 14:06:52 -0500, Tom Lane wrote:
>> I seem to recall that there was some good reason for keeping all the
>> LWLocks in an array, back when the facility was first designed.
>> I'm too lazy to research the point right now, but you might want to
>> go back and look at the archives around when lwlock.c was written.

> Your proposal is at
> http://www.postgresql.org/message-id/1054.1001520600@sss.pgh.pa.us -
> afaics there hasn't been much discussion about implementation details at
> that level.

Yeah, there wasn't :-(. On reflection I believe that my reason for building it like this was partially to reduce the number of pointer variables needed, and partially to avoid exposing the contents of the LWLock struct type to the rest of the backend. Robert's idea of continuing to keep the "fixed" LWLocks in an array would solve the first problem, but I don't think there's any way to avoid exposing the struct type if we want to embed the things into other structs, or even just rely on array indexing.

OTOH, the LWLock mechanism has been stable for long enough now that we can probably suppose this struct is no more subject to churn than any other widely-known one, so maybe that consideration is no longer significant.

Another nice benefit of switching to separate allocations is that we could get rid of the horrid kluge that lwlock.h is the place that defines NUM_BUFFER_PARTITIONS and some other constants that don't belong there.

So on the whole I'm in favor of this, as long as we can avoid any notational churn at most call sites. Should we get rid of RequestAddinLWLocks() and LWLockAssign(), in favor of telling extensions to just allocate some space and do LWLockInit() on it?

BTW, NumLWLocks() also becomes a big problem if lwlock creation gets decentralized. This is only used by SpinlockSemas() ... so maybe we need to use Heikki's modulo idea to get rid of the need for that to know the number of LWLocks, independently of DSM.

regards, tom lane
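For comparison, a sketch of what the "allocate some space and init it yourself" pattern might look like in an extension's shmem startup hook. LWLockInit() is the hypothetical initializer Tom names here, not an API that existed at the time, and MySharedState is an invented example:

    /* Sketch of an extension embedding its own LWLock instead of calling
     * LWLockAssign(). */
    typedef struct MySharedState
    {
        LWLock  lock;        /* embedded lock */
        int     counter;
    } MySharedState;

    static MySharedState *state;

    static void
    my_shmem_startup(void)
    {
        bool    found;

        state = ShmemInitStruct("my_extension", sizeof(MySharedState), &found);
        if (!found)
        {
            LWLockInit(&state->lock);   /* hypothetical initializer */
            state->counter = 0;
        }
    }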
Robert Haas <robertmhaas@gmail.com> writes: > Well, I took a look at this and it turns out not to be very hard, so > here's a patch. Currently, we allocate 3 semaphore per shared buffer > and a bunch of others, but the 3 per shared buffer dominates, so you > end up with ~49k spinlocks for the default of 128MB shared_buffers. I > chose to peg the number of semaphores at 1024, which is quite small > compared to the current allocation, but the number of spinlock > allocations that can be in progress at any given time is limited by > the number of running backends. Even allowing for the birthday > paradox, that should be enough to run at least a few dozen backends > without suffering serious problems due to the multiplexing - > especially because in real workloads, contention is usually > concentrated around a small number of spinlocks that are unlikely to > all be mapped to the same underlying semaphore. > I'm happy enough with this way forward. Objections? -1 for the any_spinlock_held business (useless overhead IMO, as it doesn't have anything whatsoever to do with enforcing the actual coding rule). And I'd suggest defining NUM_SPINLOCK_SEMAPHORES in pg_config_manual.h, and maybe dropping SpinlockSemas() altogether in favor of just referencing the constant. Otherwise this seems reasonable. regards, tom lane
On Mon, Jan 6, 2014 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Well, I took a look at this and it turns out not to be very hard, so >> here's a patch. Currently, we allocate 3 semaphore per shared buffer >> and a bunch of others, but the 3 per shared buffer dominates, so you >> end up with ~49k spinlocks for the default of 128MB shared_buffers. I >> chose to peg the number of semaphores at 1024, which is quite small >> compared to the current allocation, but the number of spinlock >> allocations that can be in progress at any given time is limited by >> the number of running backends. Even allowing for the birthday >> paradox, that should be enough to run at least a few dozen backends >> without suffering serious problems due to the multiplexing - >> especially because in real workloads, contention is usually >> concentrated around a small number of spinlocks that are unlikely to >> all be mapped to the same underlying semaphore. > >> I'm happy enough with this way forward. Objections? > > -1 for the any_spinlock_held business (useless overhead IMO, as it doesn't > have anything whatsoever to do with enforcing the actual coding rule). Hmm. I thought that was a pretty well-aimed bullet myself; why do you think that it isn't? I don't particularly mind ripping it out, but it seemed like a good automated test to me. > And I'd suggest defining NUM_SPINLOCK_SEMAPHORES in pg_config_manual.h, > and maybe dropping SpinlockSemas() altogether in favor of just referencing > the constant. Otherwise this seems reasonable. As far as pg_config_manual.h is concerned, is this the sort of thing you have in mind? #ifndef HAVE_SPINLOCKS #define NUM_SPINLOCK_SEMAPHORES 1024 #endif -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jan 6, 2014 at 1:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > OTOH, the LWLock mechanism has been stable for long enough now that > we can probably suppose this struct is no more subject to churn than > any other widely-known one, so maybe that consideration is no longer > significant. On the whole, I'd say it's been more stable than most. But even if we do decide to change it, I'm not sure that really matters very much. We whack widely-used data structures around fairly regularly, and I haven't observed that to cause many problems. Just as one concrete example, PGPROC changes pretty regularly, and I'm not aware of much code that's been broken by that. > Should we get rid of RequestAddinLWLocks() and LWLockAssign(), in > favor of telling extensions to just allocate some space and do > LWLockInit() on it? I don't really see any reason to deprecate that mechanism - it'll just make it harder for people who want to write code that works on multiple PostgreSQL releases to do so easily, and it's my belief that there's a decent amount of third-party code that uses those APIs. EnterpriseDB has some, for example. Broadly, I don't see any problem with presuming that most code that uses lwlocks will be happy to keep those lwlocks in the main array while allowing them to be stored elsewhere in those cases where that's not convenient. Perhaps in the long run that won't be true, but then again perhaps it will. Either way, I don't see a real reason to rush into a change in policy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jan 6, 2014 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> -1 for the any_spinlock_held business (useless overhead IMO, as it doesn't >> have anything whatsoever to do with enforcing the actual coding rule). > Hmm. I thought that was a pretty well-aimed bullet myself; why do you > think that it isn't? I don't particularly mind ripping it out, but it > seemed like a good automated test to me. The coding rule is "only short straight-line code while holding a spinlock". That could be violated in any number of nasty ways without trying to take another spinlock. And since the whole point of the rule is that spinlock-holding code segments should be quick, adding more overhead to them doesn't seem very nice, even if it is only done in Assert builds. I agree it'd be nicer if we had some better way than mere manual inspection to enforce proper use of spinlocks; but this change doesn't seem to me to move the ball downfield by any meaningful distance. >> And I'd suggest defining NUM_SPINLOCK_SEMAPHORES in pg_config_manual.h, >> and maybe dropping SpinlockSemas() altogether in favor of just referencing >> the constant. Otherwise this seems reasonable. > As far as pg_config_manual.h is concerned, is this the sort of thing > you have in mind? > #ifndef HAVE_SPINLOCKS > #define NUM_SPINLOCK_SEMAPHORES 1024 > #endif I think we can just define it unconditionally, don't you? It shouldn't get referenced in HAVE_SPINLOCK builds. Or is the point that you want to enforce that? regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jan 6, 2014 at 1:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> OTOH, the LWLock mechanism has been stable for long enough now that >> we can probably suppose this struct is no more subject to churn than >> any other widely-known one, so maybe that consideration is no longer >> significant. > On the whole, I'd say it's been more stable than most. But even if we > do decide to change it, I'm not sure that really matters very much. Actually, the real value of a module-local struct definition is that you can be pretty darn sure that nothing except the code in that file is manipulating the struct contents. I would've preferred that we expose only an abstract struct definition, but don't quite see how to do that if we're going to embed the things in buffer headers. regards, tom lane
On Mon, Jan 6, 2014 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Jan 6, 2014 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> -1 for the any_spinlock_held business (useless overhead IMO, as it doesn't >>> have anything whatsoever to do with enforcing the actual coding rule). > >> Hmm. I thought that was a pretty well-aimed bullet myself; why do you >> think that it isn't? I don't particularly mind ripping it out, but it >> seemed like a good automated test to me. > > The coding rule is "only short straight-line code while holding a > spinlock". That could be violated in any number of nasty ways without > trying to take another spinlock. And since the whole point of the rule is > that spinlock-holding code segments should be quick, adding more overhead > to them doesn't seem very nice, even if it is only done in Assert builds. > > I agree it'd be nicer if we had some better way than mere manual > inspection to enforce proper use of spinlocks; but this change doesn't > seem to me to move the ball downfield by any meaningful distance. Well, my thought was mainly that, while it may be a bad idea to take another spinlock while holding a spinlock under any circumstances, somebody might do it and it might appear to work just fine. The most likely sequences seems to me to be something like SpinLockAcquire(...) followed by LWLockConditionalAcquire(), thinking that things are OK because the lock acquisition is conditional - but in fact the conditional acquire still takes the spinlock unconditionally. Even so, without --disable-spinlocks, this will probably work just fine. Most likely there won't even be a performance problem, because a lwlock has to be *very* hot for the spinlock acquisitions to become a problem. But with --disable-spinlocks, it will unpredictably self-deadlock. And since few developers are likely to test with --disable-spinlocks, the problem most likely won't be noticed. When someone does then try to use --disable-spinlocks to port to a new problem and starts hitting the deadlocks, they may not know enough to attribute them to the real cause. All in all it doesn't seem like unduly expensive insurance, but I can remove it if the consensus is against it. >>> And I'd suggest defining NUM_SPINLOCK_SEMAPHORES in pg_config_manual.h, >>> and maybe dropping SpinlockSemas() altogether in favor of just referencing >>> the constant. Otherwise this seems reasonable. > >> As far as pg_config_manual.h is concerned, is this the sort of thing >> you have in mind? > >> #ifndef HAVE_SPINLOCKS >> #define NUM_SPINLOCK_SEMAPHORES 1024 >> #endif > > I think we can just define it unconditionally, don't you? It shouldn't > get referenced in HAVE_SPINLOCK builds. Or is the point that you want > to enforce that? I don't think it really matters much; seems like a style question to me. That's why I asked. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
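To make the failure mode concrete, here is the kind of sequence being described, sketched in C; MySharedStruct and SomeSharedLock are invented names for illustration:

    /* LWLockConditionalAcquire() begins by taking the lwlock's own internal
     * spinlock unconditionally, so this nests two spinlocks.  With real
     * spinlocks it merely looks harmless; under --disable-spinlocks with
     * multiplexed semaphores, the two can map to the same semaphore and
     * self-deadlock. */
    static void
    risky_example(MySharedStruct *shared)
    {
        SpinLockAcquire(&shared->mutex);

        if (LWLockConditionalAcquire(SomeSharedLock, LW_EXCLUSIVE))
        {
            /* ... brief work ... */
            LWLockRelease(SomeSharedLock);
        }

        SpinLockRelease(&shared->mutex);
    }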
On Mon, Jan 6, 2014 at 3:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Jan 6, 2014 at 1:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> OTOH, the LWLock mechanism has been stable for long enough now that >>> we can probably suppose this struct is no more subject to churn than >>> any other widely-known one, so maybe that consideration is no longer >>> significant. > >> On the whole, I'd say it's been more stable than most. But even if we >> do decide to change it, I'm not sure that really matters very much. > > Actually, the real value of a module-local struct definition is that you > can be pretty darn sure that nothing except the code in that file is > manipulating the struct contents. I would've preferred that we expose > only an abstract struct definition, but don't quite see how to do that > if we're going to embed the things in buffer headers. Agreed. I think it's good to keep struct definitions private as much as possible, and I do. But I don't think this is going to cause a big problem either; lwlocks have been around for a long time and the conventions for using them are pretty well understood, I think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jan 6, 2014 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I agree it'd be nicer if we had some better way than mere manual >> inspection to enforce proper use of spinlocks; but this change doesn't >> seem to me to move the ball downfield by any meaningful distance. > Well, my thought was mainly that, while it may be a bad idea to take > another spinlock while holding a spinlock under any circumstances, > somebody might do it and it might appear to work just fine. The most > likely sequences seems to me to be something like SpinLockAcquire(...) > followed by LWLockConditionalAcquire(), thinking that things are OK > because the lock acquisition is conditional - but in fact the > conditional acquire still takes the spinlock unconditionally. The point I'm making is that no such code should get past review, whether it's got an obvious performance problem or not. There's a huge amount of stuff that in principle we could add overhead to Assert-enabled builds for, but we prefer not to --- an example not too far afield from this issue is that there's no mechanism for detecting deadlocks among multiple LWLock holders. If we go too far down the path of adding useless checks to Assert builds, people will stop using such builds for routine development, which would surely be a large net negative reliability-wise. I'd be okay with adding overhead to SpinLockAcquire/Release if it had some meaningful probability of catching real bugs, but I don't see that that claim can be made here. regards, tom lane
On Mon, Jan 6, 2014 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Jan 6, 2014 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I agree it'd be nicer if we had some better way than mere manual >>> inspection to enforce proper use of spinlocks; but this change doesn't >>> seem to me to move the ball downfield by any meaningful distance. > >> Well, my thought was mainly that, while it may be a bad idea to take >> another spinlock while holding a spinlock under any circumstances, >> somebody might do it and it might appear to work just fine. The most >> likely sequences seems to me to be something like SpinLockAcquire(...) >> followed by LWLockConditionalAcquire(), thinking that things are OK >> because the lock acquisition is conditional - but in fact the >> conditional acquire still takes the spinlock unconditionally. > > The point I'm making is that no such code should get past review, > whether it's got an obvious performance problem or not. Sure, I agree, but we all make mistakes. It's just a judgement call as to how likely you think it is that someone might make this particular mistake, a topic upon which opinions may vary. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jan 6, 2014 at 9:48 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> None of these ideas are a complete solution for LWLOCK_STATS. In the
>> other three cases noted above, we only need an identifier for the lock
>> "instantaneously", so that we can pass it off to the logger or dtrace
>> or whatever. But LWLOCK_STATS wants to hold on to data about the
>> locks that were visited until the end of the session, and it does that
>> using an array that is *indexed* by lwlockid. I guess we could
>> replace that with a hash table. Ugh. Any suggestions?
>
> Yeah, that's not fun. No good suggestions here offhand.

Replacing it with a hashtable turns out not to be too bad, either in terms of code complexity or performance, so I think that's the way to go. I did some test runs with pgbench -S, scale factor 300, 32 clients, shared_buffers=8GB, five-minute runs, and got these results:

resultsr.lwlock-stats.32.300.300:tps = 195493.037962 (including connections establishing)
resultsr.lwlock-stats.32.300.300:tps = 189985.964658 (including connections establishing)
resultsr.lwlock-stats.32.300.300:tps = 197641.293892 (including connections establishing)
resultsr.lwlock-stats-htab.32.300.300:tps = 193286.066063 (including connections establishing)
resultsr.lwlock-stats-htab.32.300.300:tps = 191832.100877 (including connections establishing)
resultsr.lwlock-stats-htab.32.300.300:tps = 191899.834211 (including connections establishing)
resultsr.master.32.300.300:tps = 197939.111998 (including connections establishing)
resultsr.master.32.300.300:tps = 198641.337720 (including connections establishing)
resultsr.master.32.300.300:tps = 198675.404349 (including connections establishing)

"master" is the master branch, commit 10a82cda67731941c18256e009edad4a784a2994. "lwlock-stats" is the same, but with LWLOCK_STATS defined. "lwlock-stats-htab" is the same, with the attached patch and LWLOCK_STATS defined. The runs were interleaved, but the results are shown here grouped by test run. If we assume that the 189k result is an outlier, then there's probably some regression associated with the lwlock-stats-htab patch, but not a lot. Considering that this code isn't even compiled unless you have LWLOCK_STATS defined, I think that's OK.

This is only part of the solution, of course: a complete solution will involve making the hash table key something other than the lock ID. What I'm thinking we can do is make the lock ID consist of two unsigned 32-bit integers. One of these will be stored in the lwlock itself, which if my calculations are correct won't increase the size of LWLockPadded on any common platforms (a 64-bit integer would). Let's call this the "tranche id". The other will be derived from the LWLock address. Let's call this the "instance ID". We'll keep a table of tranche IDs, which will be assigned consecutively starting with 0. We'll keep an array of metadata for tranches, indexed by tranche ID, and each entry will have three associated pieces of information: an array base, a stride length, and a printable name. When we need to identify an lwlock in the log or to dtrace, we'll fetch the tranche ID from the lwlock itself and use that to index into the tranche metadata array. We'll then take the address of the lwlock, subtract the array base address for the tranche, and divide by the stride length; the result is the instance ID. When reporting to the user, we can report either the tranche ID directly or the associated name for that tranche; in either case, we'll also report the instance ID.
So initially we'll probably just have tranche 0: the main LWLock array. If we move buffer content and I/O locks to the buffer headers, we'll define tranche 1 and tranche 2, both with the same base address (the start of the buffer descriptor array) and the same stride length (the size of a buffer descriptor). One will have the associated name "buffer content lock" and the other "buffer I/O lock". If we want, we can split the main LWLock array into several tranches so that we can more easily identify lock manager locks, predicate lock manager locks, and buffer mapping locks.

I like this system because it's very cheap - we only need a small array of metadata and a couple of memory accesses to name a lock - but it still lets us report data in a way that's actually *more* human-readable than what we have now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
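A compact sketch of the tranche bookkeeping described above, in C. The struct, field, and function names are illustrative (including the assumed tranche field inside LWLock), not the eventually committed API:

    /* Per-tranche metadata, indexed by the tranche id stored in each LWLock. */
    typedef struct LWLockTranche
    {
        const char *name;           /* e.g. "main", "buffer content lock" */
        void       *array_base;     /* start of the array holding these locks */
        Size        array_stride;   /* distance between consecutive locks */
    } LWLockTranche;

    static LWLockTranche *LWLockTrancheArray;   /* hypothetical registry */

    /* Derive a printable identity for a lock from its address alone. */
    static void
    identify_lock(LWLock *lock, const char **name, int *instance)
    {
        LWLockTranche *t = &LWLockTrancheArray[lock->tranche];  /* assumed field */

        *name = t->name;
        *instance = (int) (((char *) lock - (char *) t->array_base) / t->array_stride);
    }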
On 1/6/14, 2:59 PM, Robert Haas wrote: > On Mon, Jan 6, 2014 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Mon, Jan 6, 2014 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> I agree it'd be nicer if we had some better way than mere manual >>>> inspection to enforce proper use of spinlocks; but this change doesn't >>>> seem to me to move the ball downfield by any meaningful distance. >> >>> Well, my thought was mainly that, while it may be a bad idea to take >>> another spinlock while holding a spinlock under any circumstances, >>> somebody might do it and it might appear to work just fine. The most >>> likely sequences seems to me to be something like SpinLockAcquire(...) >>> followed by LWLockConditionalAcquire(), thinking that things are OK >>> because the lock acquisition is conditional - but in fact the >>> conditional acquire still takes the spinlock unconditionally. >> >> The point I'm making is that no such code should get past review, >> whether it's got an obvious performance problem or not. > > Sure, I agree, but we all make mistakes. It's just a judgement call > as to how likely you think it is that someone might make this > particular mistake, a topic upon which opinions may vary. There's been discussions in the past about the overhead added by testing and having more than one level of test capabilitiesso that facilities like the buildfarm can run more expensive testing without burdening developers with that.ISTM this is another case of that (presumably Tom's concern is the performance hit of adding an assert in a criticalpath). If we had a more aggressive form of assert (assert_anal? :)) we could use that here and let the buildfarm bore the bruntof that cost. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Jim Nasby escribió: > On 1/6/14, 2:59 PM, Robert Haas wrote: > >On Mon, Jan 6, 2014 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>The point I'm making is that no such code should get past review, > >>whether it's got an obvious performance problem or not. > > > >Sure, I agree, but we all make mistakes. It's just a judgement call > >as to how likely you think it is that someone might make this > >particular mistake, a topic upon which opinions may vary. I agree that excessive optimism might cause problems in the future. Maybe it makes sense to have such a check #ifdef'ed out on most builds to avoid extra overhead, but not having any check at all just because we trust the review process too much doesn't strike me as the best of ideas. > There's been discussions in the past about the overhead added by testing and having more than one level of test capabilitiesso that facilities like the buildfarm can run more expensive testing without burdening developers with that.ISTM this is another case of that (presumably Tom's concern is the performance hit of adding an assert in a criticalpath). > > If we had a more aggressive form of assert (assert_anal? :)) we could use that here and let the buildfarm bore the bruntof that cost. Sounds good, but let's not enable it blindly on all machines. There are some that are under very high pressure to build and test all the branches, so if we add something too costly on them it might cause trouble. So whatever we do, it should at least have the option to opt-out, or perhaps even make it opt-in. (I'd think opt-out is better, because otherwise very few machines are going to have it enabled at all.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2014-01-06 21:35:22 -0300, Alvaro Herrera wrote: > Jim Nasby escribió: > > On 1/6/14, 2:59 PM, Robert Haas wrote: > > >On Mon, Jan 6, 2014 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > >>The point I'm making is that no such code should get past review, > > >>whether it's got an obvious performance problem or not. > > > > > >Sure, I agree, but we all make mistakes. It's just a judgement call > > >as to how likely you think it is that someone might make this > > >particular mistake, a topic upon which opinions may vary. I don't think it's that unlikely as the previous implementation's rules when viewed while squinting allowed nesting spinlocks. And it's a pretty simple check. > Maybe it makes sense to have such a check #ifdef'ed out on most builds > to avoid extra overhead, but not having any check at all just because we > trust the review process too much doesn't strike me as the best of > ideas. I don't think that check would have relevantly high performance impact in comparison to the rest of --enable-cassert - it's a single process local variable which is regularly accessed. It will just stay in L1 or even registers. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 7, 2014 at 6:54 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> Maybe it makes sense to have such a check #ifdef'ed out on most builds >> to avoid extra overhead, but not having any check at all just because we >> trust the review process too much doesn't strike me as the best of >> ideas. > > I don't think that check would have relevantly high performance impact > in comparison to the rest of --enable-cassert - it's a single process > local variable which is regularly accessed. It will just stay in > L1 or even registers. I tried it out on a 16-core, 64-hwthread community box, PPC. pgbench -S, 5-minute runs at scale factor 300 with shared_buffers=8GB:

resultsr.cassert.32.300.300:tps = 11341.627815 (including connections establishing)
resultsr.cassert.32.300.300:tps = 11339.407976 (including connections establishing)
resultsr.cassert.32.300.300:tps = 11321.339971 (including connections establishing)
resultsr.cassert-spin-multiplex.32.300.300:tps = 11492.927372 (including connections establishing)
resultsr.cassert-spin-multiplex.32.300.300:tps = 11471.810937 (including connections establishing)
resultsr.cassert-spin-multiplex.32.300.300:tps = 11516.471053 (including connections establishing)

By comparison:

resultsr.master.32.300.300:tps = 197939.111998 (including connections establishing)
resultsr.master.32.300.300:tps = 198641.337720 (including connections establishing)
resultsr.master.32.300.300:tps = 198675.404349 (including connections establishing)

So yeah, the overhead is negligible, if not negative. None of the other changes in that patch were even compiled in this case, since all that code is protected by --disable-spinlocks. Maybe somebody can find another workload where the overhead is visible, but here it's not. On the other hand, I did discover another bit of ugliness - the macros actually have to be defined this way:

+#define SpinLockAcquire(lock) \
+	(AssertMacro(!any_spinlock_held), any_spinlock_held = true, \
+	 S_LOCK(lock))
+#define SpinLockRelease(lock) \
+	do { Assert(any_spinlock_held); any_spinlock_held = false; \
+		 S_UNLOCK(lock); } while (0)

SpinLockAcquire has to be a comma-separated expression rather than a do {} while (0) block because S_LOCK returns a value (which is used when compiling with LWLOCK_STATS); SpinLockRelease, however, has to be done the other way because S_UNLOCK() is defined as a do {} while (0) block already on PPC among other architectures. Overall I don't really care enough about this to argue strenuously for it. I'm not nearly so confident as Tom that no errors of this type will creep into future code, but on the other hand, keeping --disable-spinlocks working reliably isn't significant enough for me to want to spend any political capital on it. It's probably got a dim future regardless of what we do here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
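For readers skimming the excerpt above, here is a minimal sketch of the process-local flag those macros rely on; the variable name comes from the excerpt, but the exact declaration is an assumption rather than the patch text:

/*
 * Debugging aid per the macros above: true while this backend holds any
 * spinlock. It is process-local, so it needs no locking or atomics and
 * stays hot in cache, which is why the measured overhead is negligible.
 */
bool		any_spinlock_held = false;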
On Mon, Jan 6, 2014 at 5:50 PM, Robert Haas <robertmhaas@gmail.com> wrote: > This is only part of the solution, of course: a complete solution will > involve making the hash table key something other than the lock ID. > What I'm thinking we can do is making the lock ID consist of two > unsigned 32-bit integers. One of these will be stored in the lwlock > itself, which if my calculations are correct won't increase the size > of LWLockPadded on any common platforms (a 64-bit integer would). > Let's call this the "tranch id". The other will be derived from the > LWLock address. Let's call this the "instance ID". We'll keep a > table of tranch IDs, which will be assigned consecutively starting > with 0. We'll keep an array of metadata for tranches, indexed by > tranch ID, and each entry will have three associated pieces of > information: an array base, a stride length, and a printable name. > When we need to identify an lwlock in the log or to dtrace, we'll > fetch the tranch ID from the lwlock itself and use that to index into > the tranch metadata array. We'll then take the address of the lwlock, > subtract the array base address for the tranch, and divide by the > stride length; the result is the instance ID. When reporting the > user, we can report either the tranch ID directly or the associated > name for that tranch; in either case, we'll also report the instance > ID. > > So initially we'll probably just have tranch 0: the main LWLock array. > If we move buffer content and I/O locks to the buffer headers, we'll > define tranch 1 and tranch 2 with the same base address: the start of > the buffer descriptor array, and the same stride length, the size of a > buffer descriptor. One will have the associated name "buffer content > lock" and the other "buffer I/O lock". If we want, we can define > split the main LWLock array into several tranches so that we can more > easily identify lock manager locks, predicate lock manager locks, and > buffer mapping locks. OK, I've implemented this: here's what I believe to be a complete patch, based on the previous lwlock-pointers.patch but now handling LOCK_DEBUG and TRACE_LWLOCKS and dtrace and a bunch of other loose ends. I think this should be adequate for allowing lwlocks to be stored elsewhere in the main shared memory segment as well as in dynamic shared memory segments. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
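To make the quoted design a bit more concrete, here is a rough sketch of how a lock could be identified under that scheme. The tranche fields (name, array_base, array_stride) match names that appear later in this thread, but the lock's tranche field, this helper, and its exact types are assumptions for illustration, not text from the patch:

typedef struct LWLockTranche
{
	const char *name;			/* printable name, e.g. "buffer content lock" */
	void	   *array_base;		/* address of the first lock in the tranche */
	Size		array_stride;	/* spacing between consecutive locks */
} LWLockTranche;

/*
 * Recover the (tranche id, instance id) pair used to identify a lock in
 * log output or dtrace probes: the tranche id is stored in the lock
 * itself, and the instance id is derived from the lock's address.
 */
static void
identify_lwlock(LWLock *lock, LWLockTranche *tranches,
				int *tranche_id, int *instance_id)
{
	LWLockTranche *t;

	*tranche_id = lock->tranche;	/* assumed field name */
	t = &tranches[*tranche_id];
	*instance_id = (int) (((char *) lock - (char *) t->array_base) /
						  t->array_stride);
}

The attraction of the stride-based scheme is that locks embedded in other structures (such as buffer descriptors) can be identified without storing a name, or anything else beyond the tranche id, in every individual lock.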
(2014/01/11 3:11), Robert Haas wrote: > On Mon, Jan 6, 2014 at 5:50 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> This is only part of the solution, of course: a complete solution will >> involve making the hash table key something other than the lock ID. >> What I'm thinking we can do is making the lock ID consist of two >> unsigned 32-bit integers. One of these will be stored in the lwlock >> itself, which if my calculations are correct won't increase the size >> of LWLockPadded on any common platforms (a 64-bit integer would). >> Let's call this the "tranch id". The other will be derived from the >> LWLock address. Let's call this the "instance ID". We'll keep a >> table of tranch IDs, which will be assigned consecutively starting >> with 0. We'll keep an array of metadata for tranches, indexed by >> tranch ID, and each entry will have three associated pieces of >> information: an array base, a stride length, and a printable name. >> When we need to identify an lwlock in the log or to dtrace, we'll >> fetch the tranch ID from the lwlock itself and use that to index into >> the tranch metadata array. We'll then take the address of the lwlock, >> subtract the array base address for the tranch, and divide by the >> stride length; the result is the instance ID. When reporting the >> user, we can report either the tranch ID directly or the associated >> name for that tranch; in either case, we'll also report the instance >> ID. >> >> So initially we'll probably just have tranch 0: the main LWLock array. >> If we move buffer content and I/O locks to the buffer headers, we'll >> define tranch 1 and tranch 2 with the same base address: the start of >> the buffer descriptor array, and the same stride length, the size of a >> buffer descriptor. One will have the associated name "buffer content >> lock" and the other "buffer I/O lock". If we want, we can define >> split the main LWLock array into several tranches so that we can more >> easily identify lock manager locks, predicate lock manager locks, and >> buffer mapping locks. > > OK, I've implemented this: here's what I believe to be a complete > patch, based on the previous lwlock-pointers.patch but now handling > LOCK_DEBUG and TRACE_LWLOCKS and dtrace and a bunch of other loose > ends. I think this should be adequate for allowing lwlocks to be > stored elsewhere in the main shared memory segment as well as in > dynamic shared memory segments. > > Thoughts? > I briefly checked the patch. Most of lines are mechanical replacement from LWLockId to LWLock *, and compiler didn't claim anything with -Wall -Werror option. My concern is around LWLockTranche mechanism. Isn't it too complicated towards the purpose? For example, it may make sense to add "char lockname[NAMEDATALEN];" at the tail of LWLock structure if LOCK_DEBUG is enabled. Probably, it also adds an argument of LWLockAssign() to gives the human readable name. Is the additional 64bytes (= NAMEDATALEN) per lock too large for recent hardware? Below is minor comments. It seems to me this newer form increased direct reference to the MainLWLockArray. Is it really graceful? My recommendation is to define a macro that wraps the reference to MainLWLockArray. For example: #define PartitionLock(i) \ (&MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + (i)].lock) instead of the following manner. @@ -3400,7 +3406,12 @@ GetLockStatusData(void) * Must grab LWLocks in partition-number order to avoid LWLock deadlock. 
 */
	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
-		LWLockAcquire(FirstLockMgrLock + i, LW_SHARED);
+	{
+		LWLock	   *partitionLock;
+
+		partitionLock = &MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + i].lock;
+		LWLockAcquire(partitionLock, LW_SHARED);
+	}

This chunk is enclosed by #ifdef EXEC_BACKEND, the legacy LWLockArray was not removed.

@@ -5633,6 +5630,7 @@ save_backend_variables(BackendParameters *param, Port *port,
 	param->SpinlockSemaArray = SpinlockSemaArray;
 #endif
 	param->LWLockArray = LWLockArray;
+	param->MainLWLockArray = MainLWLockArray;
 	param->ProcStructLock = ProcStructLock;
 	param->ProcGlobal = ProcGlobal;
 	param->AuxiliaryProcs = AuxiliaryProcs;

Thanks,
--
OSS Promotion Center / The PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
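As a concrete reading of the wrapper macro suggested above, the loop in the first hunk could then be written roughly like this (a sketch of the suggestion, not the committed code):

#define PartitionLock(i) \
	(&MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + (i)].lock)

	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
		LWLockAcquire(PartitionLock(i), LW_SHARED);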
On Mon, Jan 20, 2014 at 11:23 PM, KaiGai Kohei <kaigai@ak.jp.nec.com> wrote: > I briefly checked the patch. Most of lines are mechanical replacement > from LWLockId to LWLock *, and compiler didn't claim anything with > -Wall -Werror option. > > My concern is around LWLockTranche mechanism. Isn't it too complicated > towards the purpose? > For example, it may make sense to add "char lockname[NAMEDATALEN];" at > the tail of LWLock structure if LOCK_DEBUG is enabled. Probably, it > also adds an argument of LWLockAssign() to gives the human readable > name. Is the additional 64bytes (= NAMEDATALEN) per lock too large > for recent hardware? Well, we'd need it when either LOCK_DEBUG was defined or when LWLOCK_STATS was defined or when --enable-dtrace was used, and while the first two are probably rarely enough used that that would be OK, I think the third case is probably fairly common, and I don't think we want to have such a potentially performance-critical difference between builds with and without dtrace. Also... yeah, it's a lot of memory. If we add an additional 64 bytes to the structure, then we're looking at 96 bytes per lwlock instead of 32, after padding out to a 32-byte boundary to avoid crossing cache lines. We need 2 lwlocks per buffer, so that's an additional 128 bytes per 8kB buffer. For shared_buffers = 8GB, that's an extra 128MB for storing lwlocks. I'm not willing to assume nobody cares about that. And while I agree that this is a bit complex, I don't think it's really as bad as all that. We've gotten by for a long time without people being able to put lwlocks in other parts of memory, and most extension authors have gotten by with LWLockAssign() just fine and can continue doing things that way; only users with particularly sophisticated needs should bother to worry about the tranche stuff. One idea I just had is to improve the dsm_toc module so that it can optionally set up a tranche of lwlocks for you, and provide some analogues of RequestAddinLWLocks and LWLockAssign for that case. That would probably make this quite a bit simpler to use, at least for people using it with dynamic shared memory. But I think that's a separate patch. > Below is minor comments. > > It seems to me this newer form increased direct reference to > the MainLWLockArray. Is it really graceful? > My recommendation is to define a macro that wraps the reference to > MainLWLockArray. > > For example: > #define PartitionLock(i) \ > (&MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + (i)].lock) Yeah, that's probably a good idea. > This chunk is enclosed by #ifdef EXEC_BACKEND, the legacy LWLockArray > was not removed. Oops, will fix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
(2014/01/22 1:37), Robert Haas wrote: > On Mon, Jan 20, 2014 at 11:23 PM, KaiGai Kohei <kaigai@ak.jp.nec.com> wrote: >> I briefly checked the patch. Most of lines are mechanical replacement >> from LWLockId to LWLock *, and compiler didn't claim anything with >> -Wall -Werror option. >> >> My concern is around LWLockTranche mechanism. Isn't it too complicated >> towards the purpose? >> For example, it may make sense to add "char lockname[NAMEDATALEN];" at >> the tail of LWLock structure if LOCK_DEBUG is enabled. Probably, it >> also adds an argument of LWLockAssign() to gives the human readable >> name. Is the additional 64bytes (= NAMEDATALEN) per lock too large >> for recent hardware? > > Well, we'd need it when either LOCK_DEBUG was defined or when > LWLOCK_STATS was defined or when --enable-dtrace was used, and while > the first two are probably rarely enough used that that would be OK, I > think the third case is probably fairly common, and I don't think we > want to have such a potentially performance-critical difference > between builds with and without dtrace. > > Also... yeah, it's a lot of memory. If we add an additional 64 bytes > to the structure, then we're looking at 96 bytes per lwlock instead of > 32, after padding out to a 32-byte boundary to avoid crossing cache > lines. We need 2 lwlocks per buffer, so that's an additional 128 > bytes per 8kB buffer. For shared_buffers = 8GB, that's an extra 128MB > for storing lwlocks. I'm not willing to assume nobody cares about > that. And while I agree that this is a bit complex, I don't think > it's really as bad as all that. We've gotten by for a long time > without people being able to put lwlocks in other parts of memory, and > most extension authors have gotten by with LWLockAssign() just fine > and can continue doing things that way; only users with particularly > sophisticated needs should bother to worry about the tranche stuff. > Hmm... 1/64 of main memory (if large buffer system) might not be an ignorable memory consumption. > One idea I just had is to improve the dsm_toc module so that it can > optionally set up a tranche of lwlocks for you, and provide some > analogues of RequestAddinLWLocks and LWLockAssign for that case. That > would probably make this quite a bit simpler to use, at least for > people using it with dynamic shared memory. But I think that's a > separate patch. > I agree with this idea. It seems to me quite natural to keep properties of objects held on shared memory (LWLock) on shared memory. Also, a LWLock once assigned shall not be never released. So, I think dsm_toc can provide a well suitable storage for them. Thanks, -- OSS Promotion Center / The PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>
On 2014-01-10 13:11:32 -0500, Robert Haas wrote: > OK, I've implemented this: here's what I believe to be a complete > patch, based on the previous lwlock-pointers.patch but now handling > LOCK_DEBUG and TRACE_LWLOCKS and dtrace and a bunch of other loose > ends. I think this should be adequate for allowing lwlocks to be > stored elsewhere in the main shared memory segment as well as in > dynamic shared memory segments. > > Thoughts? Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid breaking external code using lwlocks? Requiring code that has worked for several versions to sprinkle version-specific ifdefs seems unnecessary here. I see the comments still claim:

+ * LWLock is between 16 and 32 bytes on all known platforms, so these two
+ * cases are sufficient.
+ */
+#define LWLOCK_PADDED_SIZE	(sizeof(LWLock) <= 16 ? 16 : 32)

I don't think that's true anymore with the addition of the tranche id. The new minimum size seems to be 20 on 32-bit platforms. That could be fixed by making the tranche id a uint8, but I don't think that's necessary, so just removing the support for < 32 seems sufficient. Greetings, Andres Freund
Andres Freund <andres@2ndquadrant.com> writes: > Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid > breaking external code using lwlocks? +1, in fact there's probably no reason to touch most *internal* code using that type name either. regards, tom lane
On Wed, Jan 22, 2014 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@2ndquadrant.com> writes: >> Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid >> breaking external code using lwlocks? > > +1, in fact there's probably no reason to touch most *internal* code using > that type name either. I thought about this but figured it was too much of a misnomer to refer to a pointer as an ID. But, if we're sure we want to go that route, I can go revise the patch along those lines. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-01-22 12:40:34 -0500, Robert Haas wrote: > On Wed, Jan 22, 2014 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Andres Freund <andres@2ndquadrant.com> writes: > >> Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid > >> breaking external code using lwlocks? > > > > +1, in fact there's probably no reason to touch most *internal* code using > > that type name either. > > I thought about this but figured it was too much of a misnomer to > refer to a pointer as an ID. But, if we're sure we want to go that > route, I can go revise the patch along those lines. I personally don't care either way for internal code as long as external code continues to work. There's the argument of making the commit better readable by having less noise and less divergence in the branches and there's your argument of that being less clear. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Isn't it necessary to have an interface to initialize LWLock structure being allocated on a dynamic shared memory segment? Even though LWLock structure is exposed at lwlock.h, we have no common way to initialize it. How about having the following function?

void
InitLWLock(LWLock *lock)
{
	SpinLockInit(&lock->lock.mutex);
	lock->lock.releaseOK = true;
	lock->lock.exclusive = 0;
	lock->lock.shared = 0;
	lock->lock.head = NULL;
	lock->lock.tail = NULL;
}

Thanks, 2014/1/22 KaiGai Kohei <kaigai@ak.jp.nec.com>: > (2014/01/22 1:37), Robert Haas wrote: >> >> On Mon, Jan 20, 2014 at 11:23 PM, KaiGai Kohei <kaigai@ak.jp.nec.com> >> wrote: >>> >>> I briefly checked the patch. Most of lines are mechanical replacement >>> from LWLockId to LWLock *, and compiler didn't claim anything with >>> -Wall -Werror option. >>> >>> My concern is around LWLockTranche mechanism. Isn't it too complicated >>> towards the purpose? >>> For example, it may make sense to add "char lockname[NAMEDATALEN];" at >>> the tail of LWLock structure if LOCK_DEBUG is enabled. Probably, it >>> also adds an argument of LWLockAssign() to gives the human readable >>> name. Is the additional 64bytes (= NAMEDATALEN) per lock too large >>> for recent hardware? >> >> >> Well, we'd need it when either LOCK_DEBUG was defined or when >> LWLOCK_STATS was defined or when --enable-dtrace was used, and while >> the first two are probably rarely enough used that that would be OK, I >> think the third case is probably fairly common, and I don't think we >> want to have such a potentially performance-critical difference >> between builds with and without dtrace. >> >> Also... yeah, it's a lot of memory. If we add an additional 64 bytes >> to the structure, then we're looking at 96 bytes per lwlock instead of >> 32, after padding out to a 32-byte boundary to avoid crossing cache >> lines. We need 2 lwlocks per buffer, so that's an additional 128 >> bytes per 8kB buffer. For shared_buffers = 8GB, that's an extra 128MB >> for storing lwlocks. I'm not willing to assume nobody cares about >> that. And while I agree that this is a bit complex, I don't think >> it's really as bad as all that. We've gotten by for a long time >> without people being able to put lwlocks in other parts of memory, and >> most extension authors have gotten by with LWLockAssign() just fine >> and can continue doing things that way; only users with particularly >> sophisticated needs should bother to worry about the tranche stuff. >> > Hmm... 1/64 of main memory (if large buffer system) might not be > an ignorable memory consumption. > > >> One idea I just had is to improve the dsm_toc module so that it can >> optionally set up a tranche of lwlocks for you, and provide some >> analogues of RequestAddinLWLocks and LWLockAssign for that case. That >> would probably make this quite a bit simpler to use, at least for >> people using it with dynamic shared memory. But I think that's a >> separate patch. >> > I agree with this idea. It seems to me quite natural to keep properties > of objects held on shared memory (LWLock) on shared memory. > Also, a LWLock once assigned shall not be never released. So, I think > dsm_toc can provide a well suitable storage for them. > > > Thanks, > -- > OSS Promotion Center / The PG-Strom Project > KaiGai Kohei <kaigai@ak.jp.nec.com> -- KaiGai Kohei <kaigai@kaigai.gr.jp>
On 2014-01-23 23:03:40 +0900, Kohei KaiGai wrote: > Isn't it necessary to have an interface to initialize LWLock structure being > allocated on a dynamic shared memory segment? > Even though LWLock structure is exposed at lwlock.h, we have no common > way to initialize it. There's LWLockInitialize() or similar in the patch afair. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
2014/1/23 Andres Freund <andres@2ndquadrant.com>: > On 2014-01-23 23:03:40 +0900, Kohei KaiGai wrote: >> Isn't it necessary to have an interface to initialize LWLock structure being >> allocated on a dynamic shared memory segment? >> Even though LWLock structure is exposed at lwlock.h, we have no common >> way to initialize it. > > There's LWLockInitialize() or similar in the patch afair. > Ahh, I overlooked the code around the tranche id. Sorry for this noise. Thanks, -- KaiGai Kohei <kaigai@kaigai.gr.jp>
On Wed, Jan 22, 2014 at 12:42 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-01-22 12:40:34 -0500, Robert Haas wrote: >> On Wed, Jan 22, 2014 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> > Andres Freund <andres@2ndquadrant.com> writes: >> >> Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid >> >> breaking external code using lwlocks? >> > >> > +1, in fact there's probably no reason to touch most *internal* code using >> > that type name either. >> >> I thought about this but figured it was too much of a misnomer to >> refer to a pointer as an ID. But, if we're sure we want to go that >> route, I can go revise the patch along those lines. > > I personally don't care either way for internal code as long as external > code continues to work. There's the argument of making the commit better > readable by having less noise and less divergence in the branches and > there's your argument of that being less clear. OK, well then, if no one objects violently, I'll stick my current approach of getting rid of all core mentions of LWLockId in favor of LWLock *, but also add typedef LWLock *LWLockId with a comment that this is to minimize breakage of third-party code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 23, 2014 at 11:10 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jan 22, 2014 at 12:42 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> On 2014-01-22 12:40:34 -0500, Robert Haas wrote: >>> On Wed, Jan 22, 2014 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> > Andres Freund <andres@2ndquadrant.com> writes: >>> >> Shouldn't we introduce a typedef LWLock* LWLockid; or something to avoid >>> >> breaking external code using lwlocks? >>> > >>> > +1, in fact there's probably no reason to touch most *internal* code using >>> > that type name either. >>> >>> I thought about this but figured it was too much of a misnomer to >>> refer to a pointer as an ID. But, if we're sure we want to go that >>> route, I can go revise the patch along those lines. >> >> I personally don't care either way for internal code as long as external >> code continues to work. There's the argument of making the commit better >> readable by having less noise and less divergence in the branches and >> there's your argument of that being less clear. > > OK, well then, if no one objects violently, I'll stick my current > approach of getting rid of all core mentions of LWLockId in favor of > LWLock *, but also add typedef LWLock *LWLockId with a comment that > this is to minimize breakage of third-party code. Hearing no objections, violent or otherwise, I've done that, made the other adjustments suggested by Andres and KaiGai, and committed this. Let's see what the buildfarm thinks... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 21, 2014 at 11:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > One idea I just had is to improve the dsm_toc module so that it can > optionally set up a tranche of lwlocks for you, and provide some > analogues of RequestAddinLWLocks and LWLockAssign for that case. That > would probably make this quite a bit simpler to use, at least for > people using it with dynamic shared memory. But I think that's a > separate patch. I played with this a bit today and it doesn't actually seem to simplify things very much. The backend that creates the DSM needs to do this:

lwlocks = shm_toc_allocate(toc, sizeof(LWLockPadded) * nlwlocks);
for (i = 0; i < nlwlocks; ++i)
	LWLockInitialize(&lwlocks[i].lock, tranche_id);

Since that's all of three lines, encapsulating it doesn't look all that helpful. Then each backend needs to do something like this:

static LWLockTranche mytranche;
mytranche.name = "some descriptive module name";
mytranche.array_base = lwlocks;
mytranche.array_stride = sizeof(LWLockPadded);
LWLockRegisterTranche(tranche_id, &mytranche);

That's not a lot of code either, and there's no obvious way to reduce it much by hooking into shm_toc. I thought maybe we needed some cleanup when the dynamic shared memory segment is unmapped, but looks like we really don't. One can do something like LWLockRegisterTranche(tranche_id, NULL) for debugging clarity, but it isn't strictly needed; one can release all lwlocks from the tranche or assert that none are held, but that should really only be a problem if the user does something like LWLockAcquire(lock_in_the_segment); dsm_detach(seg), because error recovery starts by releasing *all* lwlocks. And if the user does it explicitly, I'm kinda OK with that just seg faulting. After all, the user could equally well have done dsm_detach(seg); LWLockAcquire(lock_in_the_segment) and there's no way at all to cross-check for that sort of mistake.

I do see one thing about the status quo that does look reasonably annoying: the code expects that tranche IDs are going to stay relatively small. For example, consider a module foobar that uses DSM + LWLocks. It won't do at all for each backend, on first use of the foobar module, to do LWLockNewTrancheId() and then reuse that tranche_id repeatedly for each new DSM - because over a system lifetime of months, tranche IDs could grow into the millions, causing LWLockTrancheArray to get really big (and eventually repalloc will fail). Rather, the programmer needs to ensure that LWLockNewTrancheId() gets called *once per postmaster lifetime*, perhaps by allocating a chunk of permanent shared memory and using that to store the tranche_id that should be used each time an individual backend fires up a DSM. Considering that one of the goals of dynamic shared memory is to allow modules to be loaded after postmaster startup and still be able to do useful stuff, that's kind of obnoxious. I don't have a good idea what to do about it, though; making LWLockTrancheArray anything other than a flat array looks likely to slow down --enable-dtrace builds unacceptably.

Bottom line, I guess, is that I don't currently have any real idea how to make this any better than it already is. Maybe that will become more clear as this facility (hopefully) acquires some users. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
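Here is a minimal sketch of the "once per postmaster lifetime" pattern described above, stashing the tranche id in a small piece of permanent shared memory so that every backend and every new DSM segment reuses it. The module name, function, and initialization details are hypothetical:

/*
 * Hypothetical module "foobar": allocate its tranche id once per
 * postmaster lifetime and hand the same id back to every caller.
 * Real code would serialize the first-time initialization, e.g. by
 * doing it from a shmem_startup hook or under a suitable lock.
 */
static int
foobar_tranche_id(void)
{
	bool		found;
	int		   *tranche_id_ptr;

	tranche_id_ptr = (int *) ShmemInitStruct("foobar tranche id",
											 sizeof(int), &found);
	if (!found)
		*tranche_id_ptr = LWLockNewTrancheId();

	return *tranche_id_ptr;
}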
2014-02-08 4:52 GMT+09:00 Robert Haas <robertmhaas@gmail.com>: > On Tue, Jan 21, 2014 at 11:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> One idea I just had is to improve the dsm_toc module so that it can >> optionally set up a tranche of lwlocks for you, and provide some >> analogues of RequestAddinLWLocks and LWLockAssign for that case. That >> would probably make this quite a bit simpler to use, at least for >> people using it with dynamic shared memory. But I think that's a >> separate patch. > > I played with this a bit today and it doesn't actually seem to > simplify things very much. The backend that creates the DSM needs to > do this: > > lwlocks = shm_toc_allocate(toc, sizeof(LWLockPadded) * nlwlocks); > for (i = 0; i < nlwlocks; ++i) > LWLockInitialize(&lwlocks[i].lock, tranche_id); > > Since that's all of three lines, encapsulating it doesn't look all > that helpful. Then each backend needs to do something like this: > > static LWLockTranche mytranche; > mytranche.name = "some descriptive module name"; > mytranche.array_base = lwlocks; > mytranche.array_stride = sizeof(LWLockPadded); > LWLockRegisterTranche(tranche_id, &mytranche); > > That's not a lot of code either, and there's no obvious way to reduce > it much by hooking into shm_toc. > > I thought maybe we needed some cleanup when the dynamic shared memory > segment is unmapped, but looks like we really don't. One can do > something like LWLockRegisterTranche(tranche_id, NULL) for debugging > clarity, but it isn't strictly needed; one can release all lwlocks > from the tranche or assert that none are held, but that should really > only be a problem if the user does something like > LWLockAcquire(lock_in_the_segment); dsm_detach(seg), because error > recovery starts by releasing *all* lwlocks. And if the user does it > explicitly, I'm kinda OK with that just seg faulting. After all, the > user could equally well have done dsm_detach(seg); > LWLockAcquire(lock_in_the_segment) and there's no way at all to > cross-check for that sort of mistake. > Does it make another problem if dsm_detach() also releases lwlocks being allocated on the dsm segment to be released? Lwlocks being held is tracked in the held_lwlocks[] array; its length is usually 100. In case when dsm_detach() is called towards the segment between addr ~ (addr + length), it seems to me an easy job to walk on this array to find out lwlocks on the range. > I do see one thing about the status quo that does look reasonably > annoying: the code expects that tranche IDs are going to stay > relatively small. For example, consider a module foobar that uses DSM > + LWLocks. It won't do at all for each backend, on first use of the > foobar module, to do LWLockNewTrancheId() and then reuse that > tranche_id repeatedly for each new DSM - because over a system > lifetime of months, tranche IDs could grow into the millions, causing > LWLockTrancheArray to get really big (and eventually repalloc will > fail). Rather, the programmer needs to ensure that > LWLockNewTrancheId() gets called *once per postmaster lifetime*, > perhaps by allocating a chunk of permanent shared memory and using > that to store the tranche_id that should be used each time an > individual backend fires up a DSM. Considering that one of the goals > of dynamic shared memory is to allow modules to be loaded after > postmaster startup and still be able to do useful stuff, that's kind > of obnoxious. 
I don't have a good idea what to do about it, though; > making LWLockTrancheArray anything other than a flat array looks > likely to slow down --enable-dtrace builds unacceptably. > Just my rough idea. Doesn't it make sense to have an offset from the head of DSM segment that contains the lwlock, instead of the identifier form, to indicate a cstring datum? It does not prevent to allocate it later; after the postmaster starting up, and here is no problem if number of dynamic lwlocks grows millions or more. If lwlock has a "tranche_offset", not "tranche_id", all it needs to do is: 1. find out either of DSM segment or regular SHM segment that contains the supplied lwlock. 2. Calculate mapped_address + tranche_offset; that is the local location where cstring form is put. Probably, it needs a common utility routine on dsm.c to find out a particular DSM segment that contains the supplied address. (Inverse function towards dsm_segment_address) How about my ideas? Thanks, -- KaiGai Kohei <kaigai@kaigai.gr.jp>
On Mon, Feb 10, 2014 at 7:17 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote: > Does it make another problem if dsm_detach() also releases lwlocks > being allocated on the dsm segment to be released? > Lwlocks being held is tracked in the held_lwlocks[] array; its length is > usually 100. In case when dsm_detach() is called towards the segment > between addr ~ (addr + length), it seems to me an easy job to walk on > this array to find out lwlocks on the range. Oh, sure, it could be done. I just don't see the point. If you're explicitly detaching a shared memory segment while still holding lwlocks within that segment, your code is seriously buggy. Making dsm_detach() silently release those locks would probably just be hiding a bug that you'd really rather find out about. > Just my rough idea. Doesn't it make sense to have an offset from the > head of DSM segment that contains the lwlock, instead of the identifier > form, to indicate a cstring datum? It does not prevent to allocate it > later; after the postmaster starting up, and here is no problem if number > of dynamic lwlocks grows millions or more. > If lwlock has a "tranche_offset", not "tranche_id", all it needs to do is: > 1. find out either of DSM segment or regular SHM segment that contains > the supplied lwlock. > 2. Calculate mapped_address + tranche_offset; that is the local location > where cstring form is put. Well, if that offset is 8 bytes, it's going to make the LWLock structure grow past 32 bytes on common platforms, which I do not want to do. If it's 4 bytes, sure, that'll work, but now you've gotta make sure that string gets into the 4GB of the relevant shared memory segment, which sounds like an annoying requirement. How would you satisfy it with respect to the main shared memory segment, for example? I mean, one way to handle this problem is to put a hash table in the main shared memory segment with tranche ID -> name mappings. Backends can suck mappings out of there and cache them locally as needed. But that's a lot of mechanism for a facility with no known users. > Probably, it needs a common utility routine on dsm.c to find out > a particular DSM segment that contains the supplied address. > (Inverse function towards dsm_segment_address) Yeah, I've thought of that. It may be useful enough to be worth doing, whether we use it for this or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
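For illustration only, a rough sketch of the "hash table in the main shared memory segment" idea mentioned above; every name here (TrancheNameEntry, shared_tranche_names, tranche_name_lock) is invented, and, as noted, this is more mechanism than the facility currently has users to justify:

/* Hypothetical shared hash table entry: maps a tranche id to its name. */
typedef struct TrancheNameEntry
{
	int			tranche_id;			/* hash key */
	char		name[NAMEDATALEN];	/* e.g. "buffer content lock" */
} TrancheNameEntry;

static const char *
tranche_name_lookup(HTAB *shared_tranche_names, LWLock *tranche_name_lock,
					int tranche_id)
{
	TrancheNameEntry *entry;
	const char *result = "unknown tranche";

	LWLockAcquire(tranche_name_lock, LW_SHARED);
	entry = (TrancheNameEntry *)
		hash_search(shared_tranche_names, &tranche_id, HASH_FIND, NULL);
	if (entry != NULL)
		result = entry->name;	/* entry lives in shared memory */
	LWLockRelease(tranche_name_lock);

	/* backends could cache the result locally to avoid repeat lookups */
	return result;
}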