Thread: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
Hi,

During the lwlock scalability work I noticed a longstanding issue with the lwlock code. LWLockRelease() and the other mentioned locations do the following to wake up any waiters, without holding the lock's spinlock:

    /*
     * Awaken any waiters I removed from the queue.
     */
    while (head != NULL)
    {
        LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
        proc = head;
        head = proc->lwWaitLink;
        proc->lwWaitLink = NULL;
        proc->lwWaiting = false;
        PGSemaphoreUnlock(&proc->sem);
    }

which means they manipulate the lwWaitLink queue without protection. That's done intentionally. The code tries to protect against corruption of the list due to a woken-up backend acquiring a lock (this or an independent one) by only continuing when the lwWaiting flag is set to false. Unfortunately there's absolutely no guarantee that a) the assignments to lwWaitLink and lwWaiting are done in that order, b) the stores are done in-order from the POV of other backends.

So what we need to do is to acquire a write barrier between the assignments to lwWaitLink and lwWaiting, i.e.

    proc->lwWaitLink = NULL;
    pg_write_barrier();
    proc->lwWaiting = false;

The reader side already uses an implicit barrier by using spinlocks.

I've fixed this as part 1 of the lwlock scalability work in [1], but Heikki rightfully suggested that a) this should be backpatched, b) done in a separate commit.

There is the question of what to do about the branches without barriers. I guess a SpinLockAcquire()/Release() would do? Or do we decide it's not important enough to matter, since it's not an issue on x86?

[1] http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commitdiff;h=2de11eb11bb3e3777f6d384de0af9c2f77960637

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
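For illustration, the full wakeup loop with the proposed barrier in place would look roughly like this (a sketch of the suggestion above, not committed code; variable names follow the snippet quoted in the message):

    /*
     * Sketch only: the LWLockRelease() wakeup loop with the proposed write
     * barrier.  The barrier ensures the lwWaitLink store is visible to other
     * backends before lwWaiting is cleared, since clearing lwWaiting is what
     * allows the woken backend to proceed.
     */
    while (head != NULL)
    {
        proc = head;
        head = proc->lwWaitLink;
        proc->lwWaitLink = NULL;
        pg_write_barrier();     /* order the link store before the flag store */
        proc->lwWaiting = false;
        PGSemaphoreUnlock(&proc->sem);
    }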
Andres Freund <andres@2ndquadrant.com> writes:
> So what we need to do is to acquire a write barrier between the
> assignments to lwWaitLink and lwWaiting, i.e.
>     proc->lwWaitLink = NULL;
>     pg_write_barrier();
>     proc->lwWaiting = false;

You didn't really explain why you think that ordering is necessary? Each proc being awoken will surely see both fields updated, and other procs won't be examining these fields at all, since we already delinked all these procs from the LWLock's queue.

> There is the question what to do about the branches without barriers? I
> guess a SpinLockAcquire()/Release() would do? Or do we decide it's not
> important enough to matter, since it's not an issue on x86?

Given the lack of trouble reports that could be traced to this, I don't feel a need to worry about it in branches that don't have any barrier support. But in any case, I'm not convinced there's a bug here at all.

regards, tom lane
I wrote:
> You didn't really explain why you think that ordering is necessary?

Actually, after grepping to check my memory of what those fields are being used for, I have a bigger question: WTF is xlog.c doing being so friendly with the innards of LWLocks? Surely this needs to get refactored so that most of WakeupWaiters() and friends is in lwlock.c. Or has all notion of modularity gone out the window while I wasn't looking?

regards, tom lane
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > So what we need to do is to acquire a write barrier between the
> > assignments to lwWaitLink and lwWaiting, i.e.
> >     proc->lwWaitLink = NULL;
> >     pg_write_barrier();
> >     proc->lwWaiting = false;
>
> You didn't really explain why you think that ordering is necessary?
> Each proc being awoken will surely see both fields updated, and other
> procs won't be examining these fields at all, since we already delinked
> all these procs from the LWLock's queue.

The problem is that one of the released backends could wake up concurrently because of an unrelated, or previous, PGSemaphoreUnlock(). It could see lwWaiting = false, and thus wake up and acquire the lock, even if the store for lwWaitLink hasn't arrived (or been performed; there's no guaranteed ordering here) yet.

Now, it may well be that there's no practical consequence of that, but I am not prepared to bet on it.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-10 11:20:30 -0500, Tom Lane wrote:
> I wrote:
> > You didn't really explain why you think that ordering is necessary?
>
> Actually, after grepping to check my memory of what those fields are
> being used for, I have a bigger question: WTF is xlog.c doing being
> so friendly with the innards of LWLocks? Surely this needs to get
> refactored so that most of WakeupWaiters() and friends is in lwlock.c.
> Or has all notion of modularity gone out the window while I wasn't
> looking?

Well, it's not actually using any lwlock.c code, it's special-case locking logic, just reusing the datastructures. That said, I am not particularly happy about the amount of code it's duplicating from lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of WALInsertSlotAcquireOne() is copied.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Heikki Linnakangas
Date:
On 02/10/2014 06:41 PM, Andres Freund wrote: > On 2014-02-10 11:20:30 -0500, Tom Lane wrote: >> I wrote: >>> You didn't really explain why you think that ordering is necessary? >> >> Actually, after grepping to check my memory of what those fields are >> being used for, I have a bigger question: WTF is xlog.c doing being >> so friendly with the innards of LWLocks? Surely this needs to get >> refactored so that most of WakeupWaiters() and friends is in lwlock.c. >> Or has all notion of modularity gone out the window while I wasn't >> looking? > > Well, it's not actually using any lwlock.c code, it's a special case > locking logic, just reusing the datastructures. That said, I am not > particularly happy about the amount of code it's duplicating from > lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of > WALInsertSlotAcquireOne() is a copied. I'm not too happy with the amount of copy-paste myself, but there was enough difference to regular lwlocks that I didn't want to bother all lwlocks with the xlog-specific stuff either. The WAL insert slots do share the LWLock-related PGPROC fields though, and semaphore. I'm all ears if you have ideas on that.. - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 02/10/2014 06:41 PM, Andres Freund wrote: >> Well, it's not actually using any lwlock.c code, it's a special case >> locking logic, just reusing the datastructures. That said, I am not >> particularly happy about the amount of code it's duplicating from >> lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of >> WALInsertSlotAcquireOne() is a copied. > I'm not too happy with the amount of copy-paste myself, but there was > enough difference to regular lwlocks that I didn't want to bother all > lwlocks with the xlog-specific stuff either. The WAL insert slots do > share the LWLock-related PGPROC fields though, and semaphore. I'm all > ears if you have ideas on that.. I agree that if the behavior is considerably different, we don't really want to try to make LWLockAcquire/Release cater to both this and their standard behavior. But why not add some additional functions in lwlock.c that do what xlog wants? If we're going to have mostly-copy-pasted logic, it'd at least be better if it was in the same file, and not somewhere that's not even in the same major subtree. Also, the reason that LWLock isn't an abstract struct is because we wanted to be able to embed it in other structs; *not* as license for other modules to fool with its contents. If we were working in C++ we'd certainly have made all its fields private. regards, tom lane
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Heikki Linnakangas
Date:
On 02/10/2014 03:46 PM, Andres Freund wrote:
> Hi,
>
> During the lwlock scalability work I noticed a longstanding issue with
> the lwlock code. LWLockRelease() and the other mentioned locations do
> the following to wake up any waiters, without holding the lock's
> spinlock:
>
>     /*
>      * Awaken any waiters I removed from the queue.
>      */
>     while (head != NULL)
>     {
>         LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
>         proc = head;
>         head = proc->lwWaitLink;
>         proc->lwWaitLink = NULL;
>         proc->lwWaiting = false;
>         PGSemaphoreUnlock(&proc->sem);
>     }
>
> which means they manipulate the lwWaitLink queue without
> protection. That's done intentionally. The code tries to protect against
> corruption of the list due to a woken-up backend acquiring a lock (this
> or an independent one) by only continuing when the lwWaiting flag is set
> to false. Unfortunately there's absolutely no guarantee that a) the
> assignments to lwWaitLink and lwWaiting are done in that order, b) the
> stores are done in-order from the POV of other backends.
> So what we need to do is to acquire a write barrier between the
> assignments to lwWaitLink and lwWaiting, i.e.
>     proc->lwWaitLink = NULL;
>     pg_write_barrier();
>     proc->lwWaiting = false;
> the reader side already uses an implicit barrier by using spinlocks.
>
> I've fixed this as part 1 of the lwlock scalability work in [1], but
> Heikki rightfully suggested that a) this should be backpatched, b) done
> in a separate commit.
>
> There is the question what to do about the branches without barriers? I
> guess a SpinLockAcquire()/Release() would do?

The other alternative we discussed on IM is to unlink the waiters from the linked list, while still holding the spinlock. We can't do the PGSemaphoreUnlock() call to actually wake up the waiters while holding the spinlock, but all the shared memory manipulations we could. It would move all the modifications of the shared structures under the spinlock, which feels comforting.

It would require using some sort of backend-private data structure to hold the list of processes to wake up. We don't want to palloc() in LWLockRelease(), but we could malloc() a large-enough array once at process initialization, and use that on all LWLockRelease() calls. A rough sketch of this alternative follows below.

- Heikki
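A rough sketch of that alternative, for illustration only (field and variable names are assumed; the real code would differ):

    /* Backend-private buffer, malloc()'d once at process init (sketch). */
    static PGPROC **wakeup_procs = NULL;    /* assumed sized for MaxBackends */
    int         nwakeup = 0;
    int         i;

    SpinLockAcquire(&lock->mutex);

    /* ... choose the waiters to release, as LWLockRelease does today ... */

    while (head != NULL)
    {
        PGPROC     *proc = head;

        head = proc->lwWaitLink;
        /* all shared-memory updates happen while the spinlock is held */
        proc->lwWaitLink = NULL;
        proc->lwWaiting = false;
        wakeup_procs[nwakeup++] = proc;
    }

    SpinLockRelease(&lock->mutex);

    /* semaphore operations may block, so do them after dropping the spinlock */
    for (i = 0; i < nwakeup; i++)
        PGSemaphoreUnlock(&wakeup_procs[i]->sem);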
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Heikki Linnakangas
Date:
On 02/10/2014 08:03 PM, Tom Lane wrote: > Heikki Linnakangas <hlinnakangas@vmware.com> writes: >> On 02/10/2014 06:41 PM, Andres Freund wrote: >>> Well, it's not actually using any lwlock.c code, it's a special case >>> locking logic, just reusing the datastructures. That said, I am not >>> particularly happy about the amount of code it's duplicating from >>> lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of >>> WALInsertSlotAcquireOne() is a copied. > >> I'm not too happy with the amount of copy-paste myself, but there was >> enough difference to regular lwlocks that I didn't want to bother all >> lwlocks with the xlog-specific stuff either. The WAL insert slots do >> share the LWLock-related PGPROC fields though, and semaphore. I'm all >> ears if you have ideas on that.. > > I agree that if the behavior is considerably different, we don't really > want to try to make LWLockAcquire/Release cater to both this and their > standard behavior. But why not add some additional functions in lwlock.c > that do what xlog wants? If we're going to have mostly-copy-pasted logic, > it'd at least be better if it was in the same file, and not somewhere > that's not even in the same major subtree. Ok, I'll try to refactor it that way, so that we can see if it looks better. > Also, the reason that LWLock isn't an abstract struct is because we wanted > to be able to embed it in other structs; *not* as license for other > modules to fool with its contents. If we were working in C++ we'd > certainly have made all its fields private. Um, xlog.c is doing no such thing. The insertion slots use a struct of their own, which is also copy-pasted from LWLock (with one additional field). - Heikki
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-10 19:48:47 +0200, Heikki Linnakangas wrote: > On 02/10/2014 06:41 PM, Andres Freund wrote: > >On 2014-02-10 11:20:30 -0500, Tom Lane wrote: > >>I wrote: > >>>You didn't really explain why you think that ordering is necessary? > >> > >>Actually, after grepping to check my memory of what those fields are > >>being used for, I have a bigger question: WTF is xlog.c doing being > >>so friendly with the innards of LWLocks? Surely this needs to get > >>refactored so that most of WakeupWaiters() and friends is in lwlock.c. > >>Or has all notion of modularity gone out the window while I wasn't > >>looking? > > > >Well, it's not actually using any lwlock.c code, it's a special case > >locking logic, just reusing the datastructures. That said, I am not > >particularly happy about the amount of code it's duplicating from > >lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of > >WALInsertSlotAcquireOne() is a copied. > > I'm not too happy with the amount of copy-paste myself, but there was enough > difference to regular lwlocks that I didn't want to bother all lwlocks with > the xlog-specific stuff either. The WAL insert slots do share the > LWLock-related PGPROC fields though, and semaphore. I'm all ears if you have > ideas on that.. The lwlock scalability stuff has separated out the enqueue/wakeup code, that probably should work here as well? And that's a fair portion of the code. I think it should be doable to make that generic enough that the actual difference of the struct doesn't matter. It'd also reduce duplication of LWLockAcquire, ConditionalAcquire, OrWait. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
From: "Andres Freund" <andres@2ndquadrant.com>
> which means they manipulate the lwWaitLink queue without
> protection. That's done intentionally. The code tries to protect against
> corruption of the list due to a woken-up backend acquiring a lock (this
> or an independent one) by only continuing when the lwWaiting flag is set
> to false. Unfortunately there's absolutely no guarantee that a) the
> assignments to lwWaitLink and lwWaiting are done in that order, b) the
> stores are done in-order from the POV of other backends.
> So what we need to do is to acquire a write barrier between the
> assignments to lwWaitLink and lwWaiting, i.e.
>     proc->lwWaitLink = NULL;
>     pg_write_barrier();
>     proc->lwWaiting = false;
> the reader side already uses an implicit barrier by using spinlocks.

I've got a report from one customer that they encountered a hang during performance benchmarking. They were using PostgreSQL 9.2.4. I remember that the stack trace showed many backends blocked forever at LWLockAcquire() during a btree insert operation. I'm not sure this has something to do with what you are raising, but the release notes for 9.2.5/6 don't suggest any fixes for this. So I felt there is something wrong with lwlocks.

Do you think that the issue you are raising could cause my customer's problem -- backends blocking at an lwlock forever?

Regards
MauMau
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-11 21:46:04 +0900, MauMau wrote:
> From: "Andres Freund" <andres@2ndquadrant.com>
> > which means they manipulate the lwWaitLink queue without
> > protection. That's done intentionally. The code tries to protect against
> > corruption of the list due to a woken-up backend acquiring a lock (this
> > or an independent one) by only continuing when the lwWaiting flag is set
> > to false. Unfortunately there's absolutely no guarantee that a) the
> > assignments to lwWaitLink and lwWaiting are done in that order, b) the
> > stores are done in-order from the POV of other backends.
> > So what we need to do is to acquire a write barrier between the
> > assignments to lwWaitLink and lwWaiting, i.e.
> >     proc->lwWaitLink = NULL;
> >     pg_write_barrier();
> >     proc->lwWaiting = false;
> > the reader side already uses an implicit barrier by using spinlocks.
>
> I've got a report from one customer that they encountered a hang during
> performance benchmarking. They were using PostgreSQL 9.2.4. I remember
> that the stack trace showed many backends blocked forever at LWLockAcquire()
> during a btree insert operation. I'm not sure this has something to do with
> what you are raising, but the release notes for 9.2.5/6 don't suggest any
> fixes for this. So I felt there is something wrong with lwlocks.
>
> Do you think that the issue you are raising could cause my customer's
> problem -- backends blocking at an lwlock forever?

It's x86, right? Then it's unlikely to be actual unordered memory accesses, but if the compiler reordered:

    LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
    proc = head;
    head = proc->lwWaitLink;
    proc->lwWaitLink = NULL;
    proc->lwWaiting = false;
    PGSemaphoreUnlock(&proc->sem);

to

    LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
    proc = head;
    proc->lwWaiting = false;
    head = proc->lwWaitLink;
    proc->lwWaitLink = NULL;
    PGSemaphoreUnlock(&proc->sem);

which it is permitted to do, yes, that could cause symptoms like you describe.

Any chance you have the binaries the customer ran back then around? Disassembling that piece of code might give you a hint whether that's a possible cause.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: "Andres Freund" <andres@2ndquadrant.com>
> It's x86, right? Then it's unlikely to be actual unordered memory
> accesses, but if the compiler reordered:
>     LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
>     proc = head;
>     head = proc->lwWaitLink;
>     proc->lwWaitLink = NULL;
>     proc->lwWaiting = false;
>     PGSemaphoreUnlock(&proc->sem);
> to
>     LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
>     proc = head;
>     proc->lwWaiting = false;
>     head = proc->lwWaitLink;
>     proc->lwWaitLink = NULL;
>     PGSemaphoreUnlock(&proc->sem);
> which it is permitted to do, yes, that could cause symptoms like you
> describe.

Yes, the hang occurred with 64-bit PostgreSQL 9.2.4 running on RHEL6 for x86_64. The PostgreSQL was built with GCC.

> Any chance you have the binaries the customer ran back then around?
> Disassembling that piece of code might give you a hint whether that's a
> possible cause.

I'm sorry I can't provide the module, but I attached the disassembled code for LWLockRelease and LWLockAcquire in the executable. I'm not sure this proves something.

FYI, the following stack traces are the ones obtained during two instances of hang.

#0  0x00000036102eaf77 in semop () from /lib64/libc.so.6
#1  0x0000000000614707 in PGSemaphoreLock ()
#2  0x0000000000659d5b in LWLockAcquire ()
#3  0x000000000047983d in RelationGetBufferForTuple ()
#4  0x0000000000477f86 in heap_insert ()
#5  0x00000000005a4a12 in ExecModifyTable ()
#6  0x000000000058d928 in ExecProcNode ()
#7  0x000000000058c762 in standard_ExecutorRun ()
#8  0x00007f0cb37f99cb in pgss_ExecutorRun () from /opt/symfoserver64/lib/pg_stat_statements.so
#9  0x00007f0cb357f545 in explain_ExecutorRun () from /opt/symfoserver64/lib/auto_explain.so
#10 0x000000000066a59e in ProcessQuery ()
#11 0x000000000066a7ef in PortalRunMulti ()
#12 0x000000000066afd2 in PortalRun ()
#13 0x0000000000666fcb in exec_simple_query ()
#14 0x0000000000668058 in PostgresMain ()
#15 0x0000000000622ef1 in PostmasterMain ()
#16 0x00000000005c0723 in main ()

#0  0x00000036102eaf77 in semop () from /lib64/libc.so.6
#1  0x0000000000614707 in PGSemaphoreLock ()
#2  0x0000000000659d5b in LWLockAcquire ()
#3  0x000000000047983d in RelationGetBufferForTuple ()
#4  0x0000000000477f86 in heap_insert ()
#5  0x00000000005a4a12 in ExecModifyTable ()
#6  0x000000000058d928 in ExecProcNode ()
#7  0x000000000058c762 in standard_ExecutorRun ()
#8  0x00007f0cb37f99cb in pgss_ExecutorRun () from /opt/symfoserver64/lib/pg_stat_statements.so
#9  0x00007f0cb357f545 in explain_ExecutorRun () from /opt/symfoserver64/lib/auto_explain.so
#10 0x000000000066a59e in ProcessQuery ()
#11 0x000000000066a7ef in PortalRunMulti ()
#12 0x000000000066afd2 in PortalRun ()
#13 0x0000000000666fcb in exec_simple_query ()
#14 0x0000000000668058 in PostgresMain ()
#15 0x0000000000622ef1 in PostmasterMain ()
#16 0x00000000005c0723 in main ()

#0  0x00000036102eaf77 in semop () from /lib64/libc.so.6
#1  0x0000000000614707 in PGSemaphoreLock ()
#2  0x0000000000659d5b in LWLockAcquire ()
#3  0x000000000064bb8c in ProcArrayEndTransaction ()
#4  0x0000000000491216 in CommitTransaction ()
#5  0x00000000004925a5 in CommitTransactionCommand ()
#6  0x0000000000664cf7 in finish_xact_command ()
#7  0x0000000000667145 in exec_simple_query ()
#8  0x0000000000668058 in PostgresMain ()
#9  0x0000000000622ef1 in PostmasterMain ()
#10 0x00000000005c0723 in main ()

Regards
MauMau
Attachment
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-12 20:55:32 +0900, MauMau wrote:
> Dump of assembler code for function LWLockRelease:

Could you try if you get more readable dumps by using disassemble/m? That might at least print line numbers if you have debug info installed.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: "Andres Freund" <andres@2ndquadrant.com> > could you try if you get more readable dumps by using disassemble/m? > That might at least print line numbers if you have debug info installed. OK, I'll try that tomorrow. However, the debug info is not available, because they use PostgreSQL built by themselves, not the community RPM nor EnterpriseDB's installer. Regards MauMau
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Florian Pflug
Date:
On Feb 12, 2014, at 12:55, MauMau <maumau307@gmail.com> wrote:
> From: "Andres Freund" <andres@2ndquadrant.com>
>> It's x86, right? Then it's unlikely to be actual unordered memory
>> accesses, but if the compiler reordered:
>>     LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
>>     proc = head;
>>     head = proc->lwWaitLink;
>>     proc->lwWaitLink = NULL;
>>     proc->lwWaiting = false;
>>     PGSemaphoreUnlock(&proc->sem);
>> to
>>     LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
>>     proc = head;
>>     proc->lwWaiting = false;
>>     head = proc->lwWaitLink;
>>     proc->lwWaitLink = NULL;
>>     PGSemaphoreUnlock(&proc->sem);
>> which it is permitted to do, yes, that could cause symptoms like you
>> describe.
>
> Yes, the hang occurred with 64-bit PostgreSQL 9.2.4 running on RHEL6 for x86_64.
> The PostgreSQL was built with GCC.

The relevant part of the disassembled binary you attached seems to be

Dump of assembler code for function LWLockRelease:
...
0x0000000000647f47 <LWLockRelease+519>: lea    0x10(%rcx),%rdi
0x0000000000647f4b <LWLockRelease+523>: movq   $0x0,0x48(%rcx)
0x0000000000647f53 <LWLockRelease+531>: movb   $0x0,0x41(%rcx)
0x0000000000647f57 <LWLockRelease+535>: callq  0x606210 <PGSemaphoreUnlock>

I haven't checked the offsets, but since lwWaitLink is an 8-byte quantity and lwWaiting a single-byte quantity, it's pretty much certain that the first store updates lwWaitLink and the second lwWaiting. Thus, no reordering seems to have taken place here...

best regards,
Florian Pflug
On 02/12/2014 05:42 PM, Florian Pflug wrote: > On Feb12, 2014, at 12:55 , MauMau <maumau307@gmail.com> wrote: >> From: "Andres Freund" <andres@2ndquadrant.com> >>> It's x86, right? Then it's unlikely to be actual unordered memory >>> accesses, but if the compiler reordered: >>> LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter"); >>> proc = head; >>> head = proc->lwWaitLink; >>> proc->lwWaitLink = NULL; >>> proc->lwWaiting = false; >>> PGSemaphoreUnlock(&proc->sem); >>> to >>> LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter"); >>> proc = head; >>> proc->lwWaiting = false; >>> head = proc->lwWaitLink; >>> proc->lwWaitLink = NULL; >>> PGSemaphoreUnlock(&proc->sem); >>> which it is permitted to do, yes, that could cause symptoms like you >>> describe. >> >> Yes, the hang occurred with 64-bit PostgreSQL 9.2.4 running on RHEL6 for x86_64. >> The PostgreSQL was built with GCC. > > The relevant part of the disassembled binary you attached seems to be > > Dump of assembler code for function LWLockRelease: > ... > 0x0000000000647f47 <LWLockRelease+519>: lea 0x10(%rcx),%rdi > 0x0000000000647f4b <LWLockRelease+523>: movq $0x0,0x48(%rcx) > 0x0000000000647f53 <LWLockRelease+531>: movb $0x0,0x41(%rcx) > 0x0000000000647f57 <LWLockRelease+535>: callq 0x606210 <PGSemaphoreUnlock> > > I haven't checked the offsets, but since lwWaitLink is an 8-byte quantity > and lwWaiting a single-byte quantity, it's pretty much certain that the > first store updates lwWaitLink and the second lwWaiting. Thus, no reordering > seems to have taken place here... > > best regards, > Florian Pflug > > > Even if reordering was not done by compiler, it still can happen while execution. There is no warranty that two subsequent assignments will be observed by all CPU cores in the same order. So if one thread is performing p->x = 1; p->y = 2; p->x = 3; p->y = 4; then other thread can see the following combinations of (x,y): (1,2) (1,4) (3,2) (3,4) It is necessary to explicitly insert write barrier to prevent such non-deterministic behaviour.
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Ants Aasma
Date:
On Wed, Feb 12, 2014 at 4:04 PM, knizhnik <knizhnik@garret.ru> wrote: > Even if reordering was not done by compiler, it still can happen while > execution. > There is no warranty that two subsequent assignments will be observed by all > CPU cores in the same order. The x86 memory model (total store order) provides that guarantee in this specific case. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
From: "Andres Freund" <andres@2ndquadrant.com>
> could you try if you get more readable dumps by using disassemble/m?
> That might at least print line numbers if you have debug info installed.

Please find the attached file. I hope this will reveal something.

Regards
MauMau
Attachment
On 02/12/2014 06:10 PM, Ants Aasma wrote:
> On Wed, Feb 12, 2014 at 4:04 PM, knizhnik <knizhnik@garret.ru> wrote:
>> Even if reordering was not done by compiler, it still can happen while
>> execution.
>> There is no warranty that two subsequent assignments will be observed by all
>> CPU cores in the same order.
>
> The x86 memory model (total store order) provides that guarantee in
> this specific case.
>
> Regards,
> Ants Aasma

Sorry, I thought that we were talking about the general case, not just the x86 architecture.

Maybe I do not understand something in the LWLock code, but it seems to me that assigning NULL to proc->lwWaitLink is not needed at all:

    while (head != NULL)
    {
        LOG_LWDEBUG("LWLockRelease", lockid, "release waiter");
        proc = head;
        head = proc->lwWaitLink;
>>>     proc->lwWaitLink = NULL;
        proc->lwWaiting = false;
        PGSemaphoreUnlock(&proc->sem);
    }

This part of the L1 list is not traversed by any other processor, so nobody will inspect this field. When the awakened process needs to wait for another lock, it will just assign NULL to this field itself:

    proc->lwWaiting = 1;
    proc->lwWaitMode = mode;
>>> proc->lwWaitLink = NULL;
    if (lock->head == NULL)
        lock->head = proc;
    else
        lock->tail->lwWaitLink = proc;
    lock->tail = proc;

Without TSO (total store order), such an assignment of lwWaitLink in LWLockRelease outside the critical section may just corrupt the L1 list in which the awakened process is already linked.

But I am not sure that elimination of this assignment will be enough to ensure correctness of this code without TSO.
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Florian Pflug
Date:
On Feb 10, 2014, at 17:38, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>>> So what we need to do is to acquire a write barrier between the
>>> assignments to lwWaitLink and lwWaiting, i.e.
>>>     proc->lwWaitLink = NULL;
>>>     pg_write_barrier();
>>>     proc->lwWaiting = false;
>>
>> You didn't really explain why you think that ordering is necessary?
>> Each proc being awoken will surely see both fields updated, and other
>> procs won't be examining these fields at all, since we already delinked
>> all these procs from the LWLock's queue.
>
> The problem is that one of the released backends could wake up concurrently
> because of an unrelated, or previous, PGSemaphoreUnlock(). It could see
> lwWaiting = false, and thus wake up and acquire the lock, even if the
> store for lwWaitLink hasn't arrived (or been performed, there's no guaranteed
> ordering here) yet.
> Now, it may well be that there's no practical consequence of that, but I
> am not prepared to bet on it.

AFAICS there is a potential problem if three backends are involved, since by the time the waiting backend's lwWaitLink is set to NULL after the original lock holder released the lock, the waiting backend might already have acquired the lock (due to a spurious wakeup) *and* a third backend might have already enqueued behind it.

The exact sequence for backends A, B, C that corrupts the wait queue is:

A: Release lock, set B's lwWaiting to false
B: Wakes up spuriously, takes the lock
C: Enqueues behind B
A: Sets B's lwWaitLink back to NULL, thereby truncating the queue and
   causing C and anyone behind it to block indefinitely.

I wonder whether LWLockRelease really needs to update lwWaitLink at all. We take the backends we awake off the queue by updating the queue's head and tail, so the contents of lwWaitLink should only matter once the backend is re-inserted into some wait queue. But when doing that, we reset lwWaitLink back to NULL anyway.

best regards,
Florian Pflug
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-13 15:34:09 +0100, Florian Pflug wrote: > On Feb10, 2014, at 17:38 , Andres Freund <andres@2ndquadrant.com> wrote: > > On 2014-02-10 11:11:28 -0500, Tom Lane wrote: > >> Andres Freund <andres@2ndquadrant.com> writes: > >>> So what we need to do is to acquire a write barrier between the > >>> assignments to lwWaitLink and lwWaiting, i.e. > >>> proc->lwWaitLink = NULL; > >>> pg_write_barrier(); > >>> proc->lwWaiting = false; > >> > >> You didn't really explain why you think that ordering is necessary? > >> Each proc being awoken will surely see both fields updated, and other > >> procs won't be examining these fields at all, since we already delinked > >> all these procs from the LWLock's queue. > > > > The problem is that one the released backends could wake up concurrently > > because of a unrelated, or previous PGSemaphoreUnlock(). It could see > > lwWaiting = false, and thus wakeup and acquire the lock, even if the > > store for lwWaitLink hasn't arrived (or performed, there's no guaranteed > > ordering here) yet. > > Now, it may well be that there's no practical consequence of that, but I > > am not prepared to bet on it. > > AFAICS there is a potential problem if three backends are involved, since > by the time the waiting backend's lwWaitLink is set to NULL after the > original lock holder released the lock, the waiting backend might already > have acquired the lock (due to a spurious wakeup) *and* a third backend > might have already enqueued behind it. > > The exact sequence for backends A,B,C that corrupts the wait queue is: > > A: Release lock, set B's lwWaiting to false > B: Wakes up spuriously, takes the lock > C: Enqueues behind B > A: Sets B' lwWaitLink back to NULL, thereby truncating the queue and > causing C and anyone behind it to block indefinitely. I don't think that can actually happen because the head of the wait list isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same for a while... So, right now I don't see problems without either the compiler reordering stores or heavily out of order machines with speculative execution. > I wonder whether LWLockRelease really needs to update lwWaitLink at all. > We take the backends we awake off the queue by updating the queue's head and > tail, so the contents of lwWaitLink should only matter once the backend is > re-inserted into some wait queue. But when doing that, we reset lwWaitLink > back to NULL anway. I don't like that, this stuff is hard to debug already, having stale pointers around doesn't help. I think we should just add the barrier in the releases with barrier.h and locally use a volatile var in the branches before that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
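To illustrate the two variants mentioned here (a sketch only, with assumed variable names, not the committed patch): on branches with barrier.h the wakeup loop gains a pg_write_barrier(), while on older branches the stores can be forced through a volatile-qualified pointer, which at least stops the compiler from reordering them:

    /* With barrier.h available (sketch): */
    proc->lwWaitLink = NULL;
    pg_write_barrier();         /* make the link store visible first */
    proc->lwWaiting = false;

    /*
     * Without barrier.h (sketch): a volatile-qualified pointer keeps the
     * compiler from reordering the two stores; it does not constrain the CPU.
     */
    {
        volatile PGPROC *vproc = proc;

        vproc->lwWaitLink = NULL;
        vproc->lwWaiting = false;
    }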
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Florian Pflug
Date:
On Feb 14, 2014, at 11:45, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-02-13 15:34:09 +0100, Florian Pflug wrote:
>> AFAICS there is a potential problem if three backends are involved, since
>> by the time the waiting backend's lwWaitLink is set to NULL after the
>> original lock holder released the lock, the waiting backend might already
>> have acquired the lock (due to a spurious wakeup) *and* a third backend
>> might have already enqueued behind it.
>>
>> The exact sequence for backends A, B, C that corrupts the wait queue is:
>>
>> A: Release lock, set B's lwWaiting to false
>> B: Wakes up spuriously, takes the lock
>> C: Enqueues behind B
>> A: Sets B's lwWaitLink back to NULL, thereby truncating the queue and
>>    causing C and anyone behind it to block indefinitely.
>
> I don't think that can actually happen because the head of the wait list
> isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same
> for a while...

Hm, true, but does that protect us under all circumstances? If another backend grabs the lock before B gets a chance to do so, B will become the wait queue's head, and anyone who enqueues behind B will do so by updating B's lwWaitLink. That is then undone by the delayed reset of B's lwWaitLink by the original lock holder.

The specific sequence I have in mind is:

A: Take lock
B: Enqueue
A: Release lock, set B's lwWaiting to false
D: Acquire lock
B: Enqueue after spurious wakeup
   (lock->head := B)
C: Enqueue behind B
   (B->lwWaitLink := C, lock->tail := C)
A: Set B's lwWaitLink back to NULL, thereby corrupting the queue
   (B->lwWaitLink := NULL)
D: Release lock, dequeue and wake up B
   (lock->head := B->lwWaitLink, i.e. lock->head := NULL)
B: Take and release lock, queue appears empty!
   C blocks indefinitely.

>> I wonder whether LWLockRelease really needs to update lwWaitLink at all.
>> We take the backends we awake off the queue by updating the queue's head and
>> tail, so the contents of lwWaitLink should only matter once the backend is
>> re-inserted into some wait queue. But when doing that, we reset lwWaitLink
>> back to NULL anyway.
>
> I don't like that, this stuff is hard to debug already, having stale
> pointers around doesn't help. I think we should just add the barrier in
> the releases with barrier.h and locally use a volatile var in the
> branches before that.

I like the idea of updating lwWaiting and lwWaitLink while still holding the spinlock better. The costs are probably negligible, and it'd make the code much easier to reason about.

best regards,
Florian Pflug
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-14 13:28:47 +0100, Florian Pflug wrote: > > I don't think that can actually happen because the head of the wait list > > isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same > > for a while... > > Hm, true, but does that protect us under all circumstances? If another > backend grabs the lock before B gets a chance to do so, B will become the > wait queue's head, and anyone who enqueues behind B will do so by updating > B's lwWaitLink. That is then undone by the delayed reset of B's lwWaitLink > by the original lock holder. > > The specific sequence I have in mind is: > > A: Take lock > B: Enqueue > A: Release lock, set B's lwWaiting to false > D: Acquire lock > B: Enqueue after spurious wakeup > (lock->head := B) > C: Enqueue behind B > (B->lwWaitLink := C, lock->tail := C) > A: Set B's wWaitLink back to NULL, thereby corrupting the queue > (B->lwWaitLink := NULL) > D: Release lock, dequeue and wakeup B > (lock->head := B->lwWaitLink, i.e. lock->head := NULL) > B: Take and release lock, queue appears empty! > C blocks indefinitely. That's assuming either reordering by the compiler or an out-of-order store architecture, right? > >> I wonder whether LWLockRelease really needs to update lwWaitLink at all. > >> We take the backends we awake off the queue by updating the queue's head and > >> tail, so the contents of lwWaitLink should only matter once the backend is > >> re-inserted into some wait queue. But when doing that, we reset lwWaitLink > >> back to NULL anway. > > > > I don't like that, this stuff is hard to debug already, having stale > > pointers around doesn't help. I think we should just add the barrier in > > the releases with barrier.h and locally use a volatile var in the > > branches before that. > > I like the idea of updating lwWaiting and lwWaitLink while still holding the > spinlock better. The costs are probably negligible, and it'd make the code much > easier to reason about. I agree we should do that, but imo not in the backbranches. Anything more than than the minimal fix in that code should be avoided in the stable branches, this stuff is friggin performance sensitive, and the spinlock already is a *major* contention point in many workloads. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Florian Pflug
Date:
On Feb14, 2014, at 13:36 , Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-02-14 13:28:47 +0100, Florian Pflug wrote: >>> I don't think that can actually happen because the head of the wait list >>> isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same >>> for a while... >> >> Hm, true, but does that protect us under all circumstances? If another >> backend grabs the lock before B gets a chance to do so, B will become the >> wait queue's head, and anyone who enqueues behind B will do so by updating >> B's lwWaitLink. That is then undone by the delayed reset of B's lwWaitLink >> by the original lock holder. >> >> The specific sequence I have in mind is: >> >> A: Take lock >> B: Enqueue >> A: Release lock, set B's lwWaiting to false >> D: Acquire lock >> B: Enqueue after spurious wakeup >> (lock->head := B) >> C: Enqueue behind B >> (B->lwWaitLink := C, lock->tail := C) >> A: Set B's wWaitLink back to NULL, thereby corrupting the queue >> (B->lwWaitLink := NULL) >> D: Release lock, dequeue and wakeup B >> (lock->head := B->lwWaitLink, i.e. lock->head := NULL) >> B: Take and release lock, queue appears empty! >> C blocks indefinitely. > > That's assuming either reordering by the compiler or an out-of-order > store architecture, right? Yes, it requires that a backend exits out of the PGSemaphoreLock loop in LWLockAcquire before lwWaitLink has been reset to NULL by the previous lock holder's LWLockRelease. Only if that happens there is a risk of the PGPROC being on a wait queue by the time lwWaitLink is finally reset to NULL. >>>> I wonder whether LWLockRelease really needs to update lwWaitLink at all. >>>> We take the backends we awake off the queue by updating the queue's head and >>>> tail, so the contents of lwWaitLink should only matter once the backend is >>>> re-inserted into some wait queue. But when doing that, we reset lwWaitLink >>>> back to NULL anway. >>> >>> I don't like that, this stuff is hard to debug already, having stale >>> pointers around doesn't help. I think we should just add the barrier in >>> the releases with barrier.h and locally use a volatile var in the >>> branches before that. >> >> I like the idea of updating lwWaiting and lwWaitLink while still holding the >> spinlock better. The costs are probably negligible, and it'd make the code much >> easier to reason about. > > I agree we should do that, but imo not in the backbranches. Anything > more than than the minimal fix in that code should be avoided in the > stable branches, this stuff is friggin performance sensitive, and the > spinlock already is a *major* contention point in many workloads. No argument there. But unless I missed something, there actually is a bug there, and using volatile won't fix it. A barrier would, but what about the back branches that don't have barrier.h? AFAICS the only other choices we have on these branches are to either ignore the bug - it's probably *extremely* unlikely - or to use a spinlock acquire/release cycle to create a barrier. The former leaves me with a bit of an uneasy feeling, and the latter quite certainly has a larger performance impact than moving the PGPROC updates within the section protected by the spinlock and using an array to remember the backends to wakeup. best regards, Florian Pflug
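For concreteness, the spinlock-cycle fallback being weighed here might look roughly like this (illustrative only; it assumes SpinLockAcquire()/SpinLockRelease() act as a full memory barrier on the platforms in question, and it reuses the lock's own mutex, which is one possible choice):

    /* Sketch of a barrier substitute for branches without barrier.h. */
    proc->lwWaitLink = NULL;
    SpinLockAcquire(&lock->mutex);  /* acquire/release pair serves as a barrier */
    SpinLockRelease(&lock->mutex);
    proc->lwWaiting = false;
    PGSemaphoreUnlock(&proc->sem);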
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-14 13:52:45 +0100, Florian Pflug wrote: > > I agree we should do that, but imo not in the backbranches. Anything > > more than than the minimal fix in that code should be avoided in the > > stable branches, this stuff is friggin performance sensitive, and the > > spinlock already is a *major* contention point in many workloads. > > No argument there. But unless I missed something, there actually is a bug > there, and using volatile won't fix it. A barrier would, but what about the > back branches that don't have barrier.h? Yea, but I don't see a better alternative. Realistically the likelihood of a problem without the compiler reordering stuff is miniscule on any relevant platform that's supported by the !barrier.h branches. Newer ARMs are the only realistic suspect, and the support for in older releases wasn't so good... > The former > leaves me with a bit of an uneasy feeling, and the latter quite certainly has > a larger performance impact than moving the PGPROC updates within the section > protected by the spinlock and using an array to remember the backends to wakeup. I am not so sure, it adds a host of new cacheline references in a piece of code that's already heavily affected by pipeline stalls, that can influence performance. I am not saying it's super likely, just more than I want to do for a theoretical problem in the back branches. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Florian Pflug
Date:
On Feb14, 2014, at 14:07 , Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-02-14 13:52:45 +0100, Florian Pflug wrote: >>> I agree we should do that, but imo not in the backbranches. Anything >>> more than than the minimal fix in that code should be avoided in the >>> stable branches, this stuff is friggin performance sensitive, and the >>> spinlock already is a *major* contention point in many workloads. >> >> No argument there. But unless I missed something, there actually is a bug >> there, and using volatile won't fix it. A barrier would, but what about the >> back branches that don't have barrier.h? > > Yea, but I don't see a better alternative. Realistically the likelihood > of a problem without the compiler reordering stuff is miniscule on any > relevant platform that's supported by the !barrier.h branches. Newer > ARMs are the only realistic suspect, and the support for in older > releases wasn't so good... Isn't POWER/PowerPC potentially affected by this? Also, there is still the alternative of not resetting lwWaitLink in LWLockRelease. I can understand why you oppose that - it's certainly nicer to not have stray pointer lying around. But since (as least as far as we know) a) resetting lwWaitLink is not strictly necessary b) resetting lwWaitLink introduces a race condition (however unlikely) we'll trade correctness for cleanliness if we continue to reset lwWaitLink without protecting against the race. That's a bit of a weird trade-off to make. Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one field. We could replace "lwWaiting" by "lwWaitLink != NULL" everywhere it's tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true. We'd then depend on pointer-sized stores being atomic, which I think we depend on in other places already. best regards, Florian Pflug
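Sketched out, that conflated-field idea might look like this (hypothetical sentinel names; it stands or falls with the pointer-atomicity assumption questioned downthread):

    /* Hypothetical sketch of conflating lwWaiting into lwWaitLink. */
    #define LW_NOT_WAITING  ((PGPROC *) NULL)   /* was: lwWaiting = false */
    #define LW_WAITING_END  ((PGPROC *) 1)      /* waiting, no successor yet */

    /* enqueue (in LWLockAcquire), instead of lwWaiting = true; lwWaitLink = NULL: */
    proc->lwWaitLink = LW_WAITING_END;
    if (lock->head == NULL)
        lock->head = proc;
    else
        lock->tail->lwWaitLink = proc;
    lock->tail = proc;

    /*
     * wakeup (in LWLockRelease): a single pointer-sized store replaces the
     * two stores whose ordering is at issue, so no barrier is needed,
     * provided pointer-sized stores are atomic.
     */
    proc->lwWaitLink = LW_NOT_WAITING;
    PGSemaphoreUnlock(&proc->sem);

    /* wait test, instead of "if (!proc->lwWaiting)": */
    if (proc->lwWaitLink == LW_NOT_WAITING)
        break;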
Florian Pflug <fgp@phlo.org> writes: > Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one > field. We could replace "lwWaiting" by "lwWaitLink != NULL" everywhere it's > tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we > enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true. > We'd then depend on pointer-sized stores being atomic, which I think we depend > on in other places already. I don't believe that's true; neither that we depend on it now, nor that it would be safe to do so. regards, tom lane
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-14 10:26:07 -0500, Tom Lane wrote: > Florian Pflug <fgp@phlo.org> writes: > > Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one > > field. We could replace "lwWaiting" by "lwWaitLink != NULL" everywhere it's > > tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we > > enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true. > > > We'd then depend on pointer-sized stores being atomic, which I think we depend > > on in other places already. > > I don't believe that's true; neither that we depend on it now, nor that > it would be safe to do so. Yea, we currently rely on 4 byte stores being atomic (most notably for xids), but we don't rely on anything bigger. I am not sure if there are architectures with 64bit pointers that aren't accessed atomically when aligned? Even if, that's certainly nothing that should be introduced when backpatching. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Florian Pflug
Date:
On Feb14, 2014, at 16:32 , Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-02-14 10:26:07 -0500, Tom Lane wrote: >> Florian Pflug <fgp@phlo.org> writes: >>> Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one >>> field. We could replace "lwWaiting" by "lwWaitLink != NULL" everywhere it's >>> tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we >>> enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true. >> >>> We'd then depend on pointer-sized stores being atomic, which I think we depend >>> on in other places already. >> >> I don't believe that's true; neither that we depend on it now, nor that >> it would be safe to do so. > > Yea, we currently rely on 4 byte stores being atomic (most notably for > xids), but we don't rely on anything bigger. I am not sure if there are > architectures with 64bit pointers that aren't accessed atomically when > aligned? Even if, that's certainly nothing that should be introduced > when backpatching. Hm, we could use 4-byte stores instead of 8-byte stores if we turned lwWaitLink into an index into the proc array instead of a pointer to the PGPROC struct. We could then use 0xffffffff instead place of NULL to indicate "not waiting", and PROCARRAY_MAXPROCS to indicate "waiting, but no successor in the queue". best regards, Florian Pflug
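As a sketch (hypothetical field and constant names), the pointer would be replaced by a 32-bit index, so the wakeup store stays a single aligned 4-byte write of the kind PostgreSQL already treats as atomic for xids:

    /* Hypothetical replacement for "struct PGPROC *lwWaitLink": */
    uint32      lwWaitLinkIdx;              /* index into the PGPROC array */

    #define LW_LINK_NOT_WAITING 0xFFFFFFFF          /* not on any wait queue */
    #define LW_LINK_QUEUE_END   PROCARRAY_MAXPROCS  /* waiting, no successor */

    /* Wakeup in LWLockRelease becomes a single 4-byte store: */
    proc->lwWaitLinkIdx = LW_LINK_NOT_WAITING;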
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-14 15:03:16 +0100, Florian Pflug wrote: > On Feb14, 2014, at 14:07 , Andres Freund <andres@2ndquadrant.com> wrote: > > On 2014-02-14 13:52:45 +0100, Florian Pflug wrote: > >>> I agree we should do that, but imo not in the backbranches. Anything > >>> more than than the minimal fix in that code should be avoided in the > >>> stable branches, this stuff is friggin performance sensitive, and the > >>> spinlock already is a *major* contention point in many workloads. > >> > >> No argument there. But unless I missed something, there actually is a bug > >> there, and using volatile won't fix it. A barrier would, but what about the > >> back branches that don't have barrier.h? > > > > Yea, but I don't see a better alternative. Realistically the likelihood > > of a problem without the compiler reordering stuff is miniscule on any > > relevant platform that's supported by the !barrier.h branches. Newer > > ARMs are the only realistic suspect, and the support for in older > > releases wasn't so good... > > Isn't POWER/PowerPC potentially affected by this? I wouldn't consider it a major architecture... And I am not sure how much out of order a CPU has to be to be affected by this, the read side uses spinlocks, which in most of the spinlock implementations implies a full memory barrier which in many cache coherency designs will cause other CPUs to flush writes. And I think the control dependency in the PGSemaphoreUnlock() loop will actually cause a flush on ppc... > Also, there is still the alternative of not resetting lwWaitLink in LWLockRelease. > I can understand why you oppose that - it's certainly nicer to not have stray > pointer lying around. But since (as least as far as we know) > > a) resetting lwWaitLink is not strictly necessary I don't want to rely on that in the backbranches, that's an entirely new assumption. I think anything more than minimal changes are hard to justify for a theoretical issue that hasn't been observed in the field. I think the biggest danger here is that writes are reordered by the compiler, that we definitely need to protect against. Making a variable volatile or introducing a memory barrier is pretty simple to analyze. > b) resetting lwWaitLink introduces a race condition (however unlikely) > > we'll trade correctness for cleanliness if we continue to reset lwWaitLink > without protecting against the race. That's a bit of a weird trade-off to make. It's not just cleanliness, it's being able to actually debug crashes. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 02/14/2014 07:51 PM, Andres Freund wrote: > On 2014-02-14 15:03:16 +0100, Florian Pflug wrote: >> On Feb14, 2014, at 14:07 , Andres Freund <andres@2ndquadrant.com> wrote: >>> On 2014-02-14 13:52:45 +0100, Florian Pflug wrote: >>>>> I agree we should do that, but imo not in the backbranches. Anything >>>>> more than than the minimal fix in that code should be avoided in the >>>>> stable branches, this stuff is friggin performance sensitive, and the >>>>> spinlock already is a *major* contention point in many workloads. >>>> >>>> No argument there. But unless I missed something, there actually is a bug >>>> there, and using volatile won't fix it. A barrier would, but what about the >>>> back branches that don't have barrier.h? >>> >>> Yea, but I don't see a better alternative. Realistically the likelihood >>> of a problem without the compiler reordering stuff is miniscule on any >>> relevant platform that's supported by the !barrier.h branches. Newer >>> ARMs are the only realistic suspect, and the support for in older >>> releases wasn't so good... >> >> Isn't POWER/PowerPC potentially affected by this? > > I wouldn't consider it a major architecture... And I am not sure how > much out of order a CPU has to be to be affected by this, the read side > uses spinlocks, which in most of the spinlock implementations implies a > full memory barrier which in many cache coherency designs will cause > other CPUs to flush writes. And I think the control dependency in the > PGSemaphoreUnlock() loop will actually cause a flush on ppc... PGSemaphoreUnlock() should really introduce memory barrier, but the problem can happen before PGSemaphoreUnlock() is called. So presence of PGSemaphoreUnlock() just reduces probability of race condition on non-TSO architectures (PPC, ARM, IA64,...)but doesn't completely eliminate it. > >> Also, there is still the alternative of not resetting lwWaitLink in LWLockRelease. >> I can understand why you oppose that - it's certainly nicer to not have stray >> pointer lying around. But since (as least as far as we know) >> >> a) resetting lwWaitLink is not strictly necessary > > I don't want to rely on that in the backbranches, that's an entirely new > assumption. I think anything more than minimal changes are hard to > justify for a theoretical issue that hasn't been observed in the field. > > I think the biggest danger here is that writes are reordered by the > compiler, that we definitely need to protect against. Making a variable > volatile or introducing a memory barrier is pretty simple to analyze. > >> b) resetting lwWaitLink introduces a race condition (however unlikely) >> >> we'll trade correctness for cleanliness if we continue to reset lwWaitLink >> without protecting against the race. That's a bit of a weird trade-off to make. > > It's not just cleanliness, it's being able to actually debug crashes. Frankly speaking I do not understand why elimination of resetting of lwWaitLink was considered to be bad idea. Resetting this pointer to NULL will not help much to analyze crash dumps, because right now it is possible that lwWaitLink==NULLbut process in waiting for a lock and linked in the list (if it is the last element of the list). So the fact that lwWaitLink==NULL actually gives not so much useful information. > > Greetings, > > Andres Freund >
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-14 20:23:32 +0400, knizhnik wrote: > >>we'll trade correctness for cleanliness if we continue to reset lwWaitLink > >>without protecting against the race. That's a bit of a weird trade-off to make. > > > >It's not just cleanliness, it's being able to actually debug crashes. > > > Frankly speaking I do not understand why elimination of resetting of lwWaitLink was considered to be bad idea. > Resetting this pointer to NULL will not help much to analyze crash dumps, because right now it is possible that lwWaitLink==NULLbut process in waiting for a lock and linked in the list > (if it is the last element of the list). So the fact that lwWaitLink==NULL actually gives not so much useful information. At the moment if you connect to a live pg or a crash dump, investigating the wait links allows you to somewhat sensibly determine which backends are waiting for a lock and whether they are part of a queue. If they aren't reset anymore that will tell you nothing, so you'll need to connect to all pg instances to debug issues. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 02/14/2014 08:28 PM, Andres Freund wrote: > On 2014-02-14 20:23:32 +0400, knizhnik wrote: >>>> we'll trade correctness for cleanliness if we continue to reset lwWaitLink >>>> without protecting against the race. That's a bit of a weird trade-off to make. >>> >>> It's not just cleanliness, it's being able to actually debug crashes. >> >> >> Frankly speaking I do not understand why elimination of the resetting of lwWaitLink was considered to be a bad idea. >> Resetting this pointer to NULL will not help much to analyze crash dumps, because right now it is possible that lwWaitLink==NULL but the process is waiting for a lock and linked in the list >> (if it is the last element of the list). So the fact that lwWaitLink==NULL actually does not give much useful information. > > At the moment if you connect to a live pg or a crash dump, investigating > the wait links allows you to somewhat sensibly determine which backends > are waiting for a lock and whether they are part of a queue. If they > aren't reset anymore that will tell you nothing, so you'll need to > connect to all pg instances to debug issues. Why is it not enough to inspect the lwWaiting flag? In any case, resetting lwWaitLink can safely be done in the awakened process: if (!proc->lwWaiting) { proc->lwWaitLink = NULL; break; } > > Greetings, > > Andres Freund >
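A minimal sketch of that suggestion, assuming the 9.3-era wait loop in LWLockAcquire (an untested illustration, not a posted patch): the awakened backend clears its own lwWaitLink once it sees it has been taken off the queue, so no other backend ever needs to touch it:

    for (;;)
    {
        /* "false" means cannot accept cancel/die interrupt here */
        PGSemaphoreLock(&proc->sem, false);
        if (!proc->lwWaiting)
        {
            /* we are off the queue, so nobody else looks at our link anymore */
            proc->lwWaitLink = NULL;
            break;
        }
        extraWaits++;
    }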
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Florian Pflug
Date:
On Feb14, 2014, at 16:51 , Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-02-14 15:03:16 +0100, Florian Pflug wrote: >> On Feb14, 2014, at 14:07 , Andres Freund <andres@2ndquadrant.com> wrote: >>> On 2014-02-14 13:52:45 +0100, Florian Pflug wrote: >>>>> I agree we should do that, but imo not in the backbranches. Anything >>>>> more than the minimal fix in that code should be avoided in the >>>>> stable branches, this stuff is friggin performance sensitive, and the >>>>> spinlock already is a *major* contention point in many workloads. >>>> >>>> No argument there. But unless I missed something, there actually is a bug >>>> there, and using volatile won't fix it. A barrier would, but what about the >>>> back branches that don't have barrier.h? >>> >>> Yea, but I don't see a better alternative. Realistically the likelihood >>> of a problem without the compiler reordering stuff is minuscule on any >>> relevant platform that's supported by the !barrier.h branches. Newer >>> ARMs are the only realistic suspect, and the support for them in older >>> releases wasn't so good... >> >> Isn't POWER/PowerPC potentially affected by this? > > I wouldn't consider it a major architecture... And I am not sure how > much out of order a CPU has to be to be affected by this, the read side > uses spinlocks, which in most of the spinlock implementations implies a > full memory barrier which in many cache coherency designs will cause > other CPUs to flush writes. And I think the control dependency in the > PGSemaphoreUnlock() loop will actually cause a flush on ppc... I guess it's sufficient for the store to lwWaitLink to be delayed. As long as the CPU on which that store is executing hasn't managed to gain exclusive access to the relevant cache line, memory barriers on the read side won't help because the store won't be visible to other CPUs. >> Also, there is still the alternative of not resetting lwWaitLink in LWLockRelease. >> I can understand why you oppose that - it's certainly nicer to not have a stray >> pointer lying around. But since (at least as far as we know) >> >> a) resetting lwWaitLink is not strictly necessary > > I don't want to rely on that in the backbranches, that's an entirely new > assumption. I think anything more than minimal changes are hard to > justify for a theoretical issue that hasn't been observed in the field. > > I think the biggest danger here is that writes are reordered by the > compiler, that we definitely need to protect against. Making a variable > volatile or introducing a memory barrier is pretty simple to analyze. Well, the assumption isn't all that new. We already have the situation that a PGPROC may not be on any wait queue, yet its lwWaitLink may be non-NULL. Currently, the process that took it off the queue is responsible for rectifying that eventually; with the change proposed below it'll be the backend who owns the PGPROC. From the point of view of other backends, nothing really changes. > >> b) resetting lwWaitLink introduces a race condition (however unlikely) >> >> we'll trade correctness for cleanliness if we continue to reset lwWaitLink >> without protecting against the race. That's a bit of a weird trade-off to make. > > It's not just cleanliness, it's being able to actually debug crashes. We could still arrange for the stray lwWaitLink to be visible only momentarily. If a backend is woken up and detects that lwWaiting is false, it knows that it cannot be on any wait queue - it was just removed, and hasn't added itself again yet.
At that point, it's safe to reset lwWaitLink back to NULL. The stray value would thus only be visible while a backend executes the PGSemaphoreLock() loop, and whether or not this is the case can be deduced from a stack trace. So I'm not convinced that this makes debugging any harder. (knizhnik has just suggested the same) It's interesting, BTW, that updating lwWaitLink after lwWaiting is OK here - the crucial property is not that it's updated before lwWaiting, but rather that it is reset before enqueuing the backend again. Which is something that backend itself can guarantee far more easily than whoever happens to de-queue it due to spurious wakeups. best regards, Florian Pflug
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-14 18:49:33 +0100, Florian Pflug wrote: > > I wouldn't consider it a major architecture... And I am not sure how > > much out of order a CPU has to be to be affected by this, the read side > > uses spinlocks, which in most of the spinlock implementations implies a > > full memory barrier which in many cache coherency designs will cause > > other CPUs to flush writes. And I think the control dependency in the > > PGSemaphoreUnlock() loop will actually cause a flush on ppc... > > I guess it's sufficient for the store to lwWaitLink to be delayed. > As long as the CPU on which that store is executing hasn't managed to gain > exclusive access to the relevant cache line, memory barriers on the read > side won't help because the store won't be visible to other CPUs. The whole lwlock actually should be on the same cacheline on anything with cachelines >= 32. As the woken up side is doing an atomic op on it *before* modifying lwWaitLink I think we are actually guaranteed that any incomplete store on the writer will have completed unless the compiler reordered. So we are safe against out of order stores, but not out of order execution. That might have been what prevented the issue from being noticeable on some platforms. > Well, the assumption isn't all that new. We already have the situation that > a PGPROC may not be on any wait queue, yet its lwWaitLink may be non-NULL. > Currently, the process that took it off the queue is responsible for rectifying > that eventually; with the change proposed below it'll be the backend who > owns the PGPROC. From the point of view of other backends, nothing really > changes. That window is absolutely tiny tho. > >> b) resetting lwWaitLink introduces a race condition (however unlikely) > >> > >> we'll trade correctness for cleanliness if we continue to reset lwWaitLink > >> without protecting against the race. That's a bit of a weird trade-off to make. > > > > It's not just cleanliness, it's being able to actually debug crashes. > > We could still arrange for the stray lwWaitLink to be visible only > momentarily. If a backend is woken up and detects that lwWaiting is false, > it knows that it cannot be on any wait queue - it was just removed, and > hasn't added itself again yet. At that point, it's safe to reset lwWaitLink > back to NULL. The stray value would thus only be visible while a backend executes > the PGSemaphoreLock() loop, and whether or not this is the case can be deduced > from a stack trace. So I'm not convinced that this makes debugging any harder. That's not actually safe on an out of order architecture afaics. Without barriers the store to lwWaitLink in the woken up backend can preempt the read for the next element in the PGSemaphoreUnlock() loop. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
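Spelled out, the interleaving being worried about here looks roughly like this (an illustration only, not code from any patch):

    /*
     * Waker, in the PGSemaphoreUnlock() loop:      Awakened backend:
     *   (A) read  proc->lwWaitLink  (next elem)
     *   (B) store proc->lwWaiting = false
     *       PGSemaphoreUnlock(&proc->sem)
     *                                                sees lwWaiting == false
     *                                                (C) store proc->lwWaitLink = NULL
     *
     * Without a barrier, an out-of-order CPU may let (B) become visible before
     * (A) has actually been performed; a spuriously-woken backend can then
     * execute (C) early enough that the waker's read (A) picks up NULL and
     * silently drops the rest of the wait queue.
     */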
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Florian Pflug
Date:
On Feb14, 2014, at 19:21 , Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-02-14 18:49:33 +0100, Florian Pflug wrote: >> Well, the assumption isn't all that new. We already have the situation that >> a PGPROC may not be on any wait queue, yet its lwWaitLink may be non-NULL. >> Currently, the process that took it off the queue is responsible for rectifying >> that eventually; with the change proposed below it'll be the backend who >> owns the PGPROC. From the point of view of other backends, nothing really >> changes. > > That window is absolutely tiny tho. True, but it's there, so if anything counts on that never being the case, it's still already broken. >>>> b) resetting lwWaitLink introduces a race condition (however unlikely) >>>> >>>> we'll trade correctness for cleanliness if we continue to reset lwWaitLink >>>> without protecting against the race. That's a bit of a weird trade-off to make. >>> >>> It's not just cleanliness, it's being able to actually debug crashes. >> >> We could still arrange for the stray lwWaitLink to be visible only >> momentarily. If a backend is woken up and detects that lwWaiting is false, >> it knows that it cannot be on any wait queue - it was just removed, and >> hasn't added itself again yet. At that point, it's safe to reset lwWaitLink >> back to NULL. The stray value would thus only be visible while a backend executes >> the PGSemaphoreLock() loop, and whether or not this is the case can be deduced >> from a stack trace. So I'm not convinced that this makes debugging any harder. > > That's not actually safe on an out of order architecture afaics. Without > barriers the store to lwWaitLink in the woken up backend can preempt the > read for the next element in the PGSemaphoreUnlock() loop. Hm, true. I guess we could prevent that by introducing an artificial dependency between the read and the write - something like proc = head; head = proc->lwWaitLink; /* * We don't ever expect to actually PANIC here, but the test forces the * load to execute before the store to lwWaiting. This prevents a race * between reading lwWaitLink here and resetting it back to zero in the * awoken backend (Note that backends can wake up spuriously, so just * reading it before doing PGSemaphoreUnlock() is insufficient) */ if (head != MyProc) proc->lwWaiting = false; else elog(PANIC, ...); PGSemaphoreUnlock(&proc->sem); (This assumes that proc is a volatile pointer) Another idea would be to do as you suggest and only mark the PGPROC pointers volatile, but to additionally add a check for queue corruption somewhere. We should be able to detect that - if we ever hit this issue, LWLockRelease should find a PGPROC while traversing the queue whose lwWaitLink is NULL but which isn't equal to lock->tail. If that ever happens, we'd simply PANIC. best regards, Florian Pflug
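A rough sketch of that second idea, a corruption check while LWLockRelease traverses the queue (hypothetical code, not from any posted patch): under the lock's spinlock, a NULL link on anything but the tail would indicate exactly the race discussed here:

    /* hypothetical sanity check while holding the lock's spinlock */
    proc = lock->head;
    while (proc != NULL)
    {
        if (proc->lwWaitLink == NULL && proc != lock->tail)
            elog(PANIC, "lwlock wait queue corrupted");
        proc = proc->lwWaitLink;
    }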
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-15 04:20:17 +0100, Florian Pflug wrote: > Another idea would be to do as you suggest and only mark the PGPROC pointers > volatile, but to additionally add a check for queue corruption somewhere. We should > be able to detect that - if we ever hit this issue, LWLockRelease should find a > PGPROC while traversing the queue whose lwWaitLink is NULL but which isn't equal to > lock->tail. If that ever happens, we'd simply PANIC. My current conclusion is that backporting barriers.h is by far the most reasonable way to go. The compiler problems have been ironed out by now... Arguments against? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > My current conclusion is that backporting barriers.h is by far the most > reasonable way to go. The compiler problems have been ironed out by > now... -1. IMO that code is still quite unproven, and what's more, the problem we're discussing here is completely hypothetical. If it were real, we'd have field evidence of it. We've not had that much trouble seeing instances of even very narrow race-condition windows in the past. regards, tom lane
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-15 10:06:41 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > My current conclusion is that backporting barriers.h is by far the most > > reasonable way to go. The compiler problems have been ironed out by > > now... > > -1. IMO that code is still quite unproven, and what's more, the > problem we're discussing here is completely hypothetical. If it > were real, we'd have field evidence of it. We've not had that > much trouble seeing instances of even very narrow race-condition > windows in the past. Well, the problem is that few of us have access to interesting !x86 machines to run tests, and that's where we'd see problems (since x86 gives enough guarantees to avoid this unless the compiler reorders stuff). I am personally fine with just using volatiles to avoid reordering in the older branches, but Florian argued against it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
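For illustration, the volatile-based approach for the pre-barrier branches would look roughly like this (a minimal sketch, not the patch that was eventually posted): the volatile qualifier keeps the compiler from reordering or merging the two stores, while accepting that it does nothing about CPU-level reordering:

    volatile PGPROC *vproc = proc;

    vproc->lwWaitLink = NULL;
    vproc->lwWaiting = false;       /* must stay after the lwWaitLink store */
    PGSemaphoreUnlock(&proc->sem);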
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-15 16:18:00 +0100, Andres Freund wrote: > On 2014-02-15 10:06:41 -0500, Tom Lane wrote: > > Andres Freund <andres@2ndquadrant.com> writes: > > > My current conclusion is that backporting barriers.h is by far the most > > > reasonable way to go. The compiler problems have been ironed out by > > > now... > > > > -1. IMO that code is still quite unproven, and what's more, the > > problem we're discussing here is completely hypothetical. If it > > were real, we'd have field evidence of it. We've not had that > > much trouble seeing instances of even very narrow race-condition > > windows in the past. > > Well, the problem is that few of us have access to interesting !x86 > machines to run tests, and that's where we'd see problems (since x86 > gives enough guarantees to avoid this unless the compiler reorders > stuff). I am personally fine with just using volatiles to avoid > reordering in the older branches, but Florian argued against it. Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1 version applies back to 8.4. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Robert Haas
Date:
On Sat, Feb 15, 2014 at 11:17 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-02-15 16:18:00 +0100, Andres Freund wrote: >> On 2014-02-15 10:06:41 -0500, Tom Lane wrote: >> > Andres Freund <andres@2ndquadrant.com> writes: >> > > My current conclusion is that backporting barriers.h is by far the most >> > > reasonable way to go. The compiler problems have been ironed out by >> > > now... >> > >> > -1. IMO that code is still quite unproven, and what's more, the >> > problem we're discussing here is completely hypothetical. If it >> > were real, we'd have field evidence of it. We've not had that >> > much trouble seeing instances of even very narrow race-condition >> > windows in the past. >> >> Well, the problem is that few of us have access to interesting !x86 >> machines to run tests, and that's where we'd see problems (since x86 >> gives enough guarantees to avoid this unless the compiler reorders >> stuff). I am personally fine with just using volatiles to avoid >> reordering in the older branches, but Florian argued against it. > > Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1 > version applies back to 8.4. I have no confidence that this isn't going to be real bad for performance. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-17 13:49:01 -0500, Robert Haas wrote: > On Sat, Feb 15, 2014 at 11:17 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2014-02-15 16:18:00 +0100, Andres Freund wrote: > >> On 2014-02-15 10:06:41 -0500, Tom Lane wrote: > >> > Andres Freund <andres@2ndquadrant.com> writes: > >> > > My current conclusion is that backporting barriers.h is by far the most > >> > > reasonable way to go. The compiler problems have been ironed out by > >> > > now... > >> > > >> > -1. IMO that code is still quite unproven, and what's more, the > >> > problem we're discussing here is completely hypothetical. If it > >> > were real, we'd have field evidence of it. We've not had that > >> > much trouble seeing instances of even very narrow race-condition > >> > windows in the past. > >> > >> Well, the problem is that few of us have access to interesting !x86 > >> machines to run tests, and that's where we'd see problems (since x86 > >> gives enough guarantees to avoid this unless the compiler reorders > >> stuff). I am personally fine with just using volatiles to avoid > >> reordering in the older branches, but Florian argued against it. > > > > Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1 > > version applies back to 8.4. > > I have no confidence that this isn't going to be real bad for performance. It's just a write barrier which evaluates to a pure compiler barrier on x86 anyway? And it's in a loop that's only entered when the kernel is entered anyway to wake up the other backend. What should that affect significantly? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
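For reference, on x86 pg_write_barrier() boils down to a pure compiler barrier, roughly as below (illustrative only, not the exact barrier.h macros, which vary by branch and compiler); on weakly ordered architectures it has to emit a real fence instruction, which is also exactly where the barrier is needed for correctness in the first place:

    /* illustrative only; see src/include/storage/barrier.h for the real thing */
    #if defined(__i386__) || defined(__x86_64__)
    #define pg_write_barrier()  __asm__ __volatile__("" : : : "memory")
    #endif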
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Robert Haas
Date:
On Mon, Feb 17, 2014 at 1:55 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-02-17 13:49:01 -0500, Robert Haas wrote: >> On Sat, Feb 15, 2014 at 11:17 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> > On 2014-02-15 16:18:00 +0100, Andres Freund wrote: >> >> On 2014-02-15 10:06:41 -0500, Tom Lane wrote: >> >> > Andres Freund <andres@2ndquadrant.com> writes: >> >> > > My current conclusion is that backporting barriers.h is by far the most >> >> > > reasonable way to go. The compiler problems have been ironed out by >> >> > > now... >> >> > >> >> > -1. IMO that code is still quite unproven, and what's more, the >> >> > problem we're discussing here is completely hypothetical. If it >> >> > were real, we'd have field evidence of it. We've not had that >> >> > much trouble seeing instances of even very narrow race-condition >> >> > windows in the past. >> >> >> >> Well, the problem is that few of us have access to interesting !x86 >> >> machines to run tests, and that's where we'd see problems (since x86 >> >> gives enough guarantees to avoid this unless the compiler reorders >> >> stuff). I am personally fine with just using volatiles to avoid >> >> reordering in the older branches, but Florian argued against it. >> > >> > Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1 >> > version applies back to 8.4. >> >> I have no confidence that this isn't going to be real bad for performance. > > It's just a write barrier which evaluates to a pure compiler barrier on > x86 anyway? > And it's in a loop that's only entered when the kernel is entered anyway > to wake up the other backend. > > What should that affect significantly? On x86, presumably nothing. On other architectures, I don't know what the impact is, but I don't accept a hand-wavy assertion that there shouldn't be any as evidence that there won't be. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-17 14:06:43 -0500, Robert Haas wrote: > On Mon, Feb 17, 2014 at 1:55 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2014-02-17 13:49:01 -0500, Robert Haas wrote: > > It's just a write barrier which evaluates to a pure compiler barrier on > > x86 anyway? > > And it's in a loop that's only entered when the kernel is entered anyway > > to wake up the other backend. > > > > What should that affect significantly? > > On x86, presumably nothing. On other architectures, I don't know what > the impact is, but I don't accept a hand-wavy assertion that there > shouldn't be any as evidence that there won't be. Directly afterwards there's a syscall that needs to do internal locking (because it's essentially doing IPC). Which combined certainly is much more expensive than a write barrier. And any !x86 architecture that has more heavyweight write barriers really *needs* a barrier there since you only need more heavyweight write barriers if the architecture doesn't guarantee total store order. This isn't a performance optimization, it's correctness. What's the way to resolve this then? I don't have access to any big !x86 machines. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
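For context, the SysV-semaphore implementation of PGSemaphoreUnlock() already enters the kernel via semop(); a simplified sketch (error handling trimmed, and the POSIX and Windows semaphore implementations omitted):

    void
    PGSemaphoreUnlock(PGSemaphore sema)
    {
        struct sembuf sops;

        sops.sem_op = 1;            /* increment, i.e. release one waiter */
        sops.sem_flg = 0;
        sops.sem_num = sema->semNum;

        /* the real code retries on EINTR */
        if (semop(sema->semId, &sops, 1) < 0)
            elog(FATAL, "semop(id=%d) failed: %m", sema->semId);
    }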
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Heikki Linnakangas
Date:
On 02/10/2014 08:33 PM, Heikki Linnakangas wrote: > On 02/10/2014 08:03 PM, Tom Lane wrote: >> Heikki Linnakangas <hlinnakangas@vmware.com> writes: >>> On 02/10/2014 06:41 PM, Andres Freund wrote: >>>> Well, it's not actually using any lwlock.c code, it's a special case >>>> locking logic, just reusing the datastructures. That said, I am not >>>> particularly happy about the amount of code it's duplicating from >>>> lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of >>>> WALInsertSlotAcquireOne() is a copied. >> >>> I'm not too happy with the amount of copy-paste myself, but there was >>> enough difference to regular lwlocks that I didn't want to bother all >>> lwlocks with the xlog-specific stuff either. The WAL insert slots do >>> share the LWLock-related PGPROC fields though, and semaphore. I'm all >>> ears if you have ideas on that.. >> >> I agree that if the behavior is considerably different, we don't really >> want to try to make LWLockAcquire/Release cater to both this and their >> standard behavior. But why not add some additional functions in lwlock.c >> that do what xlog wants? If we're going to have mostly-copy-pasted logic, >> it'd at least be better if it was in the same file, and not somewhere >> that's not even in the same major subtree. > > Ok, I'll try to refactor it that way, so that we can see if it looks better. This is what I came up with. I like it, I didn't have to contort lwlocks as much as I feared. I added one field to LWLock structure, which is used to store the position of how far a WAL inserter has progressed. The LWLock code calls it just "value", without caring what's stored in it, and it's used by new functions LWLockWait and LWLockWakeup to implement the behavior the WAL insertion slots have, to wake up other processes waiting for the slot without releasing it. This passes regression tests, but I'll have to re-run the performance tests with this. One worry is that if the padded size of the LWLock struct is smaller than cache line, neighboring WAL insertion locks will compete for the cache line. Another worry is that since I added a field to LWLock struct, it might now take 64 bytes on platforms where it used to be 32 bytes before. That wastes some memory. - Heikki
Attachment
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-02-17 22:30:54 +0200, Heikki Linnakangas wrote: > This is what I came up with. I like it, I didn't have to contort lwlocks as > much as I feared. I added one field to LWLock structure, which is used to > store the position of how far a WAL inserter has progressed. The LWLock code > calls it just "value", without caring what's stored in it, and it's used by > new functions LWLockWait and LWLockWakeup to implement the behavior the WAL > insertion slots have, to wake up other processes waiting for the slot > without releasing it. > > This passes regression tests, but I'll have to re-run the performance tests > with this. One worry is that if the padded size of the LWLock struct is > smaller than cache line, neighboring WAL insertion locks will compete for > the cache line. Another worry is that since I added a field to LWLock > struct, it might now take 64 bytes on platforms where it used to be 32 bytes > before. That wastes some memory. Why don't you allocate them in a separate tranche, from xlog.c? Then you can store them inside whatever bigger object you want, guaranteeing exactly the alignment you need. Possibly you can even have the extra value in the enclosing object? I'd very much like to keep the core lwlock's size from increasing much; I plan to work on inlining them in the BufferDescriptors, and keeping it smaller does increase the cache hit ratio.. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
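A sketch of the tranche idea (names assumed here; the actual patch may differ): xlog.c would embed each insertion lock in its own padded struct, keeping the per-lock insertion position next to the lock and giving each pair a cache line of its own:

    typedef struct
    {
        LWLock      lock;
        XLogRecPtr  insertingAt;
    } WALInsertLock;

    typedef union WALInsertLockPadded
    {
        WALInsertLock l;
        char          pad[PG_CACHE_LINE_SIZE];    /* assumed padding constant */
    } WALInsertLockPadded;

    static WALInsertLockPadded *WALInsertLocks = NULL;
    static LWLockTranche WALInsertLockTranche;    /* registered at startup */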
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Peter Geoghegan
Date:
On Wed, Feb 12, 2014 at 3:55 AM, MauMau <maumau307@gmail.com> wrote: > FYI, the following stack traces are the ones obtained during two instances > of hang. You mentioned a hang during a B-Tree insert operation - do you happen to have a backtrace that relates to that? -- Peter Geoghegan
From: "Peter Geoghegan" <pg@heroku.com> > You mentioned a hang during a B-Tree insert operation - do you happen > to have a backtrace that relates to that? Sorry, I may have misunderstood. The three stack traces I attached are not related to btree. I recall that I saw one stack trace containing bt_insert(), but I'm not sure. When the hang occurred, INSERT/UPDATE/COMMIT statements stopped for at least one hour, while SELECT statements ran smoothly. Regards MauMau
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Heikki Linnakangas
Date:
On 02/17/2014 10:36 PM, Andres Freund wrote: > On 2014-02-17 22:30:54 +0200, Heikki Linnakangas wrote: >> This is what I came up with. I like it, I didn't have to contort lwlocks as >> much as I feared. I added one field to LWLock structure, which is used to >> store the position of how far a WAL inserter has progressed. The LWLock code >> calls it just "value", without caring what's stored in it, and it's used by >> new functions LWLockWait and LWLockWakeup to implement the behavior the WAL >> insertion slots have, to wake up other processes waiting for the slot >> without releasing it. >> >> This passes regression tests, but I'll have to re-run the performance tests >> with this. One worry is that if the padded size of the LWLock struct is >> smaller than cache line, neighboring WAL insertion locks will compete for >> the cache line. Another worry is that since I added a field to LWLock >> struct, it might now take 64 bytes on platforms where it used to be 32 bytes >> before. That wastes some memory. > > Why don't you allocate them in a separate tranche, from xlog.c? Then you > can store them inside whatever bigger object you want, guaranteeing > exactly the alignment you need. possibly you even can have the extra > value in the enclosing object? Good idea. New patch attached, doing that. I'll try to find time on some multi-CPU hardware to performance test this against current master, to make sure there's no regression. - Heikki
Attachment
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Heikki Linnakangas
Date:
On 02/18/2014 09:23 PM, Heikki Linnakangas wrote: > On 02/17/2014 10:36 PM, Andres Freund wrote: >> On 2014-02-17 22:30:54 +0200, Heikki Linnakangas wrote: >>> This is what I came up with. I like it, I didn't have to contort lwlocks as >>> much as I feared. I added one field to LWLock structure, which is used to >>> store the position of how far a WAL inserter has progressed. The LWLock code >>> calls it just "value", without caring what's stored in it, and it's used by >>> new functions LWLockWait and LWLockWakeup to implement the behavior the WAL >>> insertion slots have, to wake up other processes waiting for the slot >>> without releasing it. >>> >>> This passes regression tests, but I'll have to re-run the performance tests >>> with this. One worry is that if the padded size of the LWLock struct is >>> smaller than cache line, neighboring WAL insertion locks will compete for >>> the cache line. Another worry is that since I added a field to LWLock >>> struct, it might now take 64 bytes on platforms where it used to be 32 bytes >>> before. That wastes some memory. >> >> Why don't you allocate them in a separate tranche, from xlog.c? Then you >> can store them inside whatever bigger object you want, guaranteeing >> exactly the alignment you need. Possibly you can even have the extra >> value in the enclosing object? > > Good idea. New patch attached, doing that. > > I'll try to find time on some multi-CPU hardware to performance test > this against current master, to make sure there's no regression. Ok, I ran the same tests I used earlier for the xloginsert scaling patch, with REL9_3_STABLE, current master, and the patch to refactor the xloginsert slots into lwlocks. The main question I was trying to answer was: Is the new patch similar in performance to current master? The short answer is "yes". In some tests there were significant differences, but overall I couldn't say which one was better. The test case I used is pgbench with a custom script using a backend extension called "xlogtest", which just does a bunch of XLogInserts of dummy WAL records of a given size. Small and large WAL records have quite different characteristics in how they stress the xlog machinery. I used three different WAL record sizes: 20, 100 and 8000 bytes, excluding the WAL record header. 20 bytes is pretty much the minimum size of a realistic WAL record, for something like a heap deletion. 100 bytes would be typical of an insert or update record, while 8000 bytes would be a full-page write or b-tree page split. The number of such inserts done per transaction was scaled so that each transaction inserts about 100000 bytes in total. That's quite a lot, but with shorter transactions you easily get bottlenecked by other things like the ProcArrayLock, and the point of this test was to exercise WAL insertions. I ran the tests on three different machines: my laptop with 4 cores (8 logical cores with hyperthreading) and an SSD disk, a virtual machine running on a host with 32 cores (no other VMs running) with some kind of external storage (I don't know the details), and Nathan Boley's 64-core AMD box (thanks Nate for lending it again!). On the AMD box, I ran the tests twice, once with the data directory on disk, and once in /dev/shm. The results vary. Overall, both git master and the patched version perform similarly, and at least as well as REL9_3_STABLE.
There are a few clear exceptions to that: on Nathan's box with the data directory on disk, the patched version performs much better than either git master or REL9_3_STABLE, with 20 byte payload. And on my laptop, with 20 byte payload, git master performs somewhat better than the patched version, but still better than REL9_3_STABLE, except when running with a single client. I collected the summary graphs of all the tests here (you can click the graphs for the detailed pgbench-tools result pages): http://hlinnaka.iki.fi/xlogslot-to-lwlock-results/ Some caveats on the test methodology: 1. I didn't run the tests for the same duration on all the different machines. The test duration on the AMD box was 20 seconds for the disk-based tests and 30 seconds for the RAM-disk based tests. On the 32-core VM and my laptop, the test duration was 60 seconds. So you cannot compare the tests on different hardware directly. 2. All those test durations were pretty short. That means that the TPS number in any individual test result is quite noisy, and you should look at the general shapes of the graphs instead of individual points. 3. The number of checkpoints during the tests varied, which again creates a lot of noise in the individual points. 4. In the last test in the series, on the 64-core AMD box with the data dir in a RAM drive, the patched test with 64 clients deadlocked. I tracked it down to a bug in the patch, in how the insertion's progress is reported when holding the WALInsertLocks in exclusive mode, i.e. when holding them all, when starting a checkpoint. The exclusive lock is held so seldom that I have no reason to believe that it affects the performance, but nevertheless the patch I tested was not 100% identical to the fixed version attached. That explains the apparent dip in performance with 64 clients with the patched version. So there are some unexplained differences there, but based on these results, I'm still OK with committing the patch. - Heikki
Attachment
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-03-07 17:54:32 +0200, Heikki Linnakangas wrote: > So there are some unexplained differences there, but based on these results, > I'm still OK with committing the patch. So, I am looking at this right now. I think there are some minor things I'd like to see addressed: 1) I think there needs to be a good-sized comment explaining why WaitXLogInsertionsToFinish() isn't racy due to the unlocked read at the beginning of LWLockWait(). I think it's safe because we're reading Insert->CurrBytePos inside a spinlock, and it will only ever increment. As SpinLockAcquire() has to be a read barrier we can assume that every skewed read in LWLockWait() will be for a lock protecting a newer insertingAt? 2) I am not particularly happy about the LWLockWait()/LWLockWakeup() function names. They sound too much like a part of the normal lwlock implementation to me. But admittedly I don't have a great idea for a better naming scheme. Maybe LWLockWaitForVar(), LWLockWakeupVarWaiter()? 3) I am the wrong one to complain, I know, but the comments above struct WALInsertLock are pretty hard to read because of the sentence structure. 4) WALInsertLockAcquire() needs to comment on acquiring/waking all but the last slot. Generally the trick of exclusive xlog insertion lock acquisition only really using the last lock could use a bit more docs. 5) WALInsertLockRelease() comments on the reset of insertingAt being optional, but I am not convinced that that's true anymore. If an exclusive acquisition isn't seen as 0 or INT64CONST(0xFFFFFFFFFFFFFFFF) by another backend we're in trouble, right? Absolutely not sure without thinking on it for longer than I can concentrate right now. 6) Pretty minor, but from a style POV it seems nicer to separate exclusive/nonexclusive out of WALInsertLockAcquire(). The cases don't share any code now. A patch containing some trivial changes is attached... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Heikki Linnakangas
Date:
On 03/12/2014 09:29 PM, Andres Freund wrote: > On 2014-03-07 17:54:32 +0200, Heikki Linnakangas wrote: >> So there are some unexplained differences there, but based on these results, >> I'm still OK with committing the patch. > > So, I am looking at this right now. > > I think there are some minor things I'd like to see addressed: > > 1) I think there needs to be a good-sized comment explaining why > WaitXLogInsertionsToFinish() isn't racy due to the unlocked read at > the beginning of LWLockWait(). There's a comment inside LWLockWait(). I think that's the right place for it; it's LWLockWait() that's cheating by not acquiring the spinlock before reading lock->exclusive. > I think it's safe because we're > reading Insert->CurrBytePos inside a spinlock, and it will only ever > increment. As SpinLockAcquire() has to be a read barrier we can > assume that every skewed read in LWLockWait() will be for a lock > protecting a newer insertingAt? Right. If the quick test in the beginning of LWLockWait() returns 'true', even though the lock was *just* acquired by a different backend, it nevertheless must've been free some time after we read CurrBytePos in WaitXLogInsertionsToFinish(), and that's good enough. > 2) I am not particularly happy about the LWLockWait()/LWLockWakeup() > function names. They sound too much like a part of the normal lwlock > implementation to me. But admittedly I don't have a great idea for > a better naming scheme. Maybe LWLockWaitForVar(), > LWLockWakeupVarWaiter()? Works for me. > 3) I am the wrong one to complain, I know, but the comments above struct > WALInsertLock are pretty hard to read because of the sentence structure. Hmm, ok. I reworded that, I hope it's more clear now. > 4) WALInsertLockAcquire() needs to comment on acquiring/waking all but > the last slot. Generally the trick of exclusive xlog insertion lock > acquisition only really using the last lock could use a bit more > docs. Added. > 5) WALInsertLockRelease() comments on the reset of insertingAt being > optional, but I am not convinced that that's true anymore. If an > exclusive acquisition isn't seen as 0 or > INT64CONST(0xFFFFFFFFFFFFFFFF) by another backend we're in trouble, > right? Absolutely not sure without thinking on it for longer than I > can concentrate right now. Hmm, right, it isn't optional when resetting the 0xFFFFFFFFFFFFFFFF value back to zero. Now that I look at it, even resetting it to zero in the normal, non-exclusive case is more fiddly than it seems at first glance (although still correct). We do this: 0. Acquire the WAL insertion lock, get insert position 1. Copy the WAL data to the shared buffer 2. Set insertingAt = 0 3. Release the lock. However, nothing stops the compiler (or CPU on weak-memory-order architectures) from rearranging the operations like this: 0. Acquire the WAL insertion lock, get insert position 1. Set insertingAt = 0 2. Copy the WAL data to the shared buffer 3. Release the lock. Furthermore, setting the insertingAt value might be "torn" if a 64-bit store is not atomic. That would be a problem if a backend saw the torn value, and incorrectly determined that it doesn't need to wait for it. (If the compiler didn't reorder steps 1 and 2, that would be OK because by the time we reset insertingAt, we have already copied the WAL data.) We're saved by the fact that resetting insertingAt to 0 never moves the value forwards, only backwards, even if someone sees a torn value. If we used some magic value other than 0 to mean "uninitialized", we would have trouble.
But that's way more fiddly than I'd like, so let's make that more robust. What we really ought to do is to initialize insertingAt inside LWLockAcquire (or LWLockRelease), while we're holding the lwlock's spinlock. The only reason I didn't do that was to avoid having another copy of LWLockAcquire, with the 'var', but maybe that was penny-wise and pound-foolish. New patch version attached. I added a new variant of LWLockAcquire, called LWLockAcquireWithVar, to reset the variable atomically when the lock is acquired. LWLockAcquire and the new function now just call an internal function that implements both, but I'm now slightly worried if that might hurt performance of LWLockAcquire in general. The extra indirection through the function call shouldn't add much overhead, but LWLockAcquire is called so frequently that every cycle counts. - Heikki
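A sketch of what "initialize insertingAt while holding the lwlock's spinlock" means in practice (simplified, with assumed names; the attached patch is authoritative): the protected variable is reset inside the same spinlock-protected section that grants the lock, so no other backend can observe a stale or torn value once the lock is held:

    SpinLockAcquire(&lock->mutex);
    if (!lock->exclusive && lock->shared == 0)
    {
        lock->exclusive++;          /* grant the lock */
        if (valptr)
            *valptr = val;          /* e.g. reset insertingAt to 0 under the mutex */
        mustwait = false;
    }
    else
        mustwait = true;
    SpinLockRelease(&lock->mutex);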
Attachment
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
Hi, I see you've committed this, cool. Sorry for not getting back to the topic earlier.. On 2014-03-13 22:44:03 +0200, Heikki Linnakangas wrote: > On 03/12/2014 09:29 PM, Andres Freund wrote: > >On 2014-03-07 17:54:32 +0200, Heikki Linnakangas wrote: > >>So there are some unexplained differences there, but based on these results, > >>I'm still OK with committing the patch. > > > >So, I am looking at this right now. > > > >I think there are some minor things I'd like to see addressed: > > > >1) I think there needs to be a good-sized comment explaining why > > WaitXLogInsertionsToFinish() isn't racy due to the unlocked read at > > the beginning of LWLockWait(). > > There's a comment inside LWLockWait(). I think that's the right place for > it; it's LWLockWait() that's cheating by not acquiring the spinlock before > reading lock->exclusive. I don't find that argument convincing. After all, it's only correct because the API user does things in a particular way. So there should be a comment at the callsite to make sure that's not changed. > >3) I am the wrong one to complain, I know, but the comments above struct > > WALInsertLock are pretty hard to read because of the sentence structure. > > Hmm, ok. I reworded that, I hope it's more clear now. Yes, it is. The committed version doesn't compile with LWLOCK_STATS... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
From
Andres Freund
Date:
On 2014-03-21 22:52:33 +0100, Andres Freund wrote: > The committed version doesn't compile with LWLOCK_STATS... Just noticed that it seems to also break the dtrace stuff: http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-03-21%2018%3A04%3A00 Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services