Thread: semtimedop instead of setitimer/semop/setitimer
I've noticed that postgres strace output contains long groups of setitimer/semop/setitimer.

Just FYI: semtimedop is a special syscall that implements a semop with a timeout. It was added precisely to avoid the surrounding setitimer calls. I know that it's supported by Solaris and recent Linux versions; I'm not sure about other operating systems.

Has anyone tried to use it? Oracle pushed it into Linux, so it seems to be worth the effort:

http://www.ussg.iu.edu/hypermail/linux/kernel/0211.3/0485.html

-- Manfred
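For concreteness, here is a minimal sketch of the two patterns (illustrative code only, not taken from the PostgreSQL sources; it assumes a SIGALRM handler has been installed elsewhere so that the interrupted semop returns with EINTR):

#define _GNU_SOURCE             /* for semtimedop on Linux */
#include <sys/sem.h>
#include <sys/time.h>
#include <time.h>

/* Old pattern, three syscalls: arm an interval timer, block in
 * semop (SIGALRM interrupts it with EINTR on timeout), disarm. */
static int
sem_wait_itimer(int semid, struct sembuf *sop, long timeout_sec)
{
    struct itimerval arm = { { 0, 0 }, { timeout_sec, 0 } };
    struct itimerval off = { { 0, 0 }, { 0, 0 } };
    int rc;

    setitimer(ITIMER_REAL, &arm, NULL);     /* syscall 1 */
    rc = semop(semid, sop, 1);              /* syscall 2 */
    setitimer(ITIMER_REAL, &off, NULL);     /* syscall 3 */
    return rc;
}

/* New pattern, one syscall: the kernel handles the timeout and
 * semtimedop fails with errno = EAGAIN if it expires. */
static int
sem_wait_timed(int semid, struct sembuf *sop, long timeout_sec)
{
    struct timespec ts = { timeout_sec, 0 };

    return semtimedop(semid, sop, 1, &ts);
}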
Manfred Spraul <manfred@colorfullife.com> writes:
> I've noticed that postgres strace output contains long groups of
> setitimer/semop/setitimer.
> Just FYI: semtimedop is a special syscall that implements a semop with
> a timeout. It was added precisely to avoid the surrounding setitimer
> calls.
> I know that it's supported by Solaris and recent Linux versions; I'm
> not sure about other operating systems.

I am ;-). We could not rely on using it.

> Has anyone tried to use it?

It would require a fairly messy crossing of platform-dependent with platform-independent code. Can we see some proof that there's a useful speedup possible before we think about this?

AFAIK, semops are not done unless we actually have to yield the processor, so saving a syscall or two in that path doesn't sound like a big win. I'd be more interested in asking why you're seeing long series of semops in the first place.

			regards, tom lane
Tom Lane wrote:

>AFAIK, semops are not done unless we actually have to yield the
>processor, so saving a syscall or two in that path doesn't sound like a
>big win. I'd be more interested in asking why you're seeing long series
>of semops in the first place.

Virtually all semops yield the processor; that part works. I couldn't figure out what exactly causes the long series of semops, so I tried to track it down (with LOCK_DEBUG enabled):

- postgres 7.3.3
- pgbench -c 30 -t 300
- database stored on a ramdisk (laptop disks are just too slow)

The long series of semops are caused by lots of processes trying to acquire a lock that is held exclusively by another process. Something like:

* 10 processes are waiting for a ShareLock on lock A (c568c). One of them already owns an ExclusiveLock on lock B (c91b4).
* Everyone receives the shared lock A, does something, drops it.
* Then the 9 remaining processes try to acquire a ShareLock on lock B, and go to sleep.

Is there a simple way to figure out what lock c91b4 is?

Here is the log. I've added getpid() to the elog calls and overridden LOCK_DEBUG_ENABLED so that everything is always written out. Additionally, I've printed the caller address for LockAcquire.

<<<<<<<<< Process 29420 acquires a lock exclusively:
LockAcquire for pid 29420 called by 0x81147d6 (XactLockTableInsert)
LockAcquire: new: 29420 lock(c91b4) tbl(1) rel(376) db(0) obj(1439) grantMask(0) req(0,0,0,0,0,0,0)=0 grant(0,0,0,0,0,0,0)=0 wait(0) type(ExclusiveLock)
LockAcquire: new: 29420 holder(c95e8) lock(c91b4) tbl(1) proc(a47b0) xid(1439) hold(0,0,0,0,0,0,0)=0
LockCheckConflicts: no conflict: 29420 holder(c95e8) lock(c91b4) tbl(1) proc(a47b0) xid(1439) hold(0,0,0,0,0,0,0)=0
GrantLock: 29420 lock(c91b4) tbl(1) rel(376) db(0) obj(1439) grantMask(80) req(0,0,0,0,0,0,1)=1 grant(0,0,0,0,0,0,1)=1 wait(0) type(ExclusiveLock)
[snip]
<<<<< Process 29420 acquires another lock shared, goes to sleep:
LockAcquire for pid 29420 called by 0x811484a (XactLockTableWait)
LockAcquire: found: 29420 lock(c568c) tbl(1) rel(376) db(0) obj(1421) grantMask(80) req(0,0,0,0,2,0,1)=3 grant(0,0,0,0,0,0,1)=1 wait(2) type(ShareLock)
LockAcquire: new: 29420 holder(c62c0) lock(c568c) tbl(1) proc(a47b0) xid(1439) hold(0,0,0,0,0,0,0)=0
LockCheckConflicts: conflicting: 29420 holder(c62c0) lock(c568c) tbl(1) proc(a47b0) xid(1439) hold(0,0,0,0,0,0,0)=0
WaitOnLock: sleeping on lock: 29420 lock(c568c) tbl(1) rel(376) db(0) obj(1421) grantMask(80) req(0,0,0,0,3,0,1)=4 grant(0,0,0,0,0,0,1)=1 wait(2) type(ShareLock)
ProcSleep from 0x8115763, pid 29420, proc 0xbf2f57b0 for 0xbf31668c, mode 5.
<<<<< omitted: several other processes sleep on the same lock.
<<<<< omitted: LockReleaseAll grants the lock to everyone that was sleeping on c568c.
<<<< For several processes:
LOG: ProcSleep from 0x8115763, pid 29436, proc 0xbf2f52f0 for 0xbf31668c done.
LOG: WaitOnLock: wakeup on lock: 29436 lock(c568c) tbl(1) rel(376) db(0) obj(1421) grantMask(20) req(0,0,0,0,3,0,0)=3 grant(0,0,0,0,3,0,0)=3 wait(0) type(ShareLock)
LOG: LockAcquire: granted: 29436 holder(c6274) lock(c568c) tbl(1) proc(a42f0) xid(1446) hold(0,0,0,0,1,0,0)=1
LOG: LockAcquire: granted: 29436 lock(c568c) tbl(1) rel(376) db(0) obj(1421) grantMask(20) req(0,0,0,0,3,0,0)=3 grant(0,0,0,0,3,0,0)=3 wait(0) type(ShareLock)
LOG: LockRelease: found: 29436 lock(c568c) tbl(1) rel(376) db(0) obj(1421) grantMask(20) req(0,0,0,0,3,0,0)=3 grant(0,0,0,0,3,0,0)=3 wait(0) type(ShareLock)
LOG: LockRelease: found: 29436 holder(c6274) lock(c568c) tbl(1) proc(a42f0) xid(1446) hold(0,0,0,0,1,0,0)=1
LOG: LockRelease: updated: 29436 lock(c568c) tbl(1) rel(376) db(0) obj(1421) grantMask(20) req(0,0,0,0,2,0,0)=2 grant(0,0,0,0,2,0,0)=2 wait(0) type(ShareLock)
LOG: LockRelease: updated: 29436 holder(c6274) lock(c568c) tbl(1) proc(a42f0) xid(1446) hold(0,0,0,0,0,0,0)=0
LOG: LockRelease: deleting: 29436 holder(c6274) lock(c568c) tbl(1) proc(a42f0) xid(1446) hold(0,0,0,0,0,0,0)=0
LOG: LockAcquire for pid 29436 called by 0x811484a (XactLockTableWait)
LOG: LockAcquire: found: 29436 lock(c91b4) tbl(1) rel(376) db(0) obj(1439) grantMask(80) req(0,0,0,0,2,0,1)=3 grant(0,0,0,0,0,0,1)=1 wait(2) type(ShareLock)
LOG: LockAcquire: new: 29436 holder(c6274) lock(c91b4) tbl(1) proc(a42f0) xid(1446) hold(0,0,0,0,0,0,0)=0
LOG: LockCheckConflicts: conflicting: 29436 holder(c6274) lock(c91b4) tbl(1) proc(a42f0) xid(1446) hold(0,0,0,0,0,0,0)=0
LOG: WaitOnLock: sleeping on lock: 29436 lock(c91b4) tbl(1) rel(376) db(0) obj(1439) grantMask(80) req(0,0,0,0,3,0,1)=4 grant(0,0,0,0,0,0,1)=1 wait(2) type(ShareLock)
LOG: ProcSleep from 0x8115763, pid 29436, proc 0xbf2f52f0 for 0xbf31a1b4, mode 5.
<<<<<<<<<<<

Hmm. The initial exclusive lock is from XactLockTableInsert, and the ShareLock waits are from XactLockTableWait. Is everyone waiting on the same transaction's lock entry?

I've uploaded a larger part (500 kB) of the log to http://www.colorfullife.com/~manfred/sql-log.gz

-- Manfred
Manfred Spraul <manfred@colorfullife.com> writes:
> Tom Lane wrote:
>> I'd be more interested in asking why you're seeing long series
>> of semops in the first place.

> I couldn't figure out what exactly causes the long series of semops,
> so I tried to track it down (with LOCK_DEBUG enabled):
> - postgres 7.3.3
> - pgbench -c 30 -t 300

Oh, pgbench ;-). Are you aware that you need a "scale factor" (-s) larger than the number of clients to avoid unreasonable levels of contention in pgbench? For example, with -s = 1 there is only one row in the "branches" table, and *every* transaction will want to update that row. So you get scenarios with multiple transactions blocked waiting for the guy who has managed to lock the row, and when he commits they are all released. One of them succeeds in locking the updated row, and the others all promptly start to wait for *him*. Etc.

I don't think this level of contention is common in real apps, but in pgbench with small -s it's a major factor. We could avoid this spinning if we had real per-row locks (so that procs wouldn't get awoken till they actually had lock on the desired row), but I see no very practical way to do that.

			regards, tom lane
Tom Lane wrote:

>Oh, pgbench ;-). Are you aware that you need a "scale factor" (-s)
>larger than the number of clients to avoid unreasonable levels of
>contention in pgbench?

No. What about adding a few reasonable examples to the README? I've switched to "pgbench -c 10 -s 11 -t 1000 test". Is that ok? Now the semop calls are virtually gone. That leaves the question of why SysV semaphores showed up high in the dbt2 benchmarks, but that's another question.

I'm back to my original idea: align the data buffers to speed up the user space/kernel space transfers. It looks good (with/without connection):

before:
105.031776 / 105.093682
105.201246 / 105.260008

after aligning:
112.664320 / 112.730542
111.031901 / 111.098496
111.685869 / 111.751130

Tested with 7.3.4. Initially I tried to increase MAX_ALIGNOF to 16, but the result didn't work; pgbench failed with:

<<<
ERROR: CREATE DATABASE cannot be executed from a function
createdb: database creation failed
<<<

For my test I've manually edited shmem and aligned all allocations to 16-byte offsets. I'll try to compile the 7.4 CVS tree; probably someone made wrong assumptions about the alignment values.

-- Manfred
Manfred Spraul <manfred@colorfullife.com> writes:
> ... Initially I tried to increase MAX_ALIGNOF to 16, but
> the result didn't work:

You would need to do a full recompile and initdb to alter MAX_ALIGNOF.

However, if you are wanting to raise it past about 8, that's probably not the way to go anyway; it would create padding wastage in too many places. It would make more sense to allocate the buffers using a variant of ShmemAlloc that could be told to align this particular object on an N-byte boundary. Then it costs you no more than N bytes in the one place.

(BTW, I wonder whether there would be any win in allocating the buffers on a 4K or 8K page boundary... do any kernels use virtual memory mapping tricks to replace data copying in such cases?)

			regards, tom lane
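To make the idea concrete, here is a minimal sketch of what such a variant might look like (ShmemAllocAligned is an invented name, and the ShmemAlloc declaration is only a stand-in, not the exact PostgreSQL signature): over-allocate by the alignment and round the result up.

#include <stddef.h>
#include <stdint.h>

extern void *ShmemAlloc(size_t size);   /* existing allocator, assumed */

/* Allocate 'size' bytes of shared memory aligned on an 'align'-byte
 * boundary (align must be a power of two). Wastes at most align-1
 * bytes, once, instead of padding every allocation. */
void *
ShmemAllocAligned(size_t size, size_t align)
{
    uintptr_t p = (uintptr_t) ShmemAlloc(size + align - 1);

    if (p == 0)
        return NULL;
    return (void *) ((p + align - 1) & ~(uintptr_t) (align - 1));
}

The shared disk buffers could then be requested once as, say, ShmemAllocAligned(NBuffers * BLCKSZ, 16) -- or with align = 4096 to try the page-boundary idea -- paying the padding cost a single time for the whole array.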
Tom Lane wrote:

>Manfred Spraul <manfred@colorfullife.com> writes:
>> ... Initially I tried to increase MAX_ALIGNOF to 16, but
>> the result didn't work:

>You would need to do a full recompile and initdb to alter MAX_ALIGNOF.

I think I did that, but it still failed. 7.4cvs works, so I'll ignore it. MAX_ALIGNOF affects the on-disk format, correct? Then I agree that it's the wrong thing to change.

>However, if you are wanting to raise it past about 8, that's probably
>not the way to go anyway; it would create padding wastage in too many
>places. It would make more sense to allocate the buffers using a
>variant of ShmemAlloc that could be told to align this particular object
>on an N-byte boundary. Then it costs you no more than N bytes in the
>one place.

I agree; I'll write a patch.

>(BTW, I wonder whether there would be any win in allocating the buffers
>on a 4K or 8K page boundary... do any kernels use virtual memory mapping
>tricks to replace data copying in such cases?)

Linux doesn't. Page table games are considered evil, because TLB flushing is expensive, especially on SMP.

-- Manfred
Manfred Spraul <manfred@colorfullife.com> writes:
> MAX_ALIGNOF affects the on-disk format, correct?

Right, it could affect placement of fields on-disk. I was thinking we could change it as an easy test, but maybe not ...

If you set up the shared buffers at an appropriate offset, that should get most of the possible win from aligning I/O transfers, but not all. It would also be worth looking at the local buffers (see src/backend/storage/buffer/localbuf.c). The trouble here is, if malloc returns something that's only 4-byte aligned, aren't we stuck?

There might also be some joy in making src/backend/storage/file/buffile.c align its buffers suitably, but again the issue is controlling malloc.

			regards, tom lane
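For what it's worth, the same over-allocate-and-round-up trick works for malloc'd buffers, so a weakly aligned malloc need not be a dead end. A sketch (illustrative only; the helper name is invented, and the caller must free the raw pointer, not the aligned one):

#include <stdint.h>
#include <stdlib.h>

/* Return a pointer into a malloc'd block aligned on an 'align'-byte
 * boundary (align must be a power of two). The raw pointer is passed
 * back through *raw_out so the block can be freed later. */
void *
malloc_aligned(size_t size, size_t align, void **raw_out)
{
    void *raw = malloc(size + align - 1);

    if (raw == NULL)
        return NULL;
    *raw_out = raw;
    return (void *) (((uintptr_t) raw + align - 1) & ~(uintptr_t) (align - 1));
}

posix_memalign would do this directly on platforms that have it; presumably its limited portability is exactly the "controlling malloc" problem.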