Thread: semtimedop instead of setitimer/semop/setitimer
I've noticed that postgres strace output contains long groups of setitimer/semop/setitimer.

Just FYI: semtimedop is a special syscall that implements a semop with a timeout. It was added precisely to avoid the surrounding setitimer calls. I know that it's supported by Solaris and recent Linux versions; I'm not sure about other operating systems.

Has anyone tried to use it? Oracle pushed it into Linux, so it seems to be worth the effort:

http://www.ussg.iu.edu/hypermail/linux/kernel/0211.3/0485.html

-- Manfred
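For concreteness, here is a minimal sketch of the two patterns (illustrative code only, not taken from the PostgreSQL sources; it assumes a SIGALRM handler has been installed elsewhere so that the interrupted semop returns with EINTR):

#define _GNU_SOURCE             /* for semtimedop on Linux */
#include <sys/sem.h>
#include <sys/time.h>
#include <time.h>

/* Old pattern, three syscalls: arm an interval timer, block in
 * semop (SIGALRM interrupts it with EINTR on timeout), disarm. */
static int
sem_wait_itimer(int semid, struct sembuf *sop, long timeout_sec)
{
    struct itimerval arm = { { 0, 0 }, { timeout_sec, 0 } };
    struct itimerval off = { { 0, 0 }, { 0, 0 } };
    int rc;

    setitimer(ITIMER_REAL, &arm, NULL);     /* syscall 1 */
    rc = semop(semid, sop, 1);              /* syscall 2 */
    setitimer(ITIMER_REAL, &off, NULL);     /* syscall 3 */
    return rc;
}

/* New pattern, one syscall: the kernel handles the timeout and
 * semtimedop fails with errno = EAGAIN if it expires. */
static int
sem_wait_timed(int semid, struct sembuf *sop, long timeout_sec)
{
    struct timespec ts = { timeout_sec, 0 };

    return semtimedop(semid, sop, 1, &ts);
}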
Manfred Spraul <manfred@colorfullife.com> writes:
> I've noticed that postgres strace output contains long groups of
> setitimer/semop/setitimer.
> Just FYI: semtimedop is a special syscall that implements a semop with
> a timeout. It was added precisely to avoid the surrounding setitimer
> calls.
> I know that it's supported by Solaris and recent Linux versions; I'm
> not sure about other operating systems.

I am ;-). We could not rely on using it.

> Has anyone tried to use it?

It would require a fairly messy crossing of platform-dependent with platform-independent code. Can we see some proof that there's a useful speedup possible before we think about this?

AFAIK, semops are not done unless we actually have to yield the processor, so saving a syscall or two in that path doesn't sound like a big win. I'd be more interested in asking why you're seeing long series of semops in the first place.

			regards, tom lane
Tom Lane wrote:

>AFAIK, semops are not done unless we actually have to yield the
>processor, so saving a syscall or two in that path doesn't sound like a
>big win. I'd be more interested in asking why you're seeing long series
>of semops in the first place.

Virtually all semops yield the processor; that part works. I couldn't figure out what exactly causes the long series of semops, so I tried to track it down (with LOCK_DEBUG enabled):

- postgres 7.3.3
- pgbench -c 30 -t 300
- database stored on a ramdisk (laptop disks are just too slow)

The long series of semops are caused by lots of processes trying to acquire a lock that is held exclusively by another process. Something like:

* 10 processes are waiting for a ShareLock on lock A (c568c). One of them already owns an ExclusiveLock on lock B (c91b4).
* Everyone receives the shared lock A, does something, drops it.
* Then the 9 remaining processes try to acquire a ShareLock on lock B, and go to sleep.

Is there a simple way to figure out what lock c91b4 is?

Here is the log. I've added getpid() to the elog calls and overridden LOCK_DEBUG_ENABLED so that everything is always written out. Additionally, I've printed the caller address for LockAcquire.

<<<<<<<<< Process 29420 acquires a lock exclusively:
LockAcquire for pid 29420 called by 0x81147d6 (XactLockTableInsert)
LockAcquire: new: 29420 lock(c91b4) tbl(1) rel(376) db(0) obj(1439) grantMask(0) req(0,0,0,0,0,0,0)=0 grant(0,0,0,0,0,0,0)=0 wait(0) type(ExclusiveLock)
LockAcquire: new: 29420 holder(c95e8) lock(c91b4) tbl(1) proc(a47b0) xid(1439) hold(0,0,0,0,0,0,0)=0
LockCheckConflicts: no conflict: 29420 holder(c95e8) lock(c91b4) tbl(1) proc(a47b0) xid(1439) hold(0,0,0,0,0,0,0)=0
GrantLock: 29420 lock(c91b4) tbl(1) rel(376) db(0) obj(1439) grantMask(80) req(0,0,0,0,0,0,1)=1 grant(0,0,0,0,0,0,1)=1 wait(0) type(ExclusiveLock)
[snip]
<<<<< Process 29420 acquires another lock shared, goes to sleep:
LockAcquire for pid 29420 called by 0x811484a (XactLockTableWait)
LockAcquire: found: 29420 lock(c568c) tbl(1) rel(376) db(0) obj(1421) grantMask(80) req(0,0,0,0,2,0,1)=3 grant(0,0,0,0,0,0,1)=1 wait(2) type(ShareLock)
LockAcquire: new: 29420 holder(c62c0) lock(c568c) tbl(1) proc(a47b0) xid(1439) hold(0,0,0,0,0,0,0)=0
LockCheckConflicts: conflicting: 29420 holder(c62c0) lock(c568c) tbl(1) proc(a47b0) xid(1439) hold(0,0,0,0,0,0,0)=0
WaitOnLock: sleeping on lock: 29420 lock(c568c) tbl(1) rel(376) db(0) obj(1421) grantMask(80) req(0,0,0,0,3,0,1)=4 grant(0,0,0,0,0,0,1)=1 wait(2) type(ShareLock)
ProcSleep from 0x8115763, pid 29420, proc 0xbf2f57b0 for 0xbf31668c, mode 5.
<<<<< omitted: several other processes sleep on the same lock.
<<<<< omitted: LockReleaseAll grants the lock to everyone that was sleeping on c568c.
<<<< For several processes:
LOG: ProcSleep from 0x8115763, pid 29436, proc 0xbf2f52f0 for 0xbf31668c done.
LOG: WaitOnLock: wakeup on lock: 29436 lock(c568c) tbl(1) rel(376) db(0) obj(1421) grantMask(20) req(0,0,0,0,3,0,0)=3 grant(0,0,0,0,3,0,0)=3 wait(0) type(ShareLock)
LOG: LockAcquire: granted: 29436 holder(c6274) lock(c568c) tbl(1) proc(a42f0) xid(1446) hold(0,0,0,0,1,0,0)=1
LOG: LockAcquire: granted: 29436 lock(c568c) tbl(1) rel(376) db(0) obj(1421) grantMask(20) req(0,0,0,0,3,0,0)=3 grant(0,0,0,0,3,0,0)=3 wait(0) type(ShareLock)
LOG: LockRelease: found: 29436 lock(c568c) tbl(1) rel(376) db(0) obj(1421) grantMask(20) req(0,0,0,0,3,0,0)=3 grant(0,0,0,0,3,0,0)=3 wait(0) type(ShareLock)
LOG: LockRelease: found: 29436 holder(c6274) lock(c568c) tbl(1) proc(a42f0) xid(1446) hold(0,0,0,0,1,0,0)=1
LOG: LockRelease: updated: 29436 lock(c568c) tbl(1) rel(376) db(0) obj(1421) grantMask(20) req(0,0,0,0,2,0,0)=2 grant(0,0,0,0,2,0,0)=2 wait(0) type(ShareLock)
LOG: LockRelease: updated: 29436 holder(c6274) lock(c568c) tbl(1) proc(a42f0) xid(1446) hold(0,0,0,0,0,0,0)=0
LOG: LockRelease: deleting: 29436 holder(c6274) lock(c568c) tbl(1) proc(a42f0) xid(1446) hold(0,0,0,0,0,0,0)=0
LOG: LockAcquire for pid 29436 called by 0x811484a (XactLockTableWait)
LOG: LockAcquire: found: 29436 lock(c91b4) tbl(1) rel(376) db(0) obj(1439) grantMask(80) req(0,0,0,0,2,0,1)=3 grant(0,0,0,0,0,0,1)=1 wait(2) type(ShareLock)
LOG: LockAcquire: new: 29436 holder(c6274) lock(c91b4) tbl(1) proc(a42f0) xid(1446) hold(0,0,0,0,0,0,0)=0
LOG: LockCheckConflicts: conflicting: 29436 holder(c6274) lock(c91b4) tbl(1) proc(a42f0) xid(1446) hold(0,0,0,0,0,0,0)=0
LOG: WaitOnLock: sleeping on lock: 29436 lock(c91b4) tbl(1) rel(376) db(0) obj(1439) grantMask(80) req(0,0,0,0,3,0,1)=4 grant(0,0,0,0,0,0,1)=1 wait(2) type(ShareLock)
LOG: ProcSleep from 0x8115763, pid 29436, proc 0xbf2f52f0 for 0xbf31a1b4, mode 5.
<<<<<<<<<<<

Hmm. The initial exclusive lock is from XactLockTableInsert, and the ShareLock waits are from XactLockTableWait. Is everyone waiting on the same transaction's lock entry?

I've uploaded a larger part (500 kB) of the log to http://www.colorfullife.com/~manfred/sql-log.gz

-- Manfred
Manfred Spraul <manfred@colorfullife.com> writes:
> Tom Lane wrote:
>> I'd be more interested in asking why you're seeing long series
>> of semops in the first place.

> I couldn't figure out what exactly causes the long series of semops,
> so I tried to track it down (with LOCK_DEBUG enabled):
> - postgres 7.3.3
> - pgbench -c 30 -t 300

Oh, pgbench ;-). Are you aware that you need a "scale factor" (-s) larger than the number of clients to avoid unreasonable levels of contention in pgbench? For example, with -s = 1 there is only one row in the "branches" table, and *every* transaction will want to update that row. So you get scenarios with multiple transactions blocked waiting for the guy who has managed to lock the row, and when he commits they are all released. One of them succeeds in locking the updated row, and the others all promptly start to wait for *him*. Etc.

I don't think this level of contention is common in real apps, but in pgbench with small -s it's a major factor. We could avoid this spinning if we had real per-row locks (so that procs wouldn't get awoken till they actually had lock on the desired row), but I see no very practical way to do that.

			regards, tom lane
Tom Lane wrote:

>Oh, pgbench ;-). Are you aware that you need a "scale factor" (-s)
>larger than the number of clients to avoid unreasonable levels of
>contention in pgbench?

No. What about adding a few reasonable examples to the README? I've switched to "pgbench -c 10 -s 11 -t 1000 test". Is that ok? Now the semop calls are virtually gone. That leaves the question of why SysV semaphores showed up high in the dbt2 benchmarks, but that's another question.

I'm back to my original idea: align the data buffers to speed up the user space/kernel space transfers. It looks good (with/without connection):

before:
105.031776 / 105.093682
105.201246 / 105.260008

after aligning:
112.664320 / 112.730542
111.031901 / 111.098496
111.685869 / 111.751130

Tested with 7.3.4. Initially I tried to increase MAX_ALIGNOF to 16, but the result didn't work; pgbench failed with:

<<<
ERROR: CREATE DATABASE cannot be executed from a function
createdb: database creation failed
<<<

For my test I've manually edited shmem and aligned all allocations to 16-byte offsets. I'll try to compile the 7.4 CVS tree; probably someone made wrong assumptions about the alignment values.

-- Manfred
Manfred Spraul <manfred@colorfullife.com> writes:
> ... Initially I tried to increase MAX_ALIGNOF to 16, but
> the result didn't work:

You would need to do a full recompile and initdb to alter MAX_ALIGNOF.

However, if you are wanting to raise it past about 8, that's probably not the way to go anyway; it would create padding wastage in too many places. It would make more sense to allocate the buffers using a variant of ShmemAlloc that could be told to align this particular object on an N-byte boundary. Then it costs you no more than N bytes in the one place.

(BTW, I wonder whether there would be any win in allocating the buffers on a 4K or 8K page boundary... do any kernels use virtual memory mapping tricks to replace data copying in such cases?)

			regards, tom lane
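To make the idea concrete, here is a minimal sketch of what such a variant might look like (ShmemAllocAligned is an invented name, and the ShmemAlloc declaration is only a stand-in, not the exact PostgreSQL signature): over-allocate by the alignment and round the result up.

#include <stddef.h>
#include <stdint.h>

extern void *ShmemAlloc(size_t size);   /* existing allocator, assumed */

/* Allocate 'size' bytes of shared memory aligned on an 'align'-byte
 * boundary (align must be a power of two). Wastes at most align-1
 * bytes, once, instead of padding every allocation. */
void *
ShmemAllocAligned(size_t size, size_t align)
{
    uintptr_t p = (uintptr_t) ShmemAlloc(size + align - 1);

    if (p == 0)
        return NULL;
    return (void *) ((p + align - 1) & ~(uintptr_t) (align - 1));
}

The shared disk buffers could then be requested once as, say, ShmemAllocAligned(NBuffers * BLCKSZ, 16) -- or with align = 4096 to try the page-boundary idea -- paying the padding cost a single time for the whole array.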
Tom Lane wrote:

>Manfred Spraul <manfred@colorfullife.com> writes:
>> ... Initially I tried to increase MAX_ALIGNOF to 16, but
>> the result didn't work:

>You would need to do a full recompile and initdb to alter MAX_ALIGNOF.

I think I did that, but it still failed. 7.4cvs works, so I'll ignore it. MAX_ALIGNOF affects the on-disk format, correct? Then I agree that it's the wrong thing to change.

>However, if you are wanting to raise it past about 8, that's probably
>not the way to go anyway; it would create padding wastage in too many
>places. It would make more sense to allocate the buffers using a
>variant of ShmemAlloc that could be told to align this particular object
>on an N-byte boundary. Then it costs you no more than N bytes in the
>one place.

I agree; I'll write a patch.

>(BTW, I wonder whether there would be any win in allocating the buffers
>on a 4K or 8K page boundary... do any kernels use virtual memory mapping
>tricks to replace data copying in such cases?)

Linux doesn't. Page table games are considered evil, because TLB flushing is expensive, especially on SMP.

-- Manfred
Manfred Spraul <manfred@colorfullife.com> writes:
> MAX_ALIGNOF affects the on-disk format, correct?

Right, it could affect placement of fields on-disk. I was thinking we could change it as an easy test, but maybe not ...

If you set up the shared buffers at an appropriate offset, that should get most of the possible win from aligning I/O transfers, but not all. It would also be worth looking at the local buffers (see src/backend/storage/buffer/localbuf.c). The trouble here is, if malloc returns something that's only 4-byte aligned, aren't we stuck?

There might also be some joy in making src/backend/storage/file/buffile.c align its buffers suitably, but again the issue is controlling malloc.

			regards, tom lane
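For what it's worth, the same over-allocate-and-round-up trick works for malloc'd buffers, so a weakly aligned malloc need not be a dead end. A sketch (illustrative only; the helper name is invented, and the caller must free the raw pointer, not the aligned one):

#include <stdint.h>
#include <stdlib.h>

/* Return a pointer into a malloc'd block aligned on an 'align'-byte
 * boundary (align must be a power of two). The raw pointer is passed
 * back through *raw_out so the block can be freed later. */
void *
malloc_aligned(size_t size, size_t align, void **raw_out)
{
    void *raw = malloc(size + align - 1);

    if (raw == NULL)
        return NULL;
    *raw_out = raw;
    return (void *) (((uintptr_t) raw + align - 1) & ~(uintptr_t) (align - 1));
}

posix_memalign would do this directly on platforms that have it; presumably its limited portability is exactly the "controlling malloc" problem.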